ecdl 2006 workshop report: the use of digital object repository systems in digital libraries (dorsdl)

d-lib magazine, october 2006, volume 12 number 10, issn 1082-9873

gert schmeltz pedersen, technical knowledge center of denmark, technical university of denmark
kostas saidis, department of informatics and telecommunications, university of athens
hans pfeiffenberger, alfred wegener institute for polar and marine research

the 1st european workshop on the use of digital object repository systems in digital libraries (dorsdl), held in conjunction with ecdl 2006, took place on september 21, 2006, in alicante, spain. the workshop was attended by 50 people (including speakers) from 10 countries.

during recent years, digital object repositories (dors) have gained significant attention from the digital libraries / digital publishing community. the dorsdl workshop's goal was to investigate the use of digital object repositories in the context of digital libraries. are dors becoming a standard component of digital libraries, or is the term (and the notion behind it) used as a replacement for "digital libraries"? depending on the approach, both positions can be defended. from this perspective, the workshop aimed to provide a foundation for discussion of the practical issues that come up in both kinds of systems, in order to foster further developments and collaborations between them. the full-day workshop comprised four sessions and a panel discussion.

during the first session ("new developments in repository systems"), carl lagoze (cornell university) talked about interoperability among scholarly repositories and the development of workflows across distributed information. scholarly communication is seen as a global workflow across distributed repositories, to be realized through richer cross-repository services, so that digital objects can be used and reused in many contexts. robert tansley (google) reported on the status of the dspace repository software and its community, and also described the china digital museum project as an application of dspace. he also discussed the directions of new developments in dspace, with emphasis on the data model challenges that are not yet addressed by the plethora of standards used in dspace.

the speakers in the second session ("applications and experiences") reported on various projects from around the globe. david groenewegen (monash university library) presented the arrow project, a consortial institutional repository solution that successfully combined open source and proprietary software. hans pfeiffenberger (alfred wegener institute for polar and marine research) reported on plankton-net, an ongoing project concerned with the taxonomy of plankton, and examined the real-world issues that arise when aggregating information from "old", pre-existing databases or repositories. challenges range from standardization of protocols to meeting new "ad-hoc" competition, such as flickr, while striving for long-term preservation and scientific quality control. also, some lessons about semantics can be inferred from the quite old science of naming biota. the third talk, given by diane berkovits (european organization for nuclear research), discussed managing institutional repositories with the cern document server digital library software.
bo alro (atira a/s) presented the pure service as "becoming a central piece in the research administration and dissemination efforts of most of the danish universities". although not open source, pure supports a number of standard interfaces to its content. emphasis was put on supporting the processes and workflows in the universities. interestingly, the organization of the development process through an inter-university board has led to homogenization of processes at the universities. finally, jonathan crabtree (university of north carolina) presented the data-pass project, a partnership led by an inter-university consortium that aims to ensure the long-term preservation of the holdings and materials of various universities.

the "services" session focused on the development of services in dors. judith wusteman (university college dublin) presented ojax, an ajax-based metasearch framework that facilitates dynamic and interactive access to repository metadata. gert schmeltz pedersen (technical university of denmark) talked about the development of services in the fedora service framework: a generic search service will allow various search engines to be plugged into fedora repositories, and a peer-to-peer service will utilize results from the alvis eu project in the context of fedora repositories.

in the "foundations and infrastructure" session, matthias razum (fiz karlsruhe) discussed the advanced requirements posed by the escidoc project with regard to the versioning of digital objects in a fedora-based repository. he concluded by proposing a whole-object versioning approach that builds upon fedora's datastream versioning. kostas saidis (university of athens) presented a brief overview of digital object prototypes (dops), the first framework that provides type-consistent digital objects, and discussed the benefits of a repository-independent implementation of dops; he also announced that the framework would be released as open source in the following months. in the last talk of the session, paolo manghi (istituto di scienza e tecnologie dell'informazione "alessandro faedo" – cnr) argued for the advantages of a typed data model in dors, using an example drawn from relational database systems. he concluded by outlining the future development of a typed opendlib repository service.

at the end of the day, a panel discussion entitled "requirements for next generation digital library systems" took place, with the following participants: donatella castelli (istituto di scienza e tecnologie dell'informazione "alessandro faedo" – cnr), carl lagoze, robert tansley, matthias razum and kostas saidis. a number of interesting issues came up during the workshop, including interoperability, the use of standards, versioning strategies and digital object types. the panelists first discussed the digital object notion in a general fashion, elaborating on how a digital object depends on the context in which it is represented and manipulated. dors are systems developed to handle digital objects, and a lot of work remains to be done to provide a generic solution that suits the needs and requirements imposed by each digital library development. if we consider digital objects as general-purpose named information units, we arrive at a description of digital artifacts that is richer than the "digital document" notion, and this richer concept assists us in developing richer services.
copyright © 2006 gert schmeltz pedersen, kostas saidis, and hans pfeiffenberger
doi:10.1045/october2006-pedersen

collected wisdom: some cross-domain issues of collection level description

d-lib magazine, september 2000, volume 6 number 9, issn 1082-9873

paul miller, interoperability focus, united kingdom office for library and information networking (ukoln), [email protected]

as the articles in this september 2000 issue of d-lib magazine illustrate, the desire to describe collections of material is evident across all our memory institutions: libraries, museums, archives, and beyond. the importance of "the collection", however, and the manner by which it is most often described, differs.

the collection is king

in the next article in this issue, sweet and thomas, for example, demonstrate the fundamental importance of collection description to the explanation of archival resources. in this sector, the large quantities of often similar material have historically made it impractical to catalogue item-level descriptions, and archival resources have more commonly been described at the level of the archive — or collection — itself. thus, the archive of a famous person might be listed as comprising 230 personal letters, 11 photographs, and two diaries. most of the other information associated with the archive would be about the person him/herself, rather than about the items physically contained within the archive. in their article, sweet and thomas describe the ways in which modern technology, specifically the web and the encoded archival description (ead), allow today's archives to move past this traditional practice and towards description of the individual items for which users so often search.

a bag full of stuff

for museums, too, the notion of the collection has long been important. indeed, the very structure of a museum is based around collections and their curators, with the coin collection, the ceramic collection, the butterfly collection, and so on. the name given by museums to their equivalent of the library catalogue, too, reflects this viewpoint, as museums have collection management systems. amongst the many other meanings given to a "collection" by museums, the most significant in many cases is, perhaps, the notion of a benefactor's collection; thus, the holdings of a museum may largely consist of donated bodies of material, each spanning many or all of the subject collections of a museum, but considered as a whole. examples are the burrell collection, donated to the city of glasgow and now housed in a purpose-built structure, and the guggenheim collection at the core of the guggenheim museums' current holdings. in the third article in this magazine, dunn looks at museum approaches to collection level description, illustrating this with practical examples from the aggregation work of the canadian heritage information network (chin).

views from the library

in the world of libraries, the collection has not traditionally played such a key role. the notion of collection is certainly present (for example, in the special collections of unusual or significant material), but the underlying structure of library information tends to follow a different form.
with the current interest in building digital libraries, attention is focussing upon new directories describing collections of resources. in the fourth article in this issue of d-lib, pearce outlines some of the initiatives underway to enhance the functionality of these services, and introduces the context within which they might operate. in the next article, brack, palmer and robinson, too, discuss work in this area, focussing upon the development and deployment of the collection level description standard of the uk's electronic libraries programme (elib).

transcendence

with the growth of web-based resources, the various experiences from each sector become increasingly relevant, both within the sectors concerned and for communities newly introduced to the challenges of describing large and diverse bodies of material, who stand to learn a great deal from the three curatorial traditions. in the final article in this issue, powell, heaney and dempsey introduce the important oclc-funded work on modelling the nature of collections, and its realisation within the tools of the research support libraries programme (rslp) collection description project. although developed from a library perspective, and subsequently extended to archives within the rslp work, the model has a great deal to offer museums and other information managing communities.

towards a collective view

as the quantity of material available online continues to grow, and services aiming to describe these resources in any meaningful fashion fall further and further behind, the archival community's traditional dependence upon collection level descriptions as a means of managing quantity becomes relevant once more. in this content-rich online environment, collection level descriptions fulfill important purposes. firstly, they serve to provide relatively superficial overviews for large bodies of otherwise uncatalogued material. secondly, they play an important role in reducing the quantity of material returned in the initial response to a broadcast query across multiple services. in this second scenario, where a user might be searching several repositories of high quality resource descriptions (a number of library opacs, a museum catalogue, and a collection of archival finding aids, for example), the collection level description serves to prevent the return of too many hits by grouping the returned results in various ways, such that numerous examples of one resource do not obscure important, but less numerous, returns. in a variant of this role, collection descriptions may also be used during the process of selecting those services to search in the first place. here, the notion of collection strengths utilised by conspectus and the like comes to the fore, with the collection descriptions containing information on the materials most likely to be found within a given collection, such that the national library of australia, for example, might be excluded from a search across a number of libraries for items of canadiana.
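the selection role just described is easy to make concrete. the sketch below is purely illustrative: the collection-level records, their declared strengths, and the field names are invented for the example and do not follow the rslp model or any other real schema, but the logic, consulting collection descriptions before broadcasting a query, is the one described above.

```python
# purely illustrative: invented collection-level descriptions with a crude
# statement of collecting strengths, used to decide which services to query.
from typing import List

COLLECTIONS: List[dict] = [
    {"service": "national library of australia", "strengths": {"australiana", "pacific history"}},
    {"service": "library and archives canada",   "strengths": {"canadiana", "arctic exploration"}},
    {"service": "example museum catalogue",      "strengths": {"ceramics", "numismatics"}},
]

def select_services(topic: str, collections: List[dict]) -> List[str]:
    """Return only the services whose declared collection strengths cover the topic."""
    return [c["service"] for c in collections if topic in c["strengths"]]

if __name__ == "__main__":
    # a broadcast search for canadiana skips the australian collection entirely
    print(select_services("canadiana", COLLECTIONS))   # ['library and archives canada']
```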
museums, archives and libraries have emerged from very different traditions, yet are increasingly being asked to maximise exploitation of their similarities in working together to deliver content to the user. all three have a shared heritage of considering the grouping of their materials into collections, albeit along notably different lines. nevertheless, the experiences gained in describing and managing diverse collections of physical material appear highly relevant to today's digital resources. the three sectors stand to benefit from closer examination of practices amongst themselves if they are to adapt most effectively to the new challenges that they face.

copyright © 2000 paul miller
doi: 10.1045/september2000-miller

"earth system science data" (essd) — a peer reviewed journal for publication of data

d-lib magazine, january/february 2011, volume 17, number 1/2

hans pfeiffenberger, alfred wegener institut, germany, hans.pfeiffenberger@awi.de
david carlson, unavco, usa, ipy.djc@gmail.com

doi:10.1045/january2011-pfeiffenberger

abstract

in 2008, essd was established to provide a venue for publishing highly important research data, with two main aims: to provide reward for data "authors" through fully qualified citation of research data, and to align this reward, classically, with the certification of quality of a peer reviewed journal. a major step towards this goal was the definition and rationale of article structure and review criteria for articles about datasets.

introduction

much has been said about the need to make research data available [1]. meanwhile, this insight has found its way into policies set by research-governing [2] and funding bodies at the highest levels, and even into high-level papers commissioned and authored [3] by the european commission. this short paper can only briefly discuss, in its first section, how scientists can be brought to actually publish their data in a meaningful way. in short, the point is "reward", or rather: recognition of data publishing as an academic achievement. lately, additional emphasis has been placed on quality-assured data [4] (as well as reliable data repositories / data libraries). this topic is addressed in the second section, on peer review of data as one means of making sure that other scientists can re-use published data reliably and "economically", i.e. without duplicating effort.

rewarding scientists who publish data

hesitation still prevails among data creators, who feel themselves to be owners of the data: what do they have to gain, and what would they lose by publishing their hard-won data now? this is, of course, the dilemma faced by early scientists, who, until the 17th century, tended to accumulate their new findings until they were worth a book. as mabe [5] richly illustrates, this dilemma was solved brilliantly in one sweep: in 1665 oldenburg, the first editor of the philosophical transactions of the royal society, introduced all the essential elements of a modern scientific journal, including peer review and establishing priority. it is mabe's theme, and in fact most telling, that despite all the imaginable technical freedoms and added value possible in online publishing, the essential form and elements of recognized scientific publication have not changed at all.

making data — technically — citable has been a theme for a number of years. to name a few examples, the claddier [6] project and green [7] explored how a citation of a dataset should be derived from the parties involved and from the (de-)composition of data elements actually used in an article citing the data.
datacite, the agency that assigns doi® names to datasets, derives from the project std-doi, which explored this agenda as early as 2004 [8]. although technical and formal means to cite data thus exist, a wave of data publishing has not happened. this is frequently attributed to a restrictive "culture" or tradition in science: what to cite, not just how to cite. one must not look down on this conservative behaviour; it is a matter of survival for an author, and even of the credibility of science, to cite only reliable sources. regarding reliability, the projects and methods mentioned provide stability and precision of reference to (specific parts of) datasets — but not certification of their content.

the journal earth system science data (essd) aspires to provide datasets with this missing element (from the phil. trans. set). it will not replace data repositories, since data will not be stored or made accessible by the publisher (not to speak of preservation); rather, the editors require the dataset to reside in a reliable repository and to have a stable way of linking to it. that is, for all practical purposes: we wish to see dois for datasets (but reserve the right to lower this barrier, for the time being).

a future place of essd or other data publishing journals in the workflow and infrastructure for publishing a scientist's new results is depicted in figure 1.

figure 1: ideal sequence and venues of submission and citation of scientific information, with essd. (for details, see text.)

it would of course be ideal if those data on which a ("traditional") article — with interpretation of the data — relies were published before submission of the article (so that anonymous peer review can take place). essd enables the scientist to do this without giving up on receiving a reward when someone else publishes his or her own conclusions from this dataset. if the author, in her or his "traditional" article, cites the dataset as a reference to a journal article in essd, any other scientist ("data re-user") would practically and morally be compelled to do likewise. publishing data in essd requires additional effort over submitting it to a reliable repository. however, the potential impact should in many cases be much higher than that from traditional papers, neutralizing the perceived need to keep the data under lock, for personal use only.
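as a small aside on what a stable, doi-based link makes possible in practice: once a dataset or data article carries a doi, citation metadata for it can usually be retrieved mechanically. the sketch below is a generic illustration rather than part of the essd workflow itself, and it assumes that the doi's registration agency supports http content negotiation at the doi.org resolver; it uses the doi of the essd article cited as [12] in this article's reference list.

```python
# illustrative sketch: ask the doi resolver for a formatted citation via http
# content negotiation (assumes the registration agency supports this).
import urllib.request

def fetch_citation(doi: str, accept: str = "text/x-bibliography") -> str:
    req = urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": accept},   # e.g. "application/x-bibtex" for bibtex output
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    # doi of the first essd data article, cited as [12] in the reference list
    print(fetch_citation("10.5194/essd-1-1-2009"))
```

for dois whose agency does not offer content negotiation, the same request would simply return the landing page, so a real harvester would need a fallback; the point here is only that a doi gives both readers and software one stable place to ask.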
and that "supplemental material also undermines the concept of a self-contained research report..." authors of this journal "will be allowed to include a footnote with a url that points to supplemental material on a site they support and maintain, together with a brief description..." this radical position, obviously oriented at first principles of scientific communication, may actually make it easier to understand that we need to distinguish between two reasons — and associated modalities — to provide data: the first and most prevalent reason is to underpin an article with additional evidence. this would more typically be a dataset as small as possible, an excerpt or derived values only. it is this kind of data to which the "radical" position refers. if, however, a potentially large number of articles are expected to rely on a (comprehensive or exceptional) dataset — also known as re-use of data — there is no way around the need to make sure, as far as possible, that this dataset itself is reliable. there may be communities of practise, e.g., in remote sensing or monitoring of environmental data, which work by established practises of documenting, testing and calibration of instruments, complemented by methods of (semi-)automatic validation of results. as long as those instruments and methods are operated by experienced, professional staff it may suffice for quality assurance to affirm just that, and by making all necessary documentation available. one might think of a priori quality assurance, here. however, especially in pure research, there are many innovative and evolving and therefore less thoroughly documented and tested methods, which nevertheless produce substantial results, i.e., valuable data. it is this subset, which needs to be subject to quality assurance a posteriori. how can this, to put it loosely, "quality assessment with somewhat incomplete and/or ingenious documentation/proof" be done? one "obvious" answer is: peer review, a method already practised and reasonably well understood by the parties involved.   review criteria and article structure the editors of essd therefore believe that authors as well as all topical editors and reviewers will be quite confident in interpreting and applying the review criteria available from the journals' website [11]. deliberately, the structure and wording has been adapted or even copied from traditional journals' criteria, so that just a mapping is needed of what these words could mean in the context of data. some of these mappings are straightforward ("uniqueness" becomes "originality") others, possibly not. as a speciality in data publishing, it would be much too easy to mass-produce articles, based on data series. we tried to capture a criterion to counter this under the slightly awkward label "completeness". this, together with other criteria should result in the better-known "significance". many of these well-known criteria are simply an instrument to remove redundancy or other "noise" from the body of articles published, thus reducing the effort of reading and digesting the information received. possibly even more important, because this is the bottleneck in current scientific communication, it also cuts the workload of reviewers. of course, the paragraphs on "data quality" are the most important part of the criteria. first, there is a seemingly trivial requirement for the data published to be "easily accessible in a useful format" — which, in real life, is neither trivial nor typical. 
this requirement is not only important for later re-use, but, again, even more for the work of the reviewers: they must be able to concentrate on the core of scientific data quality. within the article "claimed accuracy, the instrumentation employed and methods of processing should reflect the 'state of the art' or 'best practises' ". the reviewers will employ their best tools and competence to make sure that the claims are, at least, plausible and consistent. to facilitate this, there is a manuscript template, which requires from the authors sections on instruments and provenance, which should present and support claims of accuracy and circumstances of data acquisition or processing which might be significant in order to judge quality and reliability of the data published. (for a nice example, see the very first article in essd [12].)   conclusion today, quality-related information is frequently not present in data or metadata, rendering both practically useless. essd requires it and adds reliability through the scrutiny of peers. both are needed so that future users of these data can build on them, sceptically as always, but confidently. the criteria and methods of essd will not be directly applicable to all types of data — for practical as well as intellectual reasons. however, it is the vision of founders and editors that the added value it provides to datasets will help foster substantial data publishing, re-use and mandatory citation of data in a subset of this journal's domain, earth system sciences.   acknowledgements hannes grobe and manfred reinke from awi contributed to the initial discussions about complementing doi-referenced datasets in pangaea with peer reviewed articles. arne richter and martin rasmussen from copernicus publications encouraged and helped establish the journal. the review criteria themselves were drawn up together with sünje dallmeier-tiessen at awi.   references [1] data's shameful neglect, nature 461, 145, 2009. doi:10.1038/461145a [2]alliance of german science organizations, principles for the handling of research data. http://www.allianzinitiative.de/en/core_activities/research_data/data_policy/ [3] high-level expert group on scientific data (web page), http://cordis.europa.eu/fp7/ict/e-infrastructure/high-level-group_en.html, and references therein [4] eurohorcs-esf task force, eurohorcs and esf vision on a globally competitive era and their road map for actions, 2009. http://www.eurohorcs.org/sitecollectiondocuments/esf_road%20map_long_0907.pdf [5] michael a. mabe, the more things change, the more they stay the same... – why digital journals differ so little from paper. third bloomsbury conference on e-publishing and e-publications, 2009, http://www.ucl.ac.uk/infostudies/e-publishing/e-publishing2009/1b-mabe.ppt [6] brian matthews, katherine bouton, jessie hey, catherine jones, sue latham, bryan lawrence, alistair miles, sam pepler, katherine portwin, cross-linking and referencing data and publications in claddier, proc. uk e-science 2007 all hands meeting, 10-13 sep 2007. http://epubs.cclrc.ac.uk/work-details?w=37696 [7] toby green, we need publishing standards for datasets and data tables, oecd publishing white paper, oecd publishing, 2009. doi:10.1787/603233448430 [8] jan brase, using digital library techniques – registration of scientific primary data, lecture notes in computer science 3232, 488-494, springer, 2004. doi:10.1007/b100389 [9] emma marris, should journals police scientific fraud?, nature, 439, 520-521, 2006. 
doi:10.1038/439520a [10] john maunsell, announcement regarding supplemental material, journal of neuroscience, 30(32):10599-10600, 2010. http://www.jneurosci.org/cgi/content/full/30/32/10599 [11] essd review criteria, http://www.earth-system-science-data.net/review/ms_evaluation_criteria.html [12] gert könig-langlo, hartwig gernandt, compilation of ozonesonde profiles from the antarctic georg-forster-station from 1985 to 1992, earth syst. sci. data, 1, 1-5, 2009. doi:10.5194/essd-1-1-2009

about the authors

hans pfeiffenberger is head of it infrastructure at the alfred wegener institut for polar and marine research (awi) and speaker of the helmholtz association's open access working group, where he specializes in access to data. dr. pfeiffenberger represents helmholtz' interest in access to data in various bodies, such as the priority initiative "digital information" by the alliance of german science organizations and the alliance for permanent access (apa). in 2008, dave carlson and he established essd. he holds a phd in physics.

david carlson directed the international programme office for the international polar year. ipy, with more than 50,000 participants from 60 nations, covered a wide range of science topics at a critical time for polar regions. dr. carlson has devoted more than 15 years to guiding and managing large international science programmes, starting from the very large tropical ocean global atmosphere programme in 1992 and 1993. he holds a phd in oceanography and led successful research teams focused on upper ocean physics and chemistry, oceanic microbiology and carbon cycling, and marine chemical ecology. dr. carlson now serves as science communication director for the non-profit geodesy consortium unavco in boulder, colorado.

copyright © 2011 hans pfeiffenberger and david carlson

border crossings: reflections on a decade of metadata consensus building

d-lib magazine, july/august 2005, volume 11 number 7/8, issn 1082-9873

stuart l. weibel, senior research scientist, oclc research

in june of this year, i performed my final official duties as part of the dublin core metadata initiative management team. it is a happy irony to affix a seal on that service in this journal, as both d-lib magazine and the dublin core celebrate their tenth anniversaries. this essay is a personal reflection on some of the achievements and lessons of that decade.

the oclc-ncsa metadata workshop took place in march of 1995, and as we tried to understand what it meant and who would care, d-lib magazine came into being and offered a natural venue for sharing our work [16]. i recall a certain skepticism when bill arms said "we want d-lib to be the first place people look for the latest developments in digital library research." these were the early days in the evolution of electronic publishing, and the goal was ambitious. by any measure, a decade of high-quality electronic publishing is an auspicious accomplishment, and d-lib (and its host, cnri) deserve congratulations for having achieved their goal. i am grateful to have been a contributor.

that first dc workshop led to further workshops, a community, a variety of standards in several countries, an iso standard, a conference series, and an international consortium. looking back on this evolution is both satisfying and wistful.
while i am pleased that the achievements are substantial, the unmet challenges also provide a rich till in which to cultivate insights on the development of digital infrastructure. the achievements when we started down the metadata garden path, the term itself was new to most. the known web was less than a million pages, people tried to bribe their way into sold-out web conferences, and the term 'search engine' was as yet unfamiliar outside of research labs. the oclc-ncsa metadata workshop brought practitioners and theoreticians together to identify approaches to improve discovery. in two and a half days, an eclectic gang of 52 (we affectionately described ourselves as 'geeks, freaks, and people with sensible shoes') brought forward a core element set upon which many resource description efforts have since been based. the goal was simple, modular, extensible metadata – a starting place for more elaborate description schemes. from the thirteen original elements we grew to a core of fifteen, and later elaborated the means for refining those rough categories. in recent years much work has been done on the modular and extensible aspects, as application profiles have emerged to bring together terms from separate vocabularies [9]. a consensus community the workshop series coalesced as a community of people from many countries and many domains, drawn by the appeal of a simple metadata standard. openness was the prime directive, and early progress was often marked by the contentious debate of consensus building. but our belief that value would emerge from many voices informed our deliberations, and still does. not without difficulty: in one early meeting, participants spent an hour of scarce plenary time talking about type before realizing that the librarians and the computer scientists had been talking about completely different concepts. crossing borders is often difficult. this open, inclusive approach to problem solving helped the dublin core community to frame the metadata conversation for the past decade. the dublin core brand has been for some years the first link returned for the google search term "metadata", and for a time, it outranked all other results for the search "dublin" (as of this writing, it is #6). with only moderate irony, we might say "i feel lucky!" process as a workshop series evolved into a set of standards and a community, the need for rules and governance evolved as well. dcmi developed a process for evaluating proposed changes and bringing them into conformance with the overall standard [5]. the dcmi usage board is comprised of knowledgeable, experienced metadata experts from five countries who exercise editorial guidance over the evolution of dcmi terms and their conformance with the dcmi abstract model [13]. this model itself is among the most important of the achievements of the initiative, representing as it does the convergence of theory and practice over a decade of vigorous debate and practical implementation. it emerged from early intuition and experience, informed by an evolving sense of grammatical structure [2,6] and further refined by a long co-evolution with the w3c's resource description framework (rdf) and the semantic web. at a higher level, dcmi has a board of trustees [1], who oversee operations and do strategic planning, and an affiliate program and governance structure that distributes the cost of the initiative and assures that the needs of stakeholders are accommodated [3]. 
at the time of this writing, there are four national dcmi affiliates and several more in discussion. internationalization the global nature of the web demands commitment to internationalization. the difficulties of achieving system interoperability in multiple languages are immense, and still only partially solved (anyone used iris recently?). nonetheless, dcmi has succeeded in attracting translations of its basic terms in 25 languages and offers a multilingual registry infrastructure of global reach [14]. the venues for the workshops and conferences have been chosen to make the initiative accessible to people in as many places as possible. workshops and conferences are held in the americas, europe, and austral-asia on a rotating basis, and dublin core principals have given talks on every continent save antarctica. this policy of international inclusion has been a philosophic mainstay for the initiative, attracting long-term participation from around the world. where we were confused confusions and unmet challenges are both interesting and instructive. a few of these are historical curiosities, and interesting mostly as a source of wry humility. others represent unsolved dilemmas that remain prominent challenges for the metadata world in general. author-created metadata the idea of user-created metadata is seductive. creating metadata early in the life cycle of an information asset makes sense, and who should know the content better than its creator? creators also have the incentive of their work being more easily found – who wouldn't want to spend an extra few minutes with so much already invested? the answer is that almost nobody will spend the time, and probably the majority of those who do are in the business of creating metadata-spam. creating good quality metadata is challenging, and users are unlikely to have the knowledge or patience to do it very well, let alone fit it into an appropriate context with related resources. our expectations to the contrary seem touchingly naïve in retrospect. the challenge of creating cost-effective metadata remains prominent. as erik duval pointed out in his dc-2004 keynote, 'librarians don't scale' [7]. we need automated (or at least, hybrid) means for creating metadata that is both useful and inexpensive. what is metadata for? another naïve assumption was that metadata would be the primary key to discovery on the web. while one may quibble about the effectiveness of unstructured search for some purposes, it is the dominant idiom of discovery for web resources, and may be expected to remain so. what then, is metadata for? there are many answers to this question, though given the high stakes in the search domain, expect these answers to shift and weave for the foreseeable future. searching the so-called 'dark web' remains a function of gated access, and metadata is a central feature of such access. one might simply say – harvest and index. oclc's exposure of worldcat assets in search engines such as google and yahoo is exemplary of this approach [11]. indexed metadata terms connect users to the location of the physical assets via holdings records, but it is reasonable to ask... would simple, full-text indexing of these assets be better still? we may argue the fine points today but in the future, we'll know the answer, for the day of digitization is fast upon us. structured metadata remains important in organizing and managing intellectual assets. the canadian government's approach to managing electronic information illustrates this strategy [4]. 
metadata becomes the linkage relating content, legislative mandates, reporting requirements, intended audience, and various other management functions. one does not achieve this sort of functionality with unstructured full text. the international press telecommunications council is exploring embedding dublin core in their new generation of news standards [17]. no domain is more digitally now than this one. if you want to know the value of structured metadata, look to the requirements and business cases in such communities [10]. similarly, in the management of intellectual property rights, well-structured data is essential, and as these requirements become ubiquitous, the creation and management of metadata will be central to the story. metadata for images is a critical use. association of images with text makes them discoverable. when the asset is a stand-alone image, metadata is the primary avenue by which they can be accessed. picture australia is an early and enduring (and widely copied) model in this area, showing how a photo archive can become a primary cultural heritage asset through the addition of systematic search tools and web accessibility [12]. there is much talk of taxonomies, their strengths, and deficiencies these days and in fact the emergence of 'folksonomies' hints at a sea change in the use of vocabularies to improve organization and discovery [9]. the dublin core community has struggled with the role of controlled vocabularies, how to declare and use them, and how important (or impotent?) they might be. the notion that uncontrolled vocabularies – community-based, emergent vocabularies – might play an important role in aggregation and discovery occasions a certain discomfort for those schooled in formal information management. whether it is just the latest fad, or an important emerging trend, remains to be seen. a major unmet challenge entropy is an arrow. in the absence of constant care and fussing, our greatest successes break down. failures, however, remain potent without much attention, retaining their power to impede. one of the yet-unsolved problems in the metadata community is the railroad gage dilemma. the first editor of d-lib, amy friedlander, introduced me to the notion of train gages as metaphor for interoperability challenges [8]. last year i rode that metaphor from beijing to ulan bator, mongolia. a cursory knowledge of asian history reminds us that relations between mongolia and china have been less-than-cordial from time to time, and this history remains manifest at the gobi border crossing today. in the dark of night, the beijing train of the trans-siberian railway pulls into a longhouse of clanking and switching as the entire train is raised on hydraulic jacks. chinese bogeys (wheel carriages) are rolled out, and mongolian bogeys of a different gage are rolled in. border guards with comically high hats (and un-comical sidearms) work their way through the train cars in the manner of border guards everywhere. after a couple of hours, the train is rolling through the gobi anew. it is a fascinating display of technological diplomacy – a kind of maginot line that helps those on both sides of the border sleep better. these images belong to a bogart movie or a clancy novel, but their abstraction pervades the metadata arena. stacked bogeys, ready to be rolled into use. photo by stuart weibel. a railroad car raised on one of dozens of hydraulic jacks that raise an entire train at once for the exchange of bogeys. photo by stuart weibel. 
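the bogey-swapping image maps directly onto day-to-day metadata practice, as the next paragraph spells out. a minimal, purely hypothetical sketch makes the point concrete: the element names on the input side are genuine dublin core, but the target schema, its field names, and the mapping table below are invented for illustration and stand in for whatever crosswalk a real pair of standards would require.

```python
# illustrative crosswalk: repackage a simple dublin core description into a
# differently shaped, invented target record, losing whatever has no mapping.
DC_TO_TARGET = {            # hypothetical mapping table: the "hydraulic jack"
    "title":   "main_title",
    "creator": "responsible_party",
    "date":    "date_issued",
    "subject": "topic",
    # dc elements such as "coverage" or "relation" have no slot in this
    # invented target schema and are silently dropped at the border.
}

def crosswalk(dc_record: dict) -> dict:
    """Map a flat Dublin Core record onto the invented target schema."""
    target = {}
    for element, value in dc_record.items():
        field = DC_TO_TARGET.get(element)
        if field is not None:
            target[field] = value
    return target

if __name__ == "__main__":
    record = {
        "title": "Emerging Infrastructure: The Growth of Railroads",
        "creator": "Friedlander, Amy",
        "date": "1995",
        "coverage": "United States, 19th century",  # lost in the crossing
    }
    print(crosswalk(record))   # no trace of "coverage" in the output
```

the interesting part is not the few lines of code but the mapping table: every pair of standards needs its own, and the semantics that fall outside it are exactly the broken semantics discussed below.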
we load our metadata into structures in one domain and when we cross borders we unload it, repackage it, massage it to something slightly different, and suffer a measure of broken semantics in the bargain. we're running on different gages of track, manifested in different data models, slightly divergent semantics, and driven by related, but meandering, often poorly-understood functional requirements. crosswalks are the hydraulic jacks – quieter, but no more efficient than the clanking and grinding in the train longhouse. metadata standards specify the means to make (mostly) straightforward assertions about resources. many of these assertions are as simple as attribute-value pairs. others are more complex, involving dependencies or hierarchies. none are so complicated that they cannot be accommodated within a common formal model. yet we do not have such a model in place. why? nih (not invented here) syndrome is often blamed for disparities that emerge in solutions from separate domains targeted at similar problems. certainly our propensity to like our own ideas better than those of others plays a role, but my view is that it is not such a large role. developments take place in parallel. it is unusual to have the luxury of waiting to see how another group is approaching a particular problem before tackling it yourself. it is quite hard enough to know what is happening in one's own community, let alone to follow related developments in others, whose differences in terminology obscure what we need to know. the functional requirements of various metadata standards are often ambiguous and always focused slightly differently. dcmi focuses on simple, extensible, high-level metadata. ieee lom (learning object metadata) also concerns itself with discovery metadata, but focuses more strongly on educational process descriptors. mpeg is about media, where technical image metadata is central, and intellectual property rights management is crucial. mods is grounded firmly in the legacies of marc (and the world's largest installed base of resource discovery systems). the cost of collaboration – in intellectual as well as financial terms – is high. people have to know and trust one another, which generally requires face-to-face engagement: transporting ourselves and our ideas to other time zones, surviving frequent-flyer-flues, finding the means to support travel costs, and missing baseball games of our children. the problems are more complicated than we imagine at the outset. the recent approval of the dublin core abstract model by dcmi is the culmination of a journey that began almost at the outset of the initiative. early attempts, under the guise of the dc data model working group, rank among my most contentious professional experiences. to borrow from the oldest joke of the dismal profession, put all the data modelers in the world end to end, and you won't reach a conclusion (we did, but it took ten years to manage it). the idea of achieving similar consensus across communities with their own legacies of such conflict is daunting in the extreme, though recent discussions on this topic with colleagues in another metadata community remind me that hopefulness and optimism are as much a part of our domain as contention [18]. collaboration and consensus in the digital environment the web demands an international, multicultural approach to standards and infrastructure. the costs in time and treasure are substantial, and the results are uncertain. 
paying for collaboration that spans national boundaries, language barriers, and the often-divergent interests of different domains is a major part of these challenges. doing this while sustaining forward progress and attracting a suitable mix of contributors, reviewers, implementers, and practitioners, is particularly difficult. a recent presentation by google's adam bosworth, referenced in the blandiose blog [15], makes for provocative reading for those debating the costs and benefits of heavy-weight versus light-weight standards. the tension between these approaches sharpens designers and practitioners (and especially, entrepreneurs), to the eventual benefit of users. any standards activity ignores this balancing act at its peril. as we try to foment change and react to it at once, we are like escher's hands – designing the future as it, in turn, designs us... except that there are often implements other than pencils in those hands. ever try explaining what you do for a living to your mother? in the internet standards arena, conveying an appropriate balance of glee, terror, satisfaction, frustration, and pure wonder is no easy task. i just tell her i'm not a real librarian, but i play one on the internet. it seems enough. acknowledgements i wish to acknowledge my personal debt to uncountable colleagues in the dublin core community, and my deep sense of gratitude for the opportunity to have played the role i have. the patience, forbearance, and generosity of the support of oclc management in supporting my efforts and dcmi in general, have been singular and essential. thomas baker reviewed and improved this manuscript with several insightful suggestions. amy friedlander and bonnie wilson, successive editors of d-lib, have made me look better than i am in these pages for 10 years. congratulations to them and to all who have helped make this journal (and its authors) what they are. references and notes [1] about the initiative dcmi website, accessed june 23, 2005 . [2] baker, thomas "a grammar of dublin core" d-lib magazine, october 2000 volume 6 number 10 . [3] dcmi affiliate program dcmi website, accessed june 23, 2005 . [4] committee of federal metadata experts metadata action team, council of federal libraries. government of canada metadata implementation guide for web resources 3rd edition july 2004 . [5] dcmi usage board dcmi usage board mission and principle dcmi website, june 11, 2003 . [6] dcmi usage board dcmi grammatical principles dcmi website, 2003-11-18 . [7] duval, erik and wayne hodgins "making metadata go away: hiding everything but the benefits" keynote address at dc-2004 shanghai, china, october 2004 . [8] friedlander, amy emerging infrastructure: the growth of railroads infrastructure history series, cnri, 1995 . [9] mathes, adam folksonomies cooperative classification and communication through shared metadata computer mediated communication lis590cmc graduate school of library and information science, university of illinois urbana-champaign. december 2004 . [10] news architecture version 1.0 metadata framework business requirements iptc standards draft, 2005 . [11] open worldcat program oclc website, accessed june 23, 2005 . [12] picture australia hosted by the national library of australia website accessed june 23, 2005 . [13] powell, andy; mikael nilsson, ambjörn naeve, and pete johnston. dcmi abstract model. dcmi website, 2005-03-07 . 
[14] wagner, harry and stuart weibel, "the dublin core metadata registry: requirements, implementation, and experience", journal of digital information, accepted for publication, may 2005. [15] "web of data", blandiose blog, 2005-04-21. [16] weibel, stuart, metadata: the foundations of resource discovery, d-lib magazine, july 1995, volume 1, number 1. doi:10.1045/july95-weibel [17] wolf, misha, dc in xhtml2, semantic web and dc-general mailing lists, june 7, 2005. [18] the author has been party to discussions with erik duval and wayne hodgins of the ieee lom effort centered around the possibility of cross-standard data modeling that might promote convergence among various metadata activities. the means and methods for carrying such work forward are presently undetermined.

copyright © 2005 oclc online computer library center, inc.
doi:10.1045/july2005-weibel

the nsf national science, technology, engineering, and mathematics education digital library (nsdl) program: new projects and a progress report

d-lib magazine, november 2001, volume 7 number 11, issn 1082-9873

lee l. zia, division of undergraduate education, national science foundation, arlington, va 22230, [email protected]

introduction

the national science foundation's (nsf) national science, technology, engineering, and mathematics education digital library (nsdl) program comprises a set of projects engaged in a collective effort to build a national digital library of high quality science, technology, engineering, and mathematics (stem) educational materials for students and teachers at all levels, in both formal and informal settings. by providing broad access to a rich, reliable, and authoritative collection of interactive learning and teaching resources and associated services in a digital environment, the nsdl will encourage and sustain continual improvements in the quality of stem education for all students, and serve as a resource for lifelong learning. though the program is relatively new, its vision and operational framework have been developed over a number of years through various workshops and planning meetings [1-6].

the nsdl program held its first formal funding cycle during fiscal year 2000 (fy00), accepting proposals in four tracks: core integration system, collections, services, and targeted research. twenty-nine awards were made across these tracks in september 2000. brief descriptions of each fy00 project appeared in an october 2000 d-lib magazine article [7]; full abstracts are available from the awards section of the program site. in fy01 the program received one hundred-nine proposals across its four tracks, with the number of proposals in the collections, services, and targeted research tracks increasing to one hundred-one from the eighty received in fy00. in september 2001 grants were awarded to support 35 new projects: 1 project in the core integration track, 18 projects in the collections track, 13 in the services track, and 3 in targeted research.
two nsf directorates, the directorate for geosciences (geo) and the directorate for mathematical and physical sciences (mps) are both providing significant co-funding on several projects, illustrating the nsdl program's facilitation of the integration of research and education, an important strategic objective of the nsf. thus far across both fiscal years of the program fifteen projects have enjoyed this joint support. following is a list of the fy01 awards indicating the official nsf award number (each beginning with due), the project title, the grantee institution, and the name of the principal investigator (pi). a condensed description of the project is also included. full abstracts are available from the awards section at the nsdl program site at . (grants with shared titles are formal collaborations and are grouped together.) the projects are displayed by track and are listed by award number. in addition, six of these projects have explicit relevance and application to k-12 education (indicated with a * below). six others clearly have potential for application to the k-12 arena (indicated with a ** below). the nsdl program will have another funding cycle in fiscal year 2002 with the next program solicitation expected to be available in january 2002, and an anticipated deadline for proposals in mid-april 2002. core integration track due 0127298, 0127308, and 0127520. collaborative project: core integration of the national smete digital library. institutions: university corporation for atmospheric research (ucar), cornell university, and columbia university. pis: dave fulker, bill arms, and kate wittenberg, respectively. this collaborative project is developing the key organizational and technical infrastructure to support the coordination and management of the digital library's distributed collections and services. three primary areas of effort are 1) engaging the community, 2) providing technology, and 3) operating core services. technical components include a flexible portal architecture, a central metadata repository, an open source tool kit for access to rich content, and a database for authentication and user profiles, all emphasizing openness and long-term evolution. support is also being provided for a very wide array of educational and library services, emphasizing in this initial phase: operation of the primary portal and specialized portals for the nsdl community and the nsdl partners, comprehensive information retrieval services to search for collections or individual items, and an optional service for user profiles and authentication. collections track due 0121390*. collection and dissemination of geoscience data and knowledge for the national smete digital library. institution: cornell university endowed. pi: dogan seber. this project is collecting solid earth science data sets and complementing them with a system of easy-to-use, interactive user tools to provide a dynamic and engaging learning environment for students at all levels ranging from k-12, to undergraduate, graduate, and lifelong learners. in addition, evaluation of the system for use in undergraduate and high school classes is underway. due 0121518. analytical sciences digital library. institution: university of kansas center for research inc. pi: ted kuwana. this award is establishing the analytical sciences digital library (asdl) to classify, catalog, link and reference information or discovery material pertinent to innovations in curricular development and supporting resources in the analytical sciences. 
due 0121540. a digital library of ceramic microstructures. institution: university of dayton. pi: mattison ferber. the goal of this project is to create a digital library of microstructures for functional ceramics with emphasis on materials used for structural, electronic, and thermal applications in undergraduate and graduate science and engineering education. due 0121545. green's functions research and education enhancement network (green). institution: kent state university. pi: gregory shreve. to facilitate research and education in green's functions and their applications, this project is developing an education-oriented digital library focusing on green's functions and their applications to serve undergraduate, graduate and professional education user communities and support green's functions research and application development in academia and industry. due 0121623. thematic real-time environmental data distributed services (thredds). institution: university corporation for atmospheric research. pi: ben domenico. the ucar unidata program is developing an organizational and software infrastructure, thematic real-time environmental distributed data services (thredds), to enable educators and researchers to locate, analyze, visualize, and publish a wide variety of environmental data in both their classrooms and laboratories. due 0121636*. teachers domain collection: life sciences, k-12. institution: wgbh educational foundation. pi: michelle korf. teachersdomain.org (w.t.), a digital library collection, is being developed to harness wgbh's extensive broadcast, video, and interactive programming resources in the life sciences to support standards-based teaching and learning from elementary through secondary school. due 0121650*. national digital library for technological literacy. institution: international technology education association. pi: brigitte valesey. the international technology education association (itea) and the eisenhower national clearinghouse (enc) are establishing a broad and deep digital collection of resources that supports the teaching and learning of technological literacy at the k-12 levels. due 0121669. an active mathematical software collection for inquiry-based computational science and engineering education. institution: university of tennessee knoxville. pi: jack dongarra. this project is aggregating high-quality numerical software for science and engineering education to support a rich, highly interactive, and inquiry-based learning environment needed to enable learners at various levels to master the use of numerical methods and software libraries. due 0121677*. gender and science digital library. institution: education development center. pi: katherine hanson. the gender and science digital library (gsdl), a collaborative project between the gender and diversities institute at education development center, inc. and the eisenhower national clearinghouse (enc) at ohio state university, is creating a high-quality, interactive library of k-16 gender and science resources. the gsdl assists educators in promoting and implementing gender-equitable science education in both formal and informal settings. due 0121679**. computing and information technology interactive digital educational library (citidel). institution: virginia polytechnic institute and state university. pi: ed fox. 
a consortium of institutions is building the computing and information technology interactive digital education library (citidel) to serve the computing education community in all its diversity and at all levels. domain areas of coverage include computer engineering, computer science, information science, information systems, information technology, and software engineering. due 0121684. bridging the gap between libraries and data archives. institution: university of california-san diego. pi: brian schottlaender. an oceanography collection that provides access to nearly 50 years of shipboard data and integrates historical documents, samples, research publications, and maps from global databases is being developed into a searchable digital library. modern information technologies are employed to bridge the gap between content-related, but disparate holdings within libraries, data archives, and historical archives. due 0121691. geotechnical, rock and water resources library towards a national civil engineering resources library. institution: university of arizona. pi: muniram budhu. this project is promoting widespread access to quality information, resources, and activities in support of learning, teaching, and research in the areas of geotechnical engineering, rock engineering, and water and its use. the effort is envisioned as a component of a larger national civil engineering digital library providing "one-stop" access to resources to meet the learning, teaching and research needs of a wide audience including higher education, professionals, and the community at large. due 0121699. reciprocal net -a distributed molecular database. institution: indiana university bloomington. pi: john huffman. the reciprocal net project is constructing and deploying an extensive distributed and open digital collection of molecular structures. associated with the collection are software tools for visualizing, interacting with, and rendering printable images of the contents; software for the automated conversion of local database representations into standard formats which can be globally shared; tools and components for constructing educational modules based on the collection; and examples of such modules as the beginning of a public repository for educational materials based on the collection. due 0121703**. electronic encyclopedia of earthquakes. institution: university of southern california. pi: tom jordan. this project is expanding a pilot version of the web-based electronic encyclopedia of earthquakes (e^3) into a major portal for students, educators, and others seeking information about the science of earthquakes, earthquake engineering, and the practical aspects of hazard characterization and loss reduction. due 0121709. the digital archive network for anthropology (dana). institution: north dakota state university fargo. pi: clark. this award is supporting a network infrastructure to link distributed databases with content of relevance to the domain of anthropology. due 0121724*. water in the earth system (wes): an nsdl k-12 collection project. institution: colorado state university. pi: edward geary. a collection of digital k-12 materials and resources built around the theme of "water in the earth system (wes)" is being created to enhance the ability of k-12 teachers, students, and parents to easily find, access, and use high-quality, standards-based water resources in their classrooms, at home, and in informal learning environments. due 0121749**. ethnomathematics digital library. 
institution: pacific resources for education and learning. pi: nancy lane. this project is identifying, collecting, cataloging, and organizing high quality ethnomathematics curriculum and instructional materials, research articles, and other professional resources of interest to elementary, secondary and tertiary students and teachers, curriculum developers, researchers, and members of institutions of higher education. the library provides users with a premier and readily accessible source of documents and materials describing the mathematical constructs created and used by indigenous cultures around the world. due 0127580. collaborative project: enhancing the interoperability of collections and services. institution: university of california berkeley. pi: alice agogino. faculty at a number of institutions are collaborating to develop and implement several information technological solutions aimed at enhancing the interoperability of both collections and services for the nsdl. a particular emphasis is on exploring the requirements for supporting "tightly federated" collections, featuring close adherence to particular metadata frameworks that enable federated search services to be built. in this collaborative work a team from the university of california berkeley is working primarily on collection interoperability while a team from the university of missouri columbia is focusing its efforts on enhancing the interoperability of services (see due 0126690). services track due 0121267. the internet scout project's personalized content delivery system. institution: university of wisconsin. pi: john strikwerda. this project is developing a personalized content delivery system (pcds) to promote and showcase nsdl materials and services, while at the same time collecting and disseminating information about the best new online smet resources from outside the nsdl. project objectives include: i) development of a set of "current awareness" services that deliver information about high quality, online smet resources (from both nsdl projects and outside sources) to nsdl users in a fashion that best suits their needs and ii) extension and improvement of access to nsdl collections and services. due 0121460. a lightweight, flexible, and web-based approach to supporting workflow in digital libraries. institution: university of colorado at boulder. pi: ken anderson. this project is using recent advances in event messaging systems, hypermedia, and web-based technologies (such as xml), to create a lightweight workflow technology to help the stakeholders of a digital library coordinate their shared activities when distributed across different organizations and different times. due 0121525**. developing virtual reference desk capabilities for the nsdl. institution: syracuse university. pi: richard lankes. the goal of this project is to integrate human expertise with virtual reference desk capabilities to support the library. a variety of evaluation methodologies are being used to help develop a better understanding of the digital reference process. due 0121531**. virtual telescopes in education (tie). institution: university of maryland baltimore county. pi: susan hoban. 
this project is seamlessly integrating telescopes equipped with remote access and control capabilities into one virtual observatory by providing the services required to operate this facility, including a scheduling service, tools for data manipulation, an online proposal review environment, an online "virtual tie student astrophysical journal" for publication of results, and access to related educational materials provided by the tie community. this effort leverages ongoing projects of the collaborators to increase access to astronomy education at the k-12 and undergraduate levels, especially among traditionally under-served communities. due 0121543 and 0121717. collaborative project: standard connection mapping nsdl educational objects to content standards. institutions: syracuse university and university of washington. pis: liz liddy and stuart sutton. in this project a team of investigators is building a middleware tool for the automatic assignment of content standards and benchmarks from the compendium of standards and benchmarks developed by mid-continent research for education and learning (mcrel) and the achieve standards database to educational resources in the current collections nsdl program and to others harvested from the internet. due 0121550. data discovery toolkit and foundry. institution: new media studio. pi: bruce caron. this nsdl services project is developing an initial set of "data discovery" tools to enable students to manipulate real (and real-time if needed) data sets for visualization purposes. in addition, an internet-based community center for further tool building, the "foundry", is being created. due 0121575. increasing effective student use of the scientific journal literature. institution: university of tennessee knoxville. pi: carol tenopir. this project is developing features and sample exemplars of interactive learning modules to help undergraduate users recognize, access, and evaluate high quality scientific and technical information such as those contained in the so-called "grey literature" of technical reports and abstracts generated by various government laboratories. due 0121578**. textual geospatial integration services for nsdl. institution: university of california-santa barbara. pi: jim frew. operational services are being constructed that make it easy for users of the nsdl to find and integrate semantically related information in heterogeneously represented items. of particular interest is integrated access to information occurring in collections of texts and collections of maps and images that is semantically related in terms of the geospatial features and regions. users can find maps or images containing features or places that are referred to implicitly within texts; or they can find texts that implicitly reference features and places contained in maps and images. due 0121596. decentralized image retrieval for education (direct). institution: university of virginia main campus. pi: scott acton. a peer-to-peer content based image retrieval (cbir) service is being developed to enable a user to designate a query image so that the library may be searched for images of similar content. image matching is done not by text metadata but by the color, texture, and shape of the image objects. due 0121656. an oai-compliant federated physics digital library for the nsdl. institution: old dominion university research foundation. pi: kurt maly. 
this project is designing and building a cross-archive searching service, based on the open archives initiative, that federates heterogeneous collections having metadata that differ in richness, syntax, and semantics. the initial focus is on federating collections related to physics; however, the techniques and algorithms developed under this proposal will be general enough to work for building a federation service for other communities as well. due 0121692. digital library services for american indians. institution: american indian higher education consortium. pi: carrie billy. this award supports a planning conference to engage a group of tribal colleges in dialogue and spark project ideas to ensure that the collections and services of the nsdl are both sensitive to the needs of the native-american population and reflective of that community's contributions to the smet educational enterprise. due 0121741. intelligent collection services for and about educators and students: logging, spidering, analysis and visualization. institution: university of arizona. pi: hsinchun chen. the goal of this project is to develop and enhance retrieval, analysis, categorization, and visualization tools to augment the knowledge, skills and abilities that users bring in various degrees to the information search process. in addition, transaction logs from the library are used by teachers to gauge the use of resources and students' paths through the learning experience; such transaction logs can also be valuable to students as a record of their learning experience within a given course or throughout their educational career. due 0126690. collaborative project: enhancing the interoperability of collections and services. institution: university of missouri at columbia. pi: su-shing chen. in this collaborative work a team from the university of california berkeley is working primarily on collection interoperability while a team from the university of missouri columbia is focusing its efforts on enhancing the interoperability of services (see due 0127580). targeted research track due 0121527. design and evaluation of maintenance tools for distributed digital libraries. institution: texas engineering experiment station. pi: frank shipman. this targeted research project is investigating and developing tools and social protocols to make more feasible the management and maintenance of "author-based" distributed digital library collections, in which material is placed into the library by the author, and librarians (collection managers) organize and annotate it for the library patrons. due 0121635** and 0121743. collaborative research: developing a learner-centered metathesaurus for science, mathematics, engineering and technology education. institutions: university of maryland baltimore county and university of california berkeley. pis: bill wood and alice agogino. this collaborative project is investigating the creation of a learner-centered metathesaurus based on the analysis of transactions between learners and different types of learning resources. three modes of interaction are being considered. the first involves data in which many students respond to the same problem (the "problem of the week" at the math forum). the second interactional mode involves data from transactions between learners and experts (the math forum's "ask dr. math" feature). 
the final interactional mode under study uses collections at a third resource, www.smete.org, where learner usage is more in the traditional library mode of search-retrieve. due 0121769. developing a national science digital library (nsdl) libqual+ protocol. institution: association of research libraries. pi: duane webster. under the auspices of the association of research libraries this project is modifying and repurposing the existing libqual+ protocol for assessing the services provided for the user community of the nsdl program.
notes
1. all views expressed in this article are solely those of the author and do not represent an official nsf policy statement.
2. nsdl program .
doi: 10.1045/november2001-zia

cross-language evaluation forum clef 2006
d-lib magazine october 2006 volume 12 number 10 issn 1082-9873
carol peters, istituto di scienza e tecnologie dell'informazione, consiglio nazionale delle ricerche, pisa
the results of the seventh campaign of the cross-language evaluation forum were presented at a two-and-a-half day workshop held in alicante, spain, 20-22 september, immediately following the tenth european conference on digital libraries (ecdl 2006). the workshop was attended by over 130 researchers and system developers from academia and industry. the main objectives of the cross-language evaluation forum (clef) are to stimulate the development of mono- and multilingual information retrieval systems for european languages and to contribute to the building of a research community in the multidisciplinary area of multilingual information access (mlia). these objectives are realised through the organisation of annual evaluation campaigns and workshops. the scope of clef has gradually expanded over the years. while in the early years the main interest was in textual document retrieval, the focus has now diversified to include different kinds of text retrieval across languages and on different kinds of media (i.e., not just plain text but collections containing images and speech as well). in addition, attention is given to issues of system usability and user satisfaction, with tasks designed to measure the effectiveness of interactive systems.
evaluation tracks in clef 2006
eight tracks were offered to evaluate the performance of systems for:
- mono-, bi- and multilingual document retrieval on news collections (ad-hoc)
- mono- and cross-language retrieval on structured scientific data (domain-specific)
- interactive cross-language retrieval (iclef)
- multiple language question answering (qa@clef)
- cross-language retrieval on image collections (imageclef)
- cross-language speech retrieval (cl-sr)
- multilingual web retrieval (webclef)
- cross-language geographic retrieval (geoclef).
test suites
most of the tracks adopt a corpus-based automatic scoring method for the assessment of system performance. the test collections consist of sets of statements representing information needs known as topics (queries) and collections of documents (corpora). system performance is evaluated by judging the documents retrieved in response to a topic with respect to their relevance (relevance assessments) and computing recall and precision measures. the following document collections were used in clef 2006:
- clef multilingual comparable corpus of more than 2 million news documents in 12 european languages
- clef domain-specific corpora: english/german and russian social science databases
- malach collection of spontaneous speech in english and czech, derived from the shoah archives
- eurogov, ca. 3.5 million web pages crawled from european governmental sites
the imageclef track used collections for both general photographic and medical image retrieval:
- iapr tc-12 photo database; ltu photo collection for image annotation
- imageclefmed radiological database; irma collection for automatic image annotation
diverse sets of topics or queries were prepared in many languages according to the needs of the various tracks. for example, the ad hoc track used document collections in nine different european languages, and topics were offered in more than fifteen languages, including non-european languages such as amharic, hindi, indonesian, oromo and telugu.
participation
participation was up again this year, with 90 groups submitting results for one or more of the different tracks: 60 from europe, 14 from north america, 10 from asia, 4 from south america and 2 from australia.
[figure 1]
workshop
the campaign culminated in the workshop held in alicante, 20-22 september. the workshop was divided between plenary track overviews and parallel poster and breakout sessions. in her opening talk (see figure 1), carol peters, the clef coordinator, stressed the need for more technical transfer activities. she commented that although many advances had been made in the multilingual information access research field, there were still few real-world operational cross-language systems. in her opinion, clef should be paying more attention to issues that directly concern the user and the needs of the application communities rather than focusing most attention on system performance only in terms of precision and recall. in fact, one of the most interesting activities this year was the real-time question answering exercise, organised on-site by fernando llopis and elisa noguera, u. alicante (figure 2). here the aim was to examine the ability of question answering systems to respond within a time constraint. the need for more technical transfer was taken up again in the final session in two talks. martin braschler, u. applied sciences winterthur, switzerland, gave an insightful talk on "what mlia applications can learn from evaluation campaigns" while fredric gey from u.c.
berkeley, usa, summarised some of the main conclusions of the mlia workshop at sigir 2006 in seattle, where much of the discussion was concentrated on problems involved in building and marketing commercial mlia systems. there was also an invited talk by noriko kando, national institute of informatics, tokyo, japan, on new evaluation activities at the ntcir evaluation initiative for asian languages.
[figure 2]
results
clef has been running since 2000. the main results over the years can be summarised in the following points:
- stimulation of research activity in new, previously unexplored areas, such as cross-language question answering, image and geographic information retrieval
- study and implementation of evaluation methodologies for diverse types of cross-language ir systems
- documented improvement in system performance for cross-language text retrieval systems
- creation of a large set of empirical data about multilingual information access from the user perspective
- quantitative and qualitative evidence with respect to best practice in cross-language system development
- creation of important, reusable test collections for system benchmarking
- building of a strong, multidisciplinary research community
the presentations given at the clef workshops and detailed reports on the experiments of clef 2006 and previous years can be found on the clef website at . the preliminary agenda for clef 2007 will be available from mid-november. clef is an activity of the delos network of excellence for digital libraries. the clef website is at .
copyright © 2006 carol peters
doi:10.1045/october2006-peters

author addenda: an examination of five alternatives
d-lib magazine november 2006 volume 12 number 11 issn 1082-9873
peter b. hirtle, cornell university library
the problem
when an author publishes a book or a paper, many publishers ask the author to transfer all copyrights in the work to the publisher. but that is not always to the author's advantage. when authors assign to publishers all of the rights that comprise the bundle of rights known as copyright, they lose control over their scholarly output. assignment of copyright ownership may limit the ability of authors to incorporate elements into future articles and books. authors may not be able to use their own work in their teaching, or to authorize others at the institution or elsewhere to use materials. unless addressed in the transfer agreement, the publisher may forbid an author to do the following: post the work to the author's own web site, an institutional repository, or a subject-based repository. copy the work for distribution to students. use the work as the basis for future articles or other works. give permission for the work to be used in a course at the author's institution. grant permission to faculty and students at other universities to use the material. for all of the above reasons, many organizations and institutions have encouraged authors to better manage their copyrights. the board of regents of the university of texas, for example, requests that authors manage their copyrights for the benefit of "the authors, the citizens of texas, state government, the component institutions, and the u. t. system" [1].
maintaining some rights may be to the benefit of the author and his or her institution, but how can this be done?
one solution: the author's addendum
until recently, the primary method that authors could use to retain some rights in their writings was to rewrite the contract with the publishers themselves. thanks to the development of standardized author addenda, the task has become much simpler. an author's addendum is a standardized legal instrument that modifies the publisher's agreement and allows the author to keep key rights. the addenda usually spell out what rights the author does or does not have in several key areas: the extent of the author's ability to continue to use the copyrighted work even after the transfer of copyright to a publisher, including the ability of the author to make copies of the work or prepare new works based on the copyrighted work. the author's ability to authorize others to use the work. whether and when the author's institution can make any use of the work. whether and when the author's funding agency can make use of the work. when and under what circumstances, if any, people at other institutions can use the work. what legal protections are available to the author. three different organizations – mit, science commons (through its scholar's copyright project), and sparc – have worked with lawyers to develop self-sufficient addenda that address these issues. these addenda can be attached to the publishing contracts received from publishers and are likely to be legally binding. the five addenda and my abbreviations for them are:
- mit amendment to publication agreement [2] [mit]
- scholar's copyright openaccess-creativecommons 1.0 addendum [3] [sc1]
- scholar's copyright openaccess-publish 1.0 addendum [4] [sc2]
- scholar's copyright openaccess-delay 1.0 addendum [5] [sc3]
- sparc author's addendum [6] [sparc]
in addition to these five addenda, other flavors of publication amendments are beginning to appear. for example, ohiolink has developed its own modification of the mit amendment [7]. ohiolink's most important change was to increase the scope of mit's amendment. the mit amendment, like the other four addenda listed above, is limited to contracts for the publication of articles. ohiolink modified the agreement so that it can also apply to books, book chapters, poems, musical compositions, and other creations [8]. at heart, however, the ohiolink template is the same as the mit amendment and so will not be considered separately in this analysis. similarly, deborah r. gerhardt of the university of north carolina has slightly modified the sparc author's addendum for use by its faculty [9]. in addition, kenneth crews, the director of the copyright management center at indiana university – purdue university – indianapolis, has devised two addenda that can be attached to publishing contracts [10]. the privileges found in crews's addenda are also found in many of the addenda discussed in this article, but crews's proposals lack the comprehensive structure that would make them easy to implement. for that reason, crews's addenda will not be considered here.
a comparison of the five author addenda
the following charts highlight the major features of each of the addenda, with the addenda ranked in order from the one that offers the most rights to the author to the one that is the least permissive.
author's use of the work
key: "+" = permitted; "o" = allowed in certain cases; "?" = unclear; an addendum not listed for a row is not allowed or not addressed.
order is from most permissive to least permissive.
author can reproduce, distribute, perform, and display the work:
    in teaching & professionally: sc1 +, sparc o, mit +, sc2 +, sc3 +
    non-commercial: sc1 +, sparc +, mit o, sc2 o, sc3 o
    unrestricted: mit o, sc2 o, sc3 o
author can make derivatives:
    in teaching & professionally: sc1 +, sparc +, mit +, sc2 +, sc3 +
    non-commercial: sc1 +, sparc +, mit o, sc2 +, sc3 +
    unrestricted: sc1 +, sparc +, mit o, sc2 +, sc3 +
table 1: author's rights
table 1 addresses what rights the author retains under each addendum to use her work even after copyright is transferred to the publisher. all of the addenda allow authors to reproduce, distribute, perform and display their work in conjunction with their teaching and professional activities. the sparc agreement, however, only authorizes non-commercial use by the author; it would not authorize an author to use the work in a for-profit education endeavor, consulting business, or other commercial professional enterprise. all of the agreements would also allow authors to make non-commercial use of the material, though in the case of mit, sc2, and sc3, these non-commercial purposes are limited to the author's "teaching, conference presentations, lectures, other scholarly works, and professional activities." an author, for example, could not make copies of her article to give to her family and friends. this would be a non-commercial distribution unrelated to her professional activities. none of the addenda allow the author to make totally unrestricted use (including commercial use) of the work. because the mit, sc2, and sc3 addenda do allow authors to use their works in conjunction with teaching and professional activities, in theory an author could sell copies of an article – or perhaps even authorize another publisher to republish it – if she felt it contributed to her professional development. being able to use some or all of one's work as a basis for another publication is an important right most authors want to retain. at my institution, for example, a faculty member recently was asked to pay $400 in order to include a figure from one of his articles in a subsequent publication. another faculty member was told that he could not publish a second edition of a book because the publisher of the first edition, to whom he had transferred copyright, worried it would compete with another title on their list. the right to utilize a work in later creations is defined as the right to make derivatives. all of the addenda ensure that an author can freely use prior work in subsequent publications – though once again the mit agreement limits the author's rights to scholarly, academic, and professional activities. for example, other addenda would allow an author to turn his article into a hollywood movie; the mit agreement would not.
authorizing others to use one's work
author can authorize others to reproduce, distribute, perform, and display the work:
    non-commercial: sc1 +, sparc +
    unrestricted: none
author can authorize others to make derivatives:
    non-commercial: sc1 +, sparc +
    unrestricted: none
table 2: author's authorization rights
table 2 addresses the ability of an author to authorize others (outside the author's own institution) to make use of a work even after the author has transferred copyright to a publisher. an author may, for example, want to give a colleague at another institution permission to distribute an article to students in the colleague's class.
similarly, the author might want to be able to grant permission to include a chart or figure from the article in a colleague's subsequent publication. only the sc1 and sparc addenda secure for the author the right to authorize others to use her work. both allow the author to retain the right to authorize others to make non-commercial use of her work so long as the author receives credit as the author and the journal in which an article was originally published is cited as the source of first publication of the article. the other addenda do not allow authors to authorize others to use their works, and no addendum permits the author to authorize commercial use of the work.
institutional/funding agency
author's institution can use the work (if authorized):
    in teaching & professionally: sc1 +, mit o, sc2 +, sc3 +
    non-commercial: sc1 +, sparc +, mit o, sc2 +, sc3 +
    unrestricted: sc1 +, mit o, sc2 +, sc3 +
author's funding agency can use the work (if authorized):
    in teaching & professionally: sc1 +, sc2 +, sc3 +
    non-commercial: sc1 +, sparc +, sc2 +, sc3 +
    unrestricted: sc1 +, sc2 +, sc3 +
table 3: use by the author's institution and/or funding agency
table 3 depicts the rights of the author's employing institution and the funding agency that supported the research that led to the creation of the article to use the work (other than in an institutional or subject repository, which is discussed in table 4). while most universities concede ownership of scholarly work produced by their faculty to the authors of those works, there is still a sense that to some extent the university contributes to the creation of the work and should be able to have some interest in it. similarly, agreements with funding agencies will often stipulate that the funding agency retains some rights in the work (such as the right to distribute copies if it wishes). yet a standard copyright transfer agreement may not recognize or permit these grants of rights by the author prior to the transfer of copyright to the publisher. the copyright transfer agreement might state explicitly that there have been no such prior grants of copyright rights or, more commonly, that the publisher will have the exclusive right to exercise all copyright rights. an author who licensed some non-exclusive rights to her institution or funding agency, and then later asserted to a publisher that she was transferring all rights of copyright to that publisher, could potentially be held liable for breach of contract with the publisher. the science commons licenses are the best by far in acknowledging that an author may be subject to license terms made prior to the transfer of copyright. all three stipulate that any non-exclusive grants of copyright rights to the author's employing institution or funding agency made prior to the transfer of copyright to the publisher are still valid. this protects authors from any warranty clauses that may exist in the transfer agreement with the publisher. there is no limitation on the scope of rights that the author can give to his institution or funding agency – other than that they be non-exclusive. the special grant of rights to the author's institution or funding agency must have taken place prior to the transfer of copyright to the publisher. after the transfer takes place, the institution or funding agency has no more rights than any other user, and the terms in table 2 would apply. the mit amendment also singles out the employing institution for special treatment.
it allows the author to grant his employing institution the right to use the work "in connection with teaching, digital repositories, conference presentations, lectures, other scholarly works, and all academic and professional activities." there is no requirement that these uses be limited to non-commercial activity, and the ability of the author to grant these rights extends even after the transfer of copyright to the publisher takes place. the mit amendment's focus on the "employing institution" could be a source of some confusion in the future, however. a faculty member who left mit for cornell, for example, should under the mit amendment be able to authorize use at cornell of an article written by the author while she was at mit (since cornell is now her employing institution). she may not, however, be able to authorize any new uses of the work at mit, the institution that originally supported the work. the mit amendment is silent on any special rights for funding agencies. the sparc addendum is silent on the issue of special rights for the author's institution or funding agency. in the absence of specific language, each would be treated like any other user according to the terms found in table 2. by not recognizing that prior licenses may have occurred, it also leaves the author open to charges of violating boilerplate warranties about exclusivity that may be found in the copyright transfer agreement.
institutional and subject open access repositories (immediate deposit):
    mss prior to editorial review: sc1 o, sparc o, mit o, sc2 +
    mss with editorial changes: sc1 +, sparc +, mit +, sc2 +, sc3 o
    publisher's copy: sparc +, mit o, sc2 +, sc3 +
table 4: institutional and repository rights
table 4 addresses a special category of institutional and funding agency rights: namely, the ability of authors to deposit copies of articles in institutional or subject-based open access repositories, such as dspace or pubmed central. only sc2 distinguishes between preprints (versions of articles prior to refereeing and acceptance) and post-prints (versions of articles incorporating the changes made as a result of journal review and acceptance) and stipulates that an author can distribute either from a non-commercial web site. it is possible to read between the lines of the other addenda, however, to see how they address the issue of whether preprints can be posted to preprint servers. the mit amendment would seem to limit preprints to personal websites; sc1 and sparc would seem to allow authors to post their articles to personal, institutional, or open access web sites for non-commercial purposes. both require that the postings of preprints be in support of the author's academic and professional activities. sc3 speaks only of the post-print version of the paper, and so an author using this addendum only has clear authority to distribute that version (though an author could use her rights described in table 3 to post a pre-print to an institutional repository – but not a subject repository). all of the addenda allow the posting of the post-print version of the article on a personal, institutional, or open access web server, i.e., "the version of the paper after peer-review, with revisions having been made" [11]. sc3 has an important caveat: the published version of articles may not be posted any earlier than six months after publication. what about the article as published by the publisher? does the author have the right to distribute the publisher's pdf version of the article?
sparc, sc2, and sc3 all stipulate that authors can distribute the published version of the article. sparc states explicitly that it can be the publisher's pdf. sc2 and sc3 define the "published version" to mean "the version of the article distributed by publisher to subscribers or readers of the journal." the mit amendment states that once an article has been published, the author has the right to make "the final published version of the article available in digital form," but does not make clear whether by final published version they mean the post-print manuscript or the actual pdf or html document as published. even if we assume that mit, sc2, and sc3 allow authors to post the pdf version of an article as published by the publisher, authors face a problem: how to legally acquire the pdf. sparc is the clearest on this: the publisher must supply the author with a pdf copy of an article within fourteen days of publication. an author using the sc2 or sc3 addendum presumably could use an electronic copy received as part of a subscription to the journal – or even scan a printed copy and distribute the scans. a problem could arise, however, if authors attempted to distribute from a web site the version of the article taken from one of the databases that are often licensed by libraries for campus use. many of these databases consist of content from a number of publishers that has been aggregated by a vendor, and the library signs a license agreement not with the publishers but with the aggregator. it is quite possible that while downloading and then distributing a copy of an article from an aggregator's database would not be a copyright violation (since an addendum could give the author that right), it might still be a violation of the terms and conditions associated with the use of the database and could leave the faculty member and/or the university open to a possible breach of contract suit by the aggregator.
what is missing from the addenda?
any of these addenda would offer authors rights that they may not have in the standard publication agreement. nevertheless, there are important lacunae in all of them. for example, none of these addenda explicitly address the rights of third parties in general to use the material once copyright has been transferred to the publisher. the mit amendment is particularly noteworthy in this regard – it ensures that people at mit can use articles in their teaching and research, but the amendment does not consider at all similar uses by the larger scholarly and educational communities. creative commons licenses [12] are becoming a standardized means to stipulate what rights the author thinks a third party should have to be able to use the article without having to ask permission. it would have been possible to include in the sc1 and sparc addenda a statement that the articles are being published under a creative commons attribution-noncommercial license, and that this notice should appear in the article. rather than specifying the extent of user rights in the transfer agreement, the addenda leave it up to the authors to decide on a case-by-case basis whether the author wishes to authorize non-commercial use on an attribution basis. to their credit, however, they at least give the author the option of granting the rights found in the creative commons attribution-noncommercial license.
in addition, while each of the addenda allows authors to distribute articles from institutional or subject repositories, none of them gives authors explicit permission to grant those repositories the rights they need in order to preserve the articles over time. each repository will need to make backup copies of the articles, migrate the content of the articles over time (as current file formats become obsolete), and in some cases allow users to select from a variety of file formats. the standard dspace installation includes a license in which the author authorizes the dspace repository to make these preservation copies, but none of the addenda under discussion explicitly gives the author the authority to sign such a license after a copyright transfer has been made. and it is unclear if the terms of the addenda will map perfectly to the terms of the repository deposit agreement. lastly, some of these addenda are highly technology specific. the sc2 and sc3 addenda, for example, state that the author can distribute copies of an article "by means of any web server from which members of the general public can download copies without charge." web servers are the distribution technology of today, but fifteen years ago the solution would have been a gopher server or anonymous ftp indexed by archie. there is no assurance that fifteen years hence, web servers (or even the internet) will be the preferred form of author-based distribution.
which addendum to use?
which of these addenda should authors seek to employ? first, rather than using any addendum, authors should consider retaining copyright and giving the publisher instead an exclusive license for first publication. a recent british survey of scientific publishers indicated that 39% of publishers do not require authors to transfer copyrights in order to publish [13]. retaining copyright and only licensing to the publisher those rights the publisher needs in order to publish and manage an article is clearly the solution that is in the author's interest [14]. it would be nice to know if publishers are willing to accept any of these addenda, or if there is one that most publishers prefer. unfortunately, all of these addenda are so new that there is no evidence as to which, if any, of them are acceptable to publishers (though science commons hopes to collect such data). anecdotal evidence from cornell seems to indicate that the sparc addendum's requirement that the publisher provide a pdf within two weeks of publication is enough of a managerial headache that publishers are refusing to accept the addendum – but this is based on a very small number of cases. in the absence of any hard data on which addendum is the most acceptable to publishers, it makes sense for authors to seek to have the most control possible over the use of their materials. the analysis conducted for this paper would suggest that the sc1 addendum is the solution that would provide the greatest flexibility to authors while at the same time possibly being acceptable to publishers. on the other hand, sparc recently launched a major education initiative entitled "author rights" centered on its addendum [6]. it may be that the visibility of sparc's campaign will force publishers to accept the sparc addendum – even though this analysis suggests that sc1 offers authors slightly more protections.
what if the publisher rejects the author addendum?
here are some good suggestions from sparc [15]: explain to the publisher why it is important for you to retain these rights in your own work. ask the publisher to articulate why the license rights provided under the author's addendum are insufficient to allow publication. evaluate the adequacy of the publisher's response in light of the reasonable and growing need for authors to retain certain key rights to their works. consider publishing with an organization that will facilitate the widest dissemination of their authors' works, to help them fulfill their personal and professional goals as scholars. summary and recommendations while not perfect, author addenda can be an important tool that authors can use to retain the rights they want or that their employing institutions request that they retain. they can be an important component in ensuring open access to articles from either an institutional or a subject repository. the scholar's copyright addenda can also provide legal protection to authors who have contributed preprints to such repositories and then do not notice that the contract they sign with the publisher asks them to guarantee that the publisher will be able to exercise exclusive rights in the work. of the addenda analyzed, sc1, the scholar's copyright openaccess-creativecommons 1.0 addendum, provides the greatest benefit to authors. in the absence of data on acceptance by publishers, authors may wish to submit it to the publishers along with the publisher's copyright transfer agreement. publishers should as well indicate to authors which addenda they will accept or indicate in their copyright transfer agreements which addendum's terms have been incorporated into the agreement. finally the sponsors of addenda should stop issuing their own custom versions of the documents, which legal departments at publishers must then analyze on a case-by-case basis, but instead sponsors should agree on a few standard addenda that all can use. acknowledgments i thank john saylor and thinh nguyen for their advice and support of this project, though the opinions, findings, conclusions, and mistakes found in this article are mine alone. the author is not a lawyer, and his analysis of author addenda, which are contracts, is for informational and educational purposes only and should not be taken as legal advice. authors who wish legal advice on the exact rights they have under a copyright transfer agreement and/or an addendum to that agreement should consult with a lawyer. notes [1] copyright management: policy and guidelines for management and marketing of copyrighted works, . [2] . [3] . [4] . [5] . [6] . [7] . [8] other changes made by ohiolink to the mit amendment include dropping the ability of authors to make the final published version of works available and increasing the rights granted to employing institutions by allowing them to use the work outside the confines of the employing institution. in addition, the ohiolink template includes as optional two provisions found in the sparc addendum: the requirement that the publisher provide a pdf of the article as published within fourteen days of publication, and language stipulating the acceptance of the terms of the agreement by the publisher. [9] . [10] crews, k. and wong, d., "reserving rights of use in works submitted for publication: negotiating publishing agreements," available at: . [11] my definition of pre-print and post-print follows the sherpa/romeo project. see . [12] . [13] cox, j. 
and cox, l., "scholarly publishing practice: academic journal publishers' policies and practices in online publishing. second survey, 2005, executive summary." (pp. 3-4) worthing, west sussex: alpsp (association of learned and professional society publishers). . [14] jisc and surf recently collaborated on a tool box that can help authors generate a license to publish. see . [15] .
copyright © 2006 peter b. hirtle. this work is licensed under the creative commons attribution-noncommercial-sharealike 2.5 license: http://creativecommons.org/licenses/by-nc-sa/2.5/.
doi:10.1045/november2006-hirtle

growing institutional support for data citation: results of a partnership between griffith university and the australian national data service
d-lib magazine november/december 2013 volume 19, number 11/12
natasha simons, griffith university, brisbane, australia, n.simons@griffith.edu.au
karen visser, australian national data service, canberra, australia, karen.visser@ands.org.au
sam searle, griffith university, brisbane, australia, samantha.searle@griffith.edu.au
doi:10.1045/november2013-simons
abstract
data is increasingly recognised as a valuable product of research and a number of international initiatives are underway to ensure it is better managed, connected, published, discovered, cited and reused. within this context, data citation is an emergent practice rather than a norm of scholarly attribution. in 2012, a data citation project at griffith university funded by the australian national data service (ands) commenced that aimed to: enhance existing infrastructure for data citation at the university; test methodologies for tracking impact; and provide targeted outreach to researchers about the benefits of data citation. the project extended previous collaboration between griffith and ands that built infrastructure at the university to assign doi names (digital object identifiers) to research data produced by griffith's researchers. this article reports on the findings of the project and provides a case study of what can be achieved at the institutional level to support data citation.
introduction
there is a growing recognition that research data is a first class output of research and, as such, needs to take its place amongst more traditional measures of scholarly endeavours, such as journal articles. the amount of money poured into research on a global scale is significant and major funders are increasingly recognising data as a valuable product of research. as the volume and complexity of data continues to grow, various governments and research councils have responded to the "data deluge" by outlining policy and principles that support better management of research data and by providing funding to assist research institutions in achieving this goal. the result is a range of international initiatives that have a shared goal of ensuring that research data can be well managed, connected, published, discovered, cited and reused.
within the context of improved management of data produced through research, data citation is a relatively new concept. data citation refers to the practice of citing research datasets and collections in the same way that other types of information, such as articles and books, are cited. citation is the norm for scholarly acknowledgement of publications and, since data is a first-class output of research, ideally data would be treated in the same way. in addition to providing a reward structure for sharing data, data citation allows for the identification, retrieval, replication and verification of data underlying published studies [1]. the international not-for-profit organisation datacite suggests, "data citation can help by: enabling easy reuse and verification of data; allowing the impact of data to be tracked; creating a scholarly structure that recognises and rewards data producers." over the past few years, various governments and funding agencies have included or expanded references to data management, access and, more recently, citation in their policies and programs. for example, the research councils uk state in their "common principles of data policy" that "all users of research data should acknowledge the sources of their data" [2], and the national science foundation (us) now allows for citable data to be listed as a product of research, like a journal article [3]. the dryad data repository, in partnership with various journals, provides a framework for data deposit and data citation in conjunction with article publication. however, this type of initiative is more the exception than the rule. at the level of cross-disciplinary research institutions, such relationships with journals are rare, and developing policy and improving technical infrastructure to enable routine data deposit and long-term data management is the much-needed prerequisite to data citation. on a global scale, a cultural shift is required within the scholarly community in order for data citation to become the norm. as the practice of data citation evolves, partnerships and shared learning among institutions and between countries are critical. an outstanding example of collaboration and shared learning is the partnership between the australian national data service (ands) and griffith university in developing infrastructure, assessing impact measurement tools, and engaging with researchers to support data citation practices.
australian initiatives in data management and citation
enabling support for researchers and institutions in building a culture of data citation is an objective of ands. an initiative funded by the australian government, ands is building the australian research data commons: a cohesive collection of research resources from all research institutions, to make better use of australia's research data outputs. research data australia, ands' flagship service, provides a comprehensive window into the australian research data commons, with a rapidly growing collection of almost 90,000 records of australian research data collections. this discovery service is designed to provide rich connections between data, projects, publications, researchers and institutions, and to promote visibility of australian research data collections in search engines. within the overall goals of the organisation, "an important aim of ands is to enable more researchers to re-use research data more often.
to achieve this aim ands is engaged in activities that will make it easier to share data, to recognise the importance of making data available and to make data citation a standard procedure." [4]
figure 1: data citation poster by the australian national data service.
to assist institutions in managing persistent access to data, and facilitating data citation practices, ands provides a "cite my data" service. this machine-to-machine service allows ands-partner institutions to mint doi® names (digital object identifiers) free of charge for their datasets, using datacite (of which ands is a partner) as the doi registration agency. this process has been facilitated by the provision of extensive support materials on the ands website, including step-by-step guides and responses to faqs from australian research institutions. ands has also facilitated the growth of an australian data citation community of practice by sponsoring a series of workshops, meetings and webinars that has drawn on the experience of the international and local doi and data citation community. further, ands has provided funding to a small number of institutions to assist them in developing infrastructure and guidance for researchers that supports building a culture of data citation.
ands and griffith university: a case study in collaboration
in 2012, griffith university's division of information services began a new project funded by the australian national data service that aimed to: enhance existing infrastructure for data citation at the university; test methodologies for tracking impact; and provide targeted outreach to researchers about the benefits of data citation. the project, known as the data citation project, extended previous collaboration between griffith and ands that built infrastructure at the university to assign dois to research datasets and data collections produced by griffith's researchers. concluding in may 2013, the data citation project was the first of its kind at an australian university, though similar initiatives have been made at other australian research institutions such as the csiro and the australian antarctic data centre. the project generated new experiences and findings regarding data citation infrastructure, impact and practice.
developing doi infrastructure
dois are globally unique, resolvable persistent identifiers that form an important component of the research data infrastructure. persistent identifiers are critical in managing access to online resources so that links are not broken and resources are not lost. there are a large number of persistent identification schemes available, and it is useful to select among them using criteria that include uniqueness, trustworthiness, reliability, scalability, flexibility, and transparency to users of the scheme. within the global context of improving management of, and access to, research data, there is a growing international effort to improve citation of research data using the doi system. datacite is the international not-for-profit organisation playing a leading role in this effort, and as a partner of datacite, ands is also a key contributor. datacite promotes the use of dois in data citation as a way of: helping researchers track reuse of their data; helping data centres establish a mechanism that supports discovery and reuse; and supporting publishers with an elegant link between an article and its underlying data [5].
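to make the "machine-to-machine" nature of doi minting concrete, the sketch below shows roughly what such a request can look like. it is illustrative only: it uses the public datacite rest api rather than the ands cite my data endpoint itself (whose exact interface is not described in this article), and the repository account, password, doi suffix and landing-page url are invented placeholders.

```python
# rough sketch of a doi-minting request from a repository script.
# the datacite rest api is used here as a stand-in for the ands "cite my data"
# service; repository id, password, doi suffix and landing page are placeholders.
import requests

DATACITE_API = "https://api.datacite.org/dois"
REPO_ID = "EXAMPLE.GRIFFITH"   # hypothetical repository account
REPO_PASSWORD = "change-me"    # placeholder credential

payload = {
    "data": {
        "type": "dois",
        "attributes": {
            # the 10.4225 prefix appears in the doi cited later in this article;
            # the suffix below is invented for illustration.
            "doi": "10.4225/example/0001",
            "creators": [{"name": "beal, cara"}, {"name": "stewart, rodney anthony"}],
            "titles": [{"title": "south east queensland domestic water usage collection"}],
            "publisher": "smart water research centre",
            "publicationYear": 2013,
            "types": {"resourceTypeGeneral": "Dataset"},
            "url": "https://research-hub.example.edu.au/collections/0001",  # landing page (placeholder)
            "event": "publish",  # register the doi and make it findable
        },
    }
}

response = requests.post(
    DATACITE_API,
    json=payload,
    auth=(REPO_ID, REPO_PASSWORD),
    headers={"Content-Type": "application/vnd.api+json"},
)
response.raise_for_status()
print("minted doi:", response.json()["data"]["id"])
```

whichever registration agency sits behind the call, the essential pattern is the same: authenticate as a registered repository, send a small metadata record together with the landing-page url that the doi should resolve to, and receive a globally resolvable identifier in return.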
the benefits of assigning dois to datasets and data collections also extend beyond their value in the context of data citation. assigning dois to research data collections enhances the concept of data as being a valued research output, to be managed persistently for the long-term. dois require a commitment to maintaining links to the data and therefore signal an institution's willingness to manage the data for the foreseeable future. dois are also routinely assigned to publications, in fact the system began in the publishing industry. when applied to data, they indicate that data is to be treated with the same respect as publications — to be well managed, persistently available and cited over the scholarly lifecycle of the research. finally, dois are key to the collection of citation metrics and altmetrics. while citation metrics track formal citations, altmetrics tools such as impactstory use dois to track mentions in social media and non-traditional scholarly communications across the web. without a doi, this tracking is made more difficult.   developing resources, policy and procedures for dois since september 2011, ands and griffith university have worked closely together to ensure that emergent practices associated with dois attached to australian research datasets are aligned with world practices and are, in fact, setting standards for this global movement. griffith university was the first to test the cite my data service, providing feedback and advice to ands and contributing their experiences and ideas to the emerging data citation community of practice. the griffith experience revealed that minting dois was technically straightforward, however it raised a number of questions such as: what material should have a doi, how to manage versioning, level of granularity, landing page resolution, metadata requirements and data citation format [6]. griffith sought answers to these questions from ands, as well as the national and international communities of practice. this resulted in the development of the "digital object identifiers (dois): introduction and management guide", a document that provides a framework for minting and maintaining dois at the institution. the target audience for the document is internal; it begins with an overview of the doi system, data citation and the ands service. the doi management guide section outlines the business rules for minting dois and makes clear the agreed approach to granularity, versioning, citation format and so on. the document includes a section on the datacite metadata schema, as a minimal amount of metadata is required to mint a doi, and concludes with a technical summary that points to the scripts developed to mint dois for griffith data collection and to information on the ands service. drawing heavily on the griffith university experience, ands developed an extensive matrix of publicly available, support materials for both data citation and dois. these resources include general technical and informational materials as well as in-depth materials on specific topics. the materials are presented in a variety of formats including "how to" checklists, guides, current examples of national and international practice, academic articles, webinars, recordings and linked webpages. several webinars featured international perspectives and speakers, such as dr heather piwowar (impactstory), dr louise corti (uk data archive) and ryan scherle (dryad). 
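the guide's point that only a minimal amount of metadata is required to mint a doi, and that an agreed citation format can be derived directly from that metadata, can be illustrated with a small sketch. the record below is invented; the field names simply mirror the datacite-style core elements of creator, title, publisher, publication year and identifier.

# invented example record holding datacite-style minimal metadata
record = {
    "creators": ["a. researcher", "b. researcher"],
    "title": "example sensor readings collection",
    "publisher": "example research centre",
    "publication_year": "2013",
    "identifier": "http://doi.org/10.4225/xx/xxxxxxxx",  # placeholder doi
}

def format_citation(rec):
    """assemble a citation element: creator (year): title. publisher. identifier."""
    creators = ", ".join(rec["creators"])
    return "{} ({}): {}. {}. {}".format(
        creators,
        rec["publication_year"],
        rec["title"],
        rec["publisher"],
        rec["identifier"],
    )

print(format_citation(record))
# a. researcher, b. researcher (2013): example sensor readings collection.
# example research centre. http://doi.org/10.4225/xx/xxxxxxxx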
as data citation is of interest to both researchers and institutions, and sometimes for quite different reasons, there are resources specifically for researchers [7] and institutions [8]. once a doi has been assigned, it is used as the web link in a citation element that is included in the metadata record describing the data collection. for example: cara beal, rodney anthony stewart ([2013]): south east queensland domestic water usage collection (1st spring 2012 read). smart water research centre. http://doi.org/10.4225/01/513e57e0f1577. this record is made available for discovery in the griffith research hub. the hub has addressed the need for a comprehensive view of the institution's research output and contains profile pages for researchers and their associated publications, projects, collections, groups and so on. the citation element is formed according to the basic datacite guidelines for a data citation [9] and it is included in the data collection metadata record that griffith then provides to the ands research data australia discovery portal. griffith has a roadmap for future doi-related activities including: further work to embed doi minting into automated workflows for data deposit in the repository; assigning dois to grey literature such as theses; a watching brief on a number of international developments including the odin project and altmetrics tools; and a future review of guidelines and rules.   assessing impact tools as part of the data citation project, griffith arranged a one-month trial of the thomson reuters data citation index during april 2013. the dci forms part of the web of knowledge platform and was launched in november 2012. it provides subscription-only access to metrics associated with research data from global repositories covering multiple disciplines. the altmetrics tool impactstory was also trialled as part of the project and in connection with orcid identifiers [10]. results were shared with the data citation community of practice via a webinar in june 2013 [11] and via the project blog [12]. a key conclusion of these trials is that bibliometric and altmetrics tools may yield few, if any, results for datasets at this point in time. the reasons for this are multi-faceted and include: the lack of mandates for deposit of data; the early stage of data citation as a practice among researchers; delay between the availability of a dataset, its use and subsequent citation; and the need to expand the bibliometric product to include datasets from the ands research data australia service. however, it is clear that data citation is a new scholarly practice and metrics for data citation will change considerably in the near future.   engaging with researchers and librarians as part of the ands-funded data citation project, griffith provided targeted outreach to researchers about the benefits of data citation. the outreach strategy included conversing with subject librarians about citation practices in different disciplines, introducing data citation as part of a standard consultation with a specific research group, and engaging with researchers at the point of data deposit into the institution's data repository. while a formal qualitative study was not part of the project, anecdotal evidence gathered from outreach activities suggests that a number of factors contribute to researchers' openness to discussing how sharing their data might contribute to the impact of their own research. 
this could include the types of publishing outlets in their discipline, their target audiences, and the processes by which their work is currently assessed. age and career stage may also be a factor, with early career researchers seemingly more receptive. subject librarians may also benefit from data citation awareness sessions, as this is still a new and evolving area. figure 2: screenshot from youtube video of griffith university webinar on data citation the project found that adequate guidance for constructing a citation for data is an area that requires further attention in formal guides and manuals. citation guides do not cover the processes of citing data well, if at all. mooney and newton (2012) conducted a content analysis of journal articles, author instructions, style manuals and data publishers finding that "roughly half of journals point citation toward a style manual that addresses data citation, but the majority of journal articles failed to include an adequate citation to data used in secondary analysis studies" [13]. additionally, in many universities training for citation practices is targeted at new undergraduates, a group that are fairly unlikely to be generating or re-using datasets until later in their academic careers. the shortcomings of journal policies and citation style guides (which may also feed into the templates commonly used in reference management software) combined with a lack of training opportunities are likely to inhibit the growth of data citation as a scholarly norm. action in addressing these issues requires international attention in conjunction with small-scale institutional initiatives. in terms of workflows for data deposit and citation, the project looked at the way that dryad promotes data citation through their notifications to researchers following data deposit. as part of their community outreach, dryad provides the text of the author notifications in their submission workflow. the dryad model allows for data deposit and the linking of a publication to the underlying data to occur as part of the journal article publication. at griffith, however, there is currently no mechanism for alerting the eresearch services team as to when a researcher is about to publish an article. while the dryad model is therefore not applicable at the institutional level, griffith aims to adopt author notifications that are similar to those used by dryad in the future, as part of the automated self-deposit process for research data that is currently under development.   the big picture the formal evidence base for the benefits of data citation is still minimal. griffith university is a cross-disciplinary institution and generalising the evidence base for data citation to researchers from disciplines not included in the studies risks easy dismissal of the conclusions. benefits for researchers are not the same as benefits for institutions or for funders, and this needs to be kept in mind when communicating about citation benefits with people who may feel increasingly pressured by the multiple efforts already in place to measure the value of their research. griffith's experience suggests there is a need for more comprehensive and compelling evidence of the benefits of data citation for researchers from all disciplines. developing a culture of routine data citation is intricately linked to routine data deposit and data management practices. 
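as noted above, griffith intends to adopt dryad-style author notifications as part of the automated self-deposit process under development. purely as an illustration of how such a notification could reuse the citation element already held in the metadata record, a hypothetical sketch follows; none of the names below correspond to an existing griffith or dryad system.

# hypothetical post-deposit notification, loosely modelled on dryad's practice
NOTIFICATION_TEMPLATE = (
    "thank you for depositing your data.\n"
    "your collection has been assigned the doi {doi}.\n"
    "please cite the data in any related article as:\n"
    "  {citation}\n"
)

def build_notification(doi, citation):
    """fill the template; delivery (email, repository message) is left to the deposit system."""
    return NOTIFICATION_TEMPLATE.format(doi=doi, citation=citation)

print(build_notification(
    "10.4225/xx/xxxxxxxx",  # placeholder doi
    "a. researcher (2013): example sensor readings collection. "
    "example research centre. http://doi.org/10.4225/xx/xxxxxxxx",
))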
griffith's new best practice guidelines for researchers: managing research data and primary materials [14] incorporates data citation as part of a holistic view of data management, and over time, information and training materials that reflect this concept and practice are expected to improve. as a result of collaboration with ands, griffith has the infrastructure to mint dois for datasets, and include a data citation element in records made available through the griffith research hub and research data australia.   future plans given the success of their collaboration to date, griffith university and ands have a range of initiatives planned for 2013 and beyond. these include: increase the uptake of the cite my data service to all australian research institutions with the capacity to mint and maintain dois. strengthen and expand the data citation community of practice through the data citation webinar series and at various workshops, conferences and events. continue to improve the cite my data service and respond to community feedback. continue international contributions and ideas exchange with datacite, particularly regarding the odin project. work together to enable a single feed from research data australia to the thomson reuters data citation index and other developing products.   figure 3: overview of data citation activities at griffith university   conclusion data citation is an emergent practice, yet it has the potential to become a norm of scholarly attribution, in line with improved data management, access, discovery and reuse. at the institutional level, support for data citation includes developing infrastructure and providing guidance for researchers, as part of a broader strategy of improved data management and deposit. the partnership between ands and griffith university provides a case study of what can be achieved through collaboration and mutual support. however, there are a number of factors beyond the capability of a single institution to address that will determine how well established a culture of data citation can become, such as funding agency mandates for data deposit and citation, the policies of scholarly publishers and the provision of adequate style guides. collective action and strong leadership is required to address these issues over a period of time.   references [1] mooney, h, newton, mp. (2012). the anatomy of a data citation: discovery, reuse, and credit. journal of librarianship and scholarly communication 1(1):ep1035. http://doi.org/10.7710/2162-3309.1035 [2] research councils uk. rcuk common principles of data policy. [3] national science foundation. (2013). proposal and award policies and procedures guide. [4] ands data citation. [5] datacite: creating a global citation framework for data. [6] simons, n. (2012). implementing dois for research data. d-lib magazine. volume 18 issue 5/6. http://doi.org/10.1045/may2012-simons [7] ands data citation for researchers. [8] ands data citation for institutions. [9] datacite. why cite data? [10] orcid connecting research and researchers. [11] what griffith university are doing to establish a culture of data citation. youtube. [12] data citation and impact at griffith university blog. [13] mooney, h, newton, mp. (2012). the anatomy of a data citation: discovery, reuse, and credit. journal of librarianship and scholarly communication 1(1):ep1035, p. 1. http://doi.org/10.7710/2162-3309.1035 [14] griffith university. (2013). best practice guidelines for researchers: managing research data and primary materials.   
about the authors natasha simons is senior project manager in the division of information services at griffith university, brisbane, australia. she has managed the griffith research hub project, in addition to other projects funded by the australian national data service. previously, natasha worked at the national library of australia in a variety of roles, including manager of australian research online. natasha is a member of the council of australian university librarians research advisory committee.   karen visser is program leader for skills and policy at the australian national data service (ands), where she works to develop awareness of the benefits to researchers and institutions of making data citation a routine scholarly practice.   sam searle has been the eresearch senior specialist (information management) at griffith university since october 2012. she contributes information management expertise to software development projects, and coordinates a range of activities related to research data management. she has previously worked at monash university library, the office of the information commissioner (qld), victoria university of wellington, and the national library of new zealand, and in other universities in australia, new zealand and scotland in a range of research, archives and publishing roles.   copyright © 2013 natasha simons, karen visser and sam searle   d-lib magazine november/december 2008 volume 14 number 11/12 issn 1082-9873 electronic journals and changes in scholarly article seeking and reading patterns   carol tenopir, chancellor's professor university of tennessee donald w. king, distinguished research professor university of north carolina at chapel hill abstract a recent article by james evans in science (evans 2008) is being widely discussed in the science and publishing communities. evans' in-depth research on citations in over 34 million articles, and on how online availability affects citing patterns, found that the more issues of a journal are available online, the fewer articles in that journal are cited. if the journal is available for free online, it is cited even less. evans attributes this phenomenon to more searching and less browsing (which he feels eliminates marginally relevant articles that may have been found by browsing) and the ability to follow links to see what other authors are citing. he concludes that electronic journals have resulted in a narrowing of scientific citation patterns. this brief article expands on the evidence cited by evans (boyce et al. 2004; tenopir et al. 2004) based on the authors' ongoing surveys of academic readers of scholarly articles. reading patterns and citation patterns differ, as faculty read many more articles than they ultimately cite and read for many purposes in addition to research and writing. the number of articles read has steadily increased over the last three decades, so the actual number of articles found by browsing has not decreased much, even though the percentage of readings found by searching has increased. readings from library-provided electronic journals have increased substantially, while readings of older articles have recently increased somewhat. ironically, reading patterns have broadened with electronic journals at the same time citing patterns have narrowed.
introduction over the past four decades (1977 to 2005) the authors have conducted surveys of thousands of scientists including engineers, medical researchers, and social scientists (in addition to other professionals and students). these surveys included two national surveys under national science foundation (nsf) contract, and surveys in universities and other settings.1 all of these surveys applied the same questionnaire (with modest modifications to reflect technological changes) first developed under contract to the national science foundation in 1977 (king et al. 1981). (examples of the questionnaires can be found at .) the surveys ask some questions about the scientist-respondents, such as how many articles were read in the past month and demographics, but most questions deal with a "critical incident" of the last article read. questions about the last article read focus on factors such as the age of the article, time spent reading this article, how the reader became aware of it, where it was obtained, format of the article, purpose of reading, and so on. trends of the information-seeking and reading patterns of university science faculty over nearly thirty years reveal where reading patterns have potentially broadened or narrowed science. in a study published in science, evans (2008) examined a massive set of over 34 million articles and their citations to track the range of articles and journals cited in scientific articles and to compare changes in citing patterns with the growing availability of e-journals. ironically, evans found that there is a negative correlation between online availability of journals and how often they are cited. he attributed this narrowing of science to many things, including a decline in browsing and rise in searching or citation linking (citing boyce et al. 2004 and tenopir et al. 2004). perhaps the process of citation linking and relevance ranking based on citation counts provides either a quality filter or a convenient tendency to follow what others have cited, hence resulting in a narrowing of citation patterns of scientists (for further discussion see tenopir 2008). for astrophysicists, this narrowing of citing patterns began even before the widespread availability of e-journals, but it continues today at the same time reading is increasing (kurtz et al. 2000, kurtz et al. 2005; henneken et al. 2008). while citing patterns are clearly narrowing as these recent studies show, a more in-depth look at reading patterns adds to the picture and to the overall question of how electronic journals may be impacting the scientific endeavor. do citing patterns follow reading patterns, or are they quite distinct? this article provides evidence that reading patterns follow a different path than citing patterns.2 analysis of reading surveys conducted from 1977 through 2005 show that university science faculty on average: read more in less time per reading, rely less on browsing and more on searching, rely more on library provided articles than from other sources, and, because they make choices based on what helps them get their work done, will readily adapt to new technologies that are convenient to their information-seeking, reading, and work patterns. amount of reading since 1977 university science faculty have increased their number of readings in each survey time period observed. annual average reading is calculated by multiplying the average monthly amount of reading that was reported times 12 to get an average yearly number of readings (see figure 1). 
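the annualisation is straightforward to reproduce. the article does not state annual totals directly, but they can be backed out from the total reading hours and per-reading times reported in the following paragraphs (approximately 120 hours at 48 minutes per reading in 1977, and 144 hours at 31 minutes per reading in 2005); a rough check, with the implied monthly figures shown alongside:

# back out approximate annual readings from the totals reported in this article:
# ~120 hours at 48 minutes per reading in 1977, ~144 hours at 31 minutes in 2005
for year, total_hours, minutes_per_reading in [(1977, 120, 48), (2005, 144, 31)]:
    readings_per_year = total_hours * 60 / minutes_per_reading
    readings_per_month = readings_per_year / 12  # the survey asks about the past month
    print(year, round(readings_per_year), round(readings_per_month, 1))
# 1977: roughly 150 readings a year (about 12.5 reported per month)
# 2005: roughly 279 readings a year (about 23.2 reported per month)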
university faculty in 2005 report nearly twice as many readings as they did thirty years ago. some of this corresponds to the concomitant growth in the number of journals and journal articles, meaning that faculty must read more just to read the same percentage of the literature in their subject discipline (tenopir and king 2000). figure 1. average number of article readings per year per u.s. university science faculty member by year of survey time spent reading while the average number of article readings per scientist is increasing, the average time spent per reading is declining. in the 2005 surveys, scientists reported spending an average of 31 minutes per reading, down from 48 minutes in 1977. multiplying the average time spent per reading by the number of article readings shows that the total commitment to reading by u.s. science faculty increases from approximately 120 hours annually in 1977 to 144 hours annually in 2005. it appears that the amount of time available for reading scientific articles may be reaching a maximum capacity. information seeking patterns means of identifying articles read science faculty members use many ways to become aware of and locate articles. for example, they browse through the tables of contents in print or electronic journals, typically for current awareness; they search for information in online search engines, e-journal systems, aggregated full-text databases, and indices to identify new topics or for teaching, research, and writing. following citation links in print and electronic journals is also employed, as are recommendations from a colleague or some other person. the relative importance of those various means has changed since the 1970s (see table 1).
table 1. how u.s. science faculty learned about articles they last read: proportion of reading (%) by year of survey
method of learning about articles           1977 (n=2,350)   1984 (n=865)   1993 (n=70)   2000-03 (n=397)   2005 (n=884)
browsing                                    58.5             54.1           56.6          48.7              33.9
automated searching                         0.7              1.1            9.4           23.7              23.1
other searches (e.g., a & i)                10.4             11.0           11.4          3.1               4.9
citations                                   6.7              13.1           7.5           11.5              14.9
persons (e.g., colleagues, authors, etc.)   17.7             15.3           11.3          13.0              18.5
other                                       6.0              5.4            3.8           -                 4.7
total                                       100.0            100.0          100.0         100.0             100.0
as evans noted, the proportion of reading by u.s. science faculty from browsing decreased in recent years, replaced by other means of learning about articles that are read. while the proportion of readings from browsing decreased over the years, the number of readings found by browsing remains about the same: 88 readings in 1977 and 95 in 2005. readings from searches increased from 17 to 78 readings between these two years. in 2005, over half of browsed readings (58.4%) continue to be from print subscriptions. searching in 1977 was primarily from a&i publications, tables-of-contents and other alerting tools, and by other means. most searching in 2005 was from electronic sources (92.6% of readings from searching), although some searching continues from a&i print publications (1.9% of readings from searching). sources of articles read the average number of personal subscriptions reported in surveys of scientists in non-university settings has decreased steadily over time, from six in 1977 to under three personal subscriptions on average per scientist by 2003. for u.s. university science faculty, the average number of personal subscriptions has remained about the same: 4.2 subscriptions per scientist in 1977 and 4.1 in 2005.
the proportion of readings by university scientists that come from personal subscriptions has steadily declined, at a much steeper rate than the decline in number of personal subscriptions. readings from library collections in particular have made up for the decline in reading from personal subscriptions, followed by readings from other sources such as web sites and separate articles from colleagues. the changes in proportion of article readings from personal subscriptions, library-provided sources, and other sources are shown in table 2.
table 2. sources used by u.s. science faculty to obtain article they last read. proportion of readings (%) by years of survey
source of article           1977 (n=2,350)   1984 (n=865)   1993 (n=70)   2000-03 (n=397)   2005 (n=924)
personal subscriptions      60.0             53.0           35.5          38.6              22.6
library-provided articles   24.8             30.1           53.8          43.6              62.0
other                       15.2             16.9           10.7          17.8              15.4
total                       100.0            100.0          100.0         100.0             100.0
library-provided articles include those obtained from library collections or school or department collections (often supported from the main library) and from interlibrary loan or document delivery. "other sources" in 2005 include article copies obtained from a colleague, author, etc., preprints or reprints, or from an author or other website. in 1977, "other sources" were largely reprints, preprints and photocopies provided by authors or publishers. library-provided articles are replacing personal subscriptions as a source for articles. the picture is even clearer when comparing the number of readings in 1977 and 2005 (figure 2). figure 2. sources used by u.s. science faculty to obtain article they last read by number of readings in 1977 and 2005 the number of readings increased by an estimated 130 readings per scientist from 1977 to 2005. the "other sources" increased by about 20 readings (from 23 to 43), which might be attributable to open access initiatives since the current 43 readings are 11 from preprints; 19 copies provided by authors, colleagues, etc.; 4 from an author website and 2 from other websites (plus 7 from an unspecified source). reading from personal subscriptions decreased about 27 readings per faculty scientist. some of the decrease represents a drop in personal subscriptions, but most is attributable to less reading per subscription (i.e., about 21 readings per subscription in 1977 to 15 in 2005). most remarkable is that readings from library-provided articles increased by 137 readings, which is comparable to the net increase in readings overall (130 readings). much of this increase is due to electronic journal or aggregation collections in libraries. in 1977 university scientists read on average at least one article from about 13 journal titles. in 2005 at least one article was read from 33 different journal titles. this increase is due in part to the availability of more journal titles in library electronic collections, but it is also because there are more access points available to obtain separate copies of articles such as author websites and preprint databases (i.e., about 43 readings in 2005 compared with 23 in 1977). reading patterns are highly skewed; some of these specific sources are used infrequently, thus expanding the number of sources used at least once. over half of readings in 2005 are from electronic sources (59.5% vs. 40.5% from print sources). however, as shown in figure 3, personal subscription readings frequently continue to be from print issues.
on the other hand, most library-provided articles and other sources are read from electronic versions. figure 3. format of articles read by u.s. science faculty in 2005 by format of sources (n=923) purpose of reading the most frequent principal purpose of reading is research (48.5% of readings), followed by teaching (22.5%), writing (articles, reports, proposals, etc. – 10.8%), and current awareness/keeping up (8.0%). those articles read for research or writing are most likely to be cited now or in the future, so over one-third of readings are not likely to be read for purposes that would involve citing. whatever the purpose of reading, articles are important to that purpose – about 37.8% of readings are said to be "absolutely essential" in achieving the principal purpose of reading the last article. age of article readings evans found that the age of articles cited is shortening, that is, older articles are being cited less frequently. reading patterns appear to differ from citing patterns with respect to the age of articles. table 3 provides evidence of the age of articles read by university scientists, by scientists working in non-university settings, and by scientists in surveys in which location is not established. results do not suggest a shortening of the age of articles read for these three sets of surveys conducted over the years.3
table 3. age of articles read by u.s. scientists by year(s) of observation. proportion of reading (%) by age.
type of scientists   year(s) of observation   sample size   1st year   2nd year   3 to 5 years   6 to 10 years   11 to 15 years   over 15 years
university           1993 [2,3]               70            58.5       12.3       13.9           9.3             1.5              4.6
                     2000-03 [4]              327           64.7       10.3       11.6           6.2             3.2              4.0
                     2005 [5]                 920           55.6       14.2       15.3           7.8             3.1              4.0
non-university       1984-96 [6]              982           69.4       13.6       9.5            4.8             1.1              1.7
                     1993-98 [3]              792           66.9       15.1       6.9            2.9             2.9              5.2
                     2000 [7]                 76            71.0       9.0        10.0           4.0             2.0              3.0
both                 1960 [8]                 unknown       61.5       13.3       11.0           10.2            1.7              2.3
                     1977 [1]                 2,350         67.2       14.3       10.8           4.0             1.6              2.0
                     1993-98 [3]              862           63.2       14.5       8.3            4.2             2.6              5.1
                     2002 [9]                 506           63.8       9.9        13.3           3.8             2.8              4.5
sources: [1] king et al. 1981; [2] belefant-miller and king 2001; [3] tenopir and king 2000; [4] king et al. 2003; [5] tenopir et al. 2005; [6] griffiths and king 1993; [7] tenopir and king 2002; [8] case institute of technology 1960; [9] tenopir et al. 2005
the university surveys show some variability in reading articles within the first year after publication, but not in those over 15 years after publication. it should be noted that the survey in 1993 included only 70 respondents who were scientists, and in 2005 the proportion of readings within the first year of publication by faculty at the university of tennessee was on average greater than those at the four universities in ohio that are served by the ohiolink consortium. in 2005 university faculty were asked: "did you cite this article or do you plan to cite it in a paper or report?" the articles cited or intended to be cited tend to be older than other articles read by this group of scientists (see table 4).
table 4. age of articles cited or intended to be cited by university scientists compared with other articles read in 2005 (n=857)
year published   article "cited"   other articles
2005             45.2%             54.8%
2004             17.4%             10.6%
2001-03          22.3%             14.6%
1996-2000        7.7%              8.7%
prior to 1996    9.1%              6.9%
keeping in mind that the citations may be for non-article publications, these self-reported results from reading do not suggest that age patterns of citations are changing over time.
since reading of older articles is increasing slightly, perhaps the perceptions of citing reported here differ from the reality of citing observed by evans. in the late 1970s, under nsf contract, a large random sample of articles was examined as part of a journal tracking study of journal and article attributes (king et al. 1981). (the articles were also used to survey authors). one observation was that the average age of citations varied considerably for nine fields of science designated by nsf at that time (table 5). the half-life of articles cited also varied by the field of science. these data may serve as evidence to compare age of articles cited in the 1970s or earlier with articles cited more recently.
table 5. average age of articles cited by u.s. scientists in 1977 and half-life of articles cited by field of science
average age of articles cited          half-life of articles cited
field                   age            field/discipline         age
physical science        9.7            physics                  4.6
mathematics             15.0           chemistry                8.1
computer sciences       5.9            mathematics              10.5
environmental science   8.8            geology                  11.8
engineering             7.4            chemical engineering     4.8
life science            11.2           mechanical engineering   5.2
psychology              10.4           metallurgy               3.9
social sciences         11.9           physiology               7.2
other sciences          12.2           botany                   10.0
all sciences            10.8
information-seeking patterns by age of articles read readings of older articles are different from more current readings in several respects, including method of finding out about the reading, source of the reading, and format of reading. age of articles has a bearing on how they are identified and where they are obtained as shown in tables 6 and 7 below.
table 6. how u.s. science faculty in 2005 became aware of articles by age of article read (n=835): how readers became aware of articles (%)
year published   browsing   searching   citation   another person   total
2005             52.6       20.4        7.8        19.2             100
2004             24.6       37.3        16.9       21.2             100
2001-03          21.3       39.4        19.4       20.0             100
1996-2000        15.5       39.4        25.4       19.7             100
prior to 1996    3.1        32.8        46.9       17.2             100
articles published in 2005 (prior to october/november when the surveys were done) were largely identified through browsing (52.6%), but as the articles became older, readers more frequently became aware of them by other means. articles published prior to 1996 were mostly identified through citations (46.9%) and searching (32.8%). articles published in 2005 were most often provided by libraries (46.7% of readings of these articles), and libraries become even more prominent as a source as age increases (69.0% of readings of articles published prior to 1996 are provided by libraries), while reading from personal subscriptions diminishes from 33.3% for 2005 articles to 8.5% for articles published prior to 1996. the proportion of reading from other sources (i.e., copies from authors, colleagues, etc., free web journals, preprint copies, etc.) remains about the same regardless of age (table 7).
table 7. source of articles by the age of article read by u.s. science faculty in 2005 (n=913): source of last read article (%)
year published   personal subscription   library-provided   other   total
2005             33.3                    46.7               20.0    100
2004             13.4                    68.5               18.1    100
2001-03          13.3                    64.5               22.3    100
1996-2000        6.8                     73.0               20.3    100
prior to 1996    8.5                     69.0               22.5    100
an important issue is how much impact retrospective conversion to electronic format has had on information seeking. the format of the source for the last article reading also varies by age, with the most recent and the oldest articles more likely to be from print journals than other readings (table 8).
table 8. format of articles read by u.s.
science faculty in 2005 by the age of articles (n=911): format of last read article (%)
year published   print   electronic   total
2005             46.3    53.7         100
2004             29.1    70.9         100
2001-03          33.3    66.7         100
1996-2000        31.5    68.5         100
prior to 1996    49.3    50.7         100
while readings of print and electronic versions are roughly equal for articles published in 2005, electronic versions are much more often read for articles published in the preceding nine years (1996 to 2004). for articles published prior to 1996, reading tends to revert somewhat to print, probably reflecting lower availability of electronic versions for older material. this likely will change as more backfiles of articles are digitized. conclusions the advent of digital technologies for searching and publishing over the past three decades has had a dramatic impact on information seeking and reading patterns in science. evidence from surveys conducted with u.s. science faculty shows that:
- scientists have substantially increased their number of readings.4
- most additional readings come from searching (as opposed to browsing) and citations, and from library-provided articles.
- scientists read from a much broader range of sources of articles, due to access to enlarged library electronic collections, online searching capabilities, and access to other new sources such as author websites.
- the age of articles read appears to be fairly stable over the years, with a recent increase in reading of older articles.
electronic technologies have enhanced access to older articles, demonstrated as follows:
- nearly 80% of articles over ten years old are found by online searching or from citation (linkages).
- nearly 70% of articles over ten years old are provided by libraries (mostly electronic collections).
- articles over ten years old are obtained about equally from print and electronic versions, although the trend by age appears to increase in electronic format and then decrease around ten years old.
scientists clearly have vastly improved capabilities and resources available to them for identifying and obtaining the articles they read. such capabilities and resources should result in finding and obtaining articles that better meet scientists' information needs. while citing patterns may be narrowing, reading patterns are not. this is due to several differences between why scientists read and why they cite articles. reading is done for many purposes: current awareness, teaching, and administration, in addition to research and writing. only those readings for research and writing are cited. scientists read many articles for every one that they cite. choosing the best article to cite may be subject to peer pressure in the form of choosing more often to cite those that are cited by others. following citation links in electronic journal articles may have proportionately more influence on citation behavior than reading behavior. reference list belefant-miller, helen, and donald w. king. "how, what and why science faculty read." science and technology libraries 19, no. 2 (2001): 91-112. boyce, peter, carol montgomery, carol tenopir, and donald w. king. "how electronic journals are changing patterns of use." the serials librarian 46, no. 1/2 (march 2004): 121-41. case institute of technology. an operations research study of the dissemination and use of recorded scientific information. case institute of technology, report to nsf, cleveland, ohio, 1960. evans, james a. "electronic publication and the narrowing of science and scholarship." science 321, no. 5887 (2008): 395-99. griffiths, jose-marie, and donald w. king.
special libraries: increasing the information edge. washington, d.c.: special libraries association, 1993. henneken, edwin a., michael j. kurtz, alberto accomazzi, carolyn s. grant, donna thompson, elizabeth bohlen, and stephen s. murray. "use of astronomical literature – a report on usage patterns." arxiv e-prints. eprint 0808.0103. 808 (august 2008). (submitted to journal of informetrics). . king, donald w., carol tenopir, carol hansen montgomery, and sarah e. aerni. "patterns of journal use by faculty of three diverse universities." d-lib magazine 9, no. 10 (2003). . king, donald w., nancy k. roderer, and dennis d. mcdonald. scientific journals in the united states: their production, use and economics. stroudsburg, ps: hutchinson ross publishing co. (division of academic press), 1981. kurtz, michael j., guenther eichhorn, alberto accomazzi, carolyn s. grant, stephen s. murray, and joyce m. watson. "the nasa astrophysics data system: overview." astronomy and astrophysics supplement, 143 (april 2000): 41-59. eprint arxiv: astro-ph/0002104. kurtz, michael j., guenther eichhorn, alberto accomazzi, carolyn s. grant, markus demieitner, edwin a. henneken, and stephen s. murray. "the effect of use and access on citations." information processing & management 41, no. 6 (december 2005): 1395-1402. tenopir, carol. "are electronic journals good for science?" library journal 133 (november 1, 2008). . tenopir, carol, donald w. king, and amy bush. "medical faculty's use of print and electronic journals: changes over time and comparison with other scientists." journal of the medical library association (jmla) 92, no. 2 (april 2004): 233-41. . tenopir, carol, donald w. king, peter boyce, matt grayson, and kerry-lynn paulson. "relying on electronic journals: reading patterns of astronomers." journal of the american society for information science and technology (jasist) 56, no. 8 (2005): 786-802. tenopir, carol, sheri edwards, lei wu, and donald w. king. "use of scholarly journals and articles by university faculty: changes in information seeking and reading patterns over nearly three decades." in press, aslib proceedings, february 2009. tenopir, carol, and donald w. king. "reading behaviour and electronic journals." learned publishing 15, no. 4 (october 2002): 259-65. —. towards electronic journals: realities for scientists, librarians and publishers. washington, d.c.: special libraries association, 2000. notes 1. universities specifically surveyed prior to 2005 include university of tennessee, university of pittsburgh, and drexel university. universities surveyed in october/november 2005 include university of tennessee and case-western reserve, university of akron, ashland university, and malone university (all located in ohio). surveys in other settings include companies such as at&t bell laboratories, air products and chemicals, inc., baxter heathcare, bristol-myers squibb, colgate-palmolive co., dupont, eastman chemicals co., eastman kodak co., and johnson & johnson (2 divisions); three energy companies, seven government agencies (e.g., national institutes of health, national oceanographic and atmospheric administration, department of labor) and three federal labs (oak ridge national labs, rockwell international and rocky flats). 2. more details on reading patterns and changes over time are forthcoming in "electronic journals and changes in scholarly article seeking and reading patterns" , carol tenopir, donald w. king, sheri edwards, and lei wu. aslib proceedings, 2009. 
some of the figures also appear in this article. 3. the "year published" in 2005 represents an age of up to 10 to 11 months, since the survey was conducted in october and november 2005. the ages given in table 3 were established by interpolating age on a log-normal scale, since each survey was done at different times in a year. 4. the number of readings does not equate to the number of articles read, because an article may be read many times, with each time being counted as a reading. copyright © 2008 carol tenopir and donald w. king doi:10.1045/november2008-tenopir   d-lib magazine january 2002 volume 8 number 1 issn 1082-9873 safekeeping: a cooperative approach to building a digital preservation resource   hilary berthon, susan thomas, colin webb national library of australia canberra act 2600 abstract in may 2001 a project commenced that aimed to build a distributed and permanent collection of digital resources from the field of digital preservation. all resources incorporated in this 'safekeeping' project have been selected from the padi (preserving access to digital information) subject gateway database. this article describes the first phase of the safekeeping project that is being undertaken by the national library of australia, with funding from clir (council on library and information resources). this project aims to identify significant resources in digital preservation early in their lifecycle. it also aims to facilitate the cooperative development of a distributed network of 'safekept' material, with resource owners, or parties nominated by them, providing long-term access to their material. we anticipate that a diversity of technical and organizational solutions will be employed within this project, which relies on cooperation within the digital preservation community, rather than on formal agreements, to realize an asset of communal value. this article discusses some early findings and outcomes of the safekeeping project; however, a full assessment of this approach must await evaluation over an extended period. introduction the padi (preserving access to digital information) initiative aims to provide mechanisms that will help to ensure that information in digital form is managed with appropriate consideration for preservation and future access. its website (http://www.nla.gov.au/padi) is a subject gateway to resources about digital preservation. now that digital preservation has emerged, if not into maturity, at least into a kind of toddlerhood of trial and discovery, it has become platitudinous to assert that unless consideration is given to preservation, much of our heritage in digital form will be lost. but what of the papers, projects, policies, discussions and other 'documents', accessible through the padi subject gateway, that record our evolving understanding of the challenges and solutions of digital preservation? will tomorrow's digital heritage include any account of progress made towards keeping digital information accessible? it was with the aim of facilitating the preservation of this record that the national library of australia's safekeeping project originated.
in its unique role of selecting, describing and 'bringing together' digital preservation resources, padi appeared to be well-positioned to provide a basis for such an undertaking. cooperation has characterized the development of the padi subject gateway. this has been seen as one of the initiative's strengths—according to one user, "it is extremely valuable in sharing knowledge around the world in a rapidly changing field. it helps prevent individual programs from being isolated and falling behind on new developments." padi receives advice and guidance from an advisory group, comprising experts in digital preservation from a number of countries. following the launch of padiupdate (http://www.nla.gov.au/padiupdate/) in mid-2001, resources may now be entered onto the padi database by registered users from all over the world. national library of australia staff, gerard clifton and susan thomas, inputting data on the padi database (nla photo). the safekeeping project—to build a library of digital preservation resources that will be accessible in the long term—extends this model of collaboration, relying on the application of many safekeeping strategies to form a distributed network of safekept material. it is founded on the understandings made between padi and a range of resource owners (or their providers) with the latter indicating their intention to preserve access to their own networked resources. selection a significant element of this project is the selection of resources for the padi subject gateway that provides an initial step in the selection of resources for safekeeping. however, one of the most challenging and resource intensive parts of the safekeeping project has been making decisions about which material, of all that is contributed to the padi database, will be of long-term interest or value. our 'highest significance' category includes documents that we consider to be seminal or which record a 'turning point' in thinking about digital preservation. this category includes resources such as the final report and recommendations of the us task force on archiving of digital information published in 1996 and available through the rlg website, which provides a foundation for much subsequent work. we have also included resources that define or describe an important issue, approach, project or study; or which summarize or raise important issues in digital preservation. finally, we have selected material that, while it will probably not be considered important in 10 20 years' time, we believe will have some ongoing interest for reference purposes as examples of approaches or opinions from a particular time. material selected for safekeeping may therefore be less current than other material on padi and may not reflect current practice in digital preservation. padi links to many types of resources. these include papers and articles; policies, strategies and guidelines; websites describing relevant projects, or organizations with an interest in digital preservation, and links to information about conferences, workshops and seminars. the padi database also incorporates resources such as bibliographies, glossaries, discussion lists, journals and newsletters. an 'archive' of padi's discussion list, padiforum-l, is accessible through the website. some of these resources are dynamic; others are quite 'static'. some rely heavily on external links, while others contain few or no links at all. with 118 items, the broad category of 'articles' comprised the largest component of selected materials. 
resources assigned to this category include handbooks, reviews, reports and conference papers, as well as journal articles pertaining to a variety of digital preservation issues. in addition to these materials, our selection incorporated a further 32 resources of the type 'policy, strategy or guidelines', 15 items documenting 'projects' or 'case studies', and five 'websites'. we even chose a small number of items that have printed versions, not because we believed that the information was in immediate danger of being lost, but because preserving access to digital resources is a key focus of our project. digital preservation topics included on padi range from emulation and legal deposit to topics such as intellectual property rights management, persistent identification and preservation metadata. only those resources that very closely relate to digital preservation have been selected for safekeeping, with those providing contextual information only being excluded. padi's home page. the development of selection criteria and the selection of material for safekeeping have been carried out by national library of australia staff for the initial phase of this project. however, it is intended that future selections will be distributed, with participation from overseas partners. we see immense value in this kind of cooperative collection development and anticipate that 'peer' appraisal of resources will provide an important underpinning to commitments from safekeepers. a decision to select material for preservation relies on both a knowledge of the value of an item to a whole collection as well as the technical considerations and costs associated with preserving access.1 it is in the former aspect of selection that our project is able to make a contribution. flagging significant documents early in their life—when they are identified through selection for the padi database—is a key element supporting their long-term accessibility. pilot study many of the issues that needed to be considered before our project started emerged during the course of a pilot study that commenced in september 2000. this study aimed to search for a feasible model for building a distributed digital preservation archive. over the course of several months, a number of discussions were held with the owners of eight resources published outside australia. (the national library of australia's electronic unit had agreed, in principle, to archive within pandora2 australian resources selected for safekeeping). we are very grateful to those whose generous sharing of information about how safekeeping might be applied within their own organizational and technical environment proved immensely valuable in shaping the project and provided us with much encouragement to proceed. our discussions with these pilot participants as well as the helpful guidance of padi's advisory group helped us identify some of the issues that would need to be addressed in developing safekeeping understandings—aspects such as defining roles and responsibilities, developing a common understanding of the strategies involved in safekeeping and a model for commitment to safekeeping. roles and responsibilities over the past years, the issue of who has the responsibility for preserving access to digital material has received considerable attention. our project has adopted an open stance about which party should ideally assume this responsibility. 
we have tried to encourage owners to consider making arrangements for long-term access, whether by adopting strategies themselves, or entrusting the archiving of their resources to a third party. our discussions with owners of resources have, in many instances, indicated that they have a strong interest in ensuring their material remains accessible. in the course of our project, we have been encouraged to learn that raising the issue of long-term access with resource owners has been a catalyst for defining preservation responsibility. while in some cases this has involved negotiation within an institution—for example, a library and a research department—in others, it has led to discussions between institutions. this has occurred, for example, in cases of multiple ownership or where owners are unable to make their own long-term access provisions. we have also learned that many organizations already have well-developed digital preservation strategies and that relationships and mechanisms that will facilitate the capture and preservation of digital resources have already been formed. model for cooperation another set of issues that early comments on our work raised were focussed on levels of commitment, trust and reliability. what kinds of agreements and arrangements would be necessary to support long-term access? related to this was the role that the national library of australia would play. our pilot study indicated that, while the intention of most was to provide long-term access to their material, some would be unwilling to make formal written agreements; another respondent indicated that it would be unnecessary. the model for cooperation adopted in this project is based on a shared understanding of the types of strategies needed to ensure long-term access and goodwill, rather than being reliant on formal written agreements. the nla's role, in this model, is one of encouraging owners/publishers to make arrangements to preserve their resources and describing this networked 'library' of resources. we chose to use the term 'safekeeping' in preference to 'archiving' to avoid confusion over the type of function that we were proposing to provide. our discussions prior to the project indicated that many believe that a certified archive must incorporate a fail-safe mechanism for ensuring preservation if information providers fail to maintain access to their resources.3 certified archiving agreements, in this view, must also specify when such fail-safe mechanisms would come into play. in this model, formal legally valid contractual agreements underpinned by a common understanding of archiving policy and practices would be required in order for the archive to be considered 'reliable'. at this stage, padi does not propose to assume the role of fail-safe archive, to monitor compliance with safekeeping elements or to prescribe the strategies that need to be adopted by safekeepers. while we follow with interest the progress towards implementations of such a 'reliable' archiving model, we are keen to test an alternative model. this model, built on a clear understanding of requirements for safekeeping, is, like the rest of padi, based on cooperation—collaborating because of the mutual benefits of doing so. safekeeping understandings made as part of our project are not intended as a replacement for 'secure' archiving arrangements, but our experience with padi, and with other cooperative activities, has led us to believe that the safekeeping model shows sufficient promise to pursue. 
we were keen to explore this in the absence of more reliable arrangements, believing it could encourage publishers/owners to take at least an initial responsibility that could support more reliable archiving arrangements later. the safekeeping model relies on the digital preservation community's recognition of the benefits of this cooperative endeavor, and this is perhaps assisted by the strong identification between the owners/publishers and the user community. of the 170 resources selected for safekeeping in the first phase of our project, over half are published by libraries or library organizations. the next largest group of owners/publishers is the higher education institutions (16 percent). the remainder are published by government, e-journal publishers, private organizations, research organizations and independent scholars. overlap between some of these groups made assignment to these categories difficult in some cases. interestingly, responsibility for the safekeeping of 14 resources that were not originally produced within libraries has in fact been assumed by two such institutions. in some ways the owner/publisher group in our project represent a unique community, with not only a strong interest in contributing to the development of a library of safekept material, but in many instances with safekeeping functions closely allied with institutional strategic goals and often a strong appreciation of the nature of the challenge. as noted above, the safekeeping project is underpinned by a growing body of knowledge—both theoretical and practical—gained through studies and projects that have been conducted over the past couple of years.4 we anticipate that, under the umbrella of the safekeeping project, a range of solutions—both technical and organizational—will be employed and that this diversity will be advantageous to long-term access. the notion of a distributed network of archives is a widely accepted one5 and is fundamental to a number of recent novel initiatives such as the lockss project6 and the open archives initiative.7 we believe that the issues surrounding gaining commitment are among the trickiest to negotiate, and we will not know how feasible our model is until some years have elapsed. but we believe that such attempts are worth pursuing, not just because of the value of the resource that we could build together, but also because they go to the heart of what padi aims to be—a global cooperation in digital preservation. in addition to coordinating the safekeeping of digital preservation resources on padi, the national library of australia (nla) has offered to support safekeepers' preservation strategies by providing back-up storage. in no way is this intended to replace the long-term access arrangements being made by safekeepers—rather, it is intended to further assist resource owners by supporting 'distributed redundancy' as an element in their long-term preservation plans. common understanding of long-term access requirements a shared understanding of what is to be achieved and, broadly, how it will be done is an important element of this initiative. while we did not wish to be prescriptive about strategies, our understandings need to be based on a common understanding of the types of strategies involved. as a way of encouraging discussion about preservation methods, we shared information about the types of strategies that the nla was using—or intended to use—with each of the pilot study participants. 
we invited the participants to comment on how their institutions' strategies conformed to or differed from these, or any other methods they considered to be helpful. arising from these discussions and also from work undertaken within the nla on these issues [8], we used the following elements to define our understanding of 'providing long-term access': make regular back-up copies of the material, with copied data validated and copies stored on more than one medium and in more than one 'safe' location. regularly refresh data. ensure that metadata supporting discovery, use and management of the material is created, stored and maintained. the metadata for management should include information about resource types, file formats, software required for operation, and information about changes and processes that might have been applied to the resource and their effects. ensure that persistent access to the resources is maintained even if their location changes. ensure that procedures to overcome technological obsolescence are followed; that unacceptable changes following these procedures have not been generated; and that these procedures are recorded. ensure that dynamic resources are copied according to a specified collection schedule and that 'saved' versions are distinguishable. ensure that software required to access resources is maintained until long-term strategies (4) have been applied. ensure that the permission of copyright owners has been obtained prior to copying. ensure that, should the [responsible organization] be unable to continue to provide long-term access to the resource, it hands over responsibility to some other body and notifies padi of this action. agree to the national library of australia recording and displaying information on its padi website that long-term access arrangements for the resources have been made and pointing to the location of any 'archive'. in the case of websites, indicate which areas are to be safekept (if the website is not to be safekept in its entirety). while in some instances we have encouraged safekeepers to consider preserving a record of significant changes to dynamic resources, we have left decisions such as frequency of capture to the safekeeper. similarly, assessment of the significant properties of digital resources is left to the owner/publisher in this model. recording information about safekeeping padiupdate is a shared database for inputting resources onto the padi database. safekeeping information is recorded in a number of fields in the padiupdate database that are currently available only to the system administrators at the national library of australia. this information includes the safekeeping status of the resource and the 'safekeeper'; the url or other identifier that points to a safekept version of the resource; and a field for any additional information about the safekeeping of the resource, including the date of any safekeeping understanding or notes provided by safekeepers on their ability to fulfil the elements of the provision of long-term access outlined above. resources for which safekeeping strategies have been put in place are displayed with a symbol on the padi website.
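as a rough illustration of the kind of record just described, the following sketch models the safekeeping information in python. the field names are invented for the example and do not reflect the actual padiupdate schema, which is not published here.

from dataclasses import dataclass
from datetime import date
from typing import Optional

# hypothetical field names; illustrative only, not the real padiupdate schema
@dataclass
class SafekeepingRecord:
    resource_title: str
    safekeeping_status: str                    # e.g. 'arranged', 'in negotiation', 'safekeeper sought'
    safekeeper: Optional[str] = None           # body that has accepted safekeeping responsibility
    safekept_identifier: Optional[str] = None  # url or other identifier of the safekept version
    understanding_date: Optional[date] = None  # date of any safekeeping understanding
    notes: str = ''                            # e.g. safekeeper's notes on meeting the elements above

# the kind of entry an administrator might record
example = SafekeepingRecord(
    resource_title='an example safekept resource',
    safekeeping_status='arranged',
    safekeeper='example university library',
    understanding_date=date(2001, 12, 10),
    notes='owner maintains persistent access; nla holds a back-up copy.',
)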
as noted above, we have received indications that our project has already served as a catalyst for some to consider providing long-term access to their materials. in some instances, our communications with resource owners have precipitated negotiations with a third party to take on the safekeeper role; in others, it appears that they have initiated the process of defining responsibilities. we have also found institutions that already have well-developed strategies for preserving their digital resources. the safekeeping project has provided a vehicle for sharing information about preservation strategies. many of the owners with whom we have communicated have offered information about further activities and resources related to digital preservation that, in turn, have been disseminated through padi. the safekeeping project has provided a practical test of some of the ideas and assumptions about digital preservation. this point, expressed in a number of our communications with resource owners, was articulated by one respondent who said, 'although i have been active in the field of digital preservation for several years, the issues it raises have never struck so close to me.' having selected a set of 'highly significant' material in digital preservation and set out to record details of owners' ability to preserve access, we should, theoretically, be in a position to assess how much of our record is 'at risk'. in fact, we have found that the process of developing safekeeping understandings is a slow one, and we are still far from concluding our first round of communications. in some instances, this is the result of the significant number of parties involved or the difficulty of identifying who might assume safekeeping responsibility. often institutions need to do a considerable amount of work before they can undertake a safekeeping role. at the time of writing, only four months have elapsed since our initial contact with resource owners. by mid-december 2001, of a total of 170 resources belonging to 70 owners, safekeeping arrangements had been made for 77 items. plans have also been confirmed for the australian resources to be archived on pandora, or to be safekept as part of the nla's overall strategy for the long-term preservation of its website. for 21 of the 77 safekept documents, their owners have also taken up our offer to store an extra copy of each on the nla's digital object storage system. for the remaining materials, safekeeping negotiations are in progress with 20 resource owners, whilst alternative safekeepers are being sought for a handful of items. four resource owners have replied so far that they lack the appropriate infrastructures and funding to enable them to safekeep the selected materials, at least for the foreseeable future. further assessment of our model for building a distributed and safekept digital preservation 'library' will need to await evaluation over a much longer period. future work as an endeavor based on cooperation and goodwill, maintaining relationships with the community of safekeepers will be a crucial element of our project and will also be important in our ongoing monitoring and evaluation of the project. we are also keen to further explore, in our model of collaboration, possible 'natural' paths for transferring responsibility for preserving access to material for which owners cannot, or choose not to, assume a safekeeping role.
as padi progresses into a second round of selection of material for safekeeping, we would like to explore ways of moving towards a more broad-based peer selection process with participation from partners around the world. we are also interested in examining ways of integrating selection of resources for safekeeping with the selection or review of material for padi. so, for example, when a contributor adds a resource to the padi database, it may be flagged as potentially having long-term significance, or a resource may be identified as having high significance when listed by padi's service highlighting significant new additions to the padi site. finally, we will be interested in observing how our safekeeping model interacts with other models and discovering what the primary driving forces behind successful maintenance of long-term access are. acknowledgements the authors would like to acknowledge the financial support this project has received from the council on library and information resources (clir). we would also like to acknowledge the valuable information and comments provided by participants in our pilot study and by members of padi's international advisory group. we are particularly indebted to neil beagrie, peter hirtle and don waters for their comments provided prior to the commencement of this project; however, responsibility for interpretation and use of their assistance rests entirely with the authors. notes and references [1] russell, kelly and weinberger, ellis. cost elements of digital preservation. curl, 2000. available at . [2] pandora archive: preserving and accessing networked documentary resources of australia. nla, 2001. available at . [3] task force on archiving of digital information. preserving digital information: final report and recommendations. rlg, 1996. available at . [4] e.g., cedars: curl exemplars in digital archives project. curl, 2001. available at ; nedlib: networked european deposit library. national library of the netherlands, 2000. available at ; reference model for an open archival information system (oais), ccsds 650.0-r-2, red book. ccsds, 2001. available at . [5] preserving digital information: final report and recommendations. [6] lockss. stanford university libraries, 2000. available at . [7] the open archives initiative. oai, 2001. available at . [8] safeguarding australia's web resources: guidelines for creators and publishers. national library of australia, 2000. available at ; managing web resources for persistent access. nla, 2001. available at . copyright 2002 national library of australia. doi: 10.1045/january2002-berthon   lockss: a permanent web publishing and access system d-lib magazine june 2001 volume 7 number 6 issn 1082-9873 vicky reich stanford university libraries [email protected] david s. h. rosenthal sun microsystems laboratories abstract lockss (lots of copies keep stuff safe) is a tool designed for libraries to use to ensure their community's continued access to web-published scientific journals. lockss allows libraries to take custody of the material to which they subscribe, in the same way they do for paper, and to preserve it.
by preserving it they ensure that, for their community, links and searches continue to resolve to the published material even if it is no longer available from the publisher. think of it as the digital equivalent of stacks where an authoritative copy of material is always available rather than the digital equivalent of an archive. lockss allows libraries to run web caches for specific journals. these caches collect content as it is published and are never flushed. they cooperate in a peer-to-peer network to detect and repair damaged or missing pages. the caches run on generic pc hardware using open-source software and require almost no skilled administration, making the cost of preserving a journal manageable. lockss is currently being tested at 40+ libraries worldwide with the support of 30+ publishers. the problem the web is an effective publishing medium (data sets, dynamic lists of citing papers, e-mail notification of citing papers, hyperlinks, searching). increasingly, web editions are the "version of record" and paper editions of the same titles are merely a subset of the peer reviewed scholarly discourse. scientists, librarians, and publishers are concerned that this important digital material, the record of science, will prove as evanescent as the rest of the web. in addition, each of these communities has specific needs: future generations of scientists need access to this literature for research, teaching, and learning. current and future librarians need an inexpensive, robust mechanism, which they control, to ensure their communities maintain long-term access to this essential literature. current and future publishers need assurances that their journals' editorial values and brands will be available only to authorized and authenticated readers. therefore, the problem is to preserve an authorized reader's access to the web editions of scientific journals while staying within libraries' budgets and yet respecting publishers' rights. requirements technically, any solution must satisfy three requirements: the content must be preserved as bits; access to the bits must be preserved; the ability to parse and understand the bits must be preserved. there is no single solution to this problem and, furthermore, having only a single solution would mean that materials would still be vulnerable to loss or destruction. diversity is essential to successful preservation. by proposing lockss we are not discounting other digital preservation solutions; other solutions must also be developed and deployed. in particular, we believe that there are needs in this area for both centralized "archives" and a distributed library system like lockss. for digital materials, the terms archiving, library, and preservation tend to be used interchangeably. for our purposes, we follow the dictionary in defining an archive to be a place where public records are stored and a library to be a place where people read and/or study materials. preservation is the action of keeping materials from injury or destruction. both library and archive materials need preservation. models of repositories there are two models of digital preservation and archiving repositories: centralized and decentralized. a key question to ask of each model is "what are the costs of preserving different types of materials and on whom do these costs fall?" the centralized model envisages a small number of tightly controlled repositories. 
each repository a) does the entire job of preserving content, b) requires expensive hardware, and c) requires sophisticated technical staff. to establish a centralized system, publishers and librarians must take legal and data management actions cooperatively. preservation costs are borne by a few. centralized systems are focused on preserving bits and pay less attention to ensuring access. indeed, many of these systems are explicitly "dark" archives, in which content will not be accessed until some "trigger" event (migration, publisher failure, etc.) occurs. the decentralized model envisages a large number of loosely controlled repositories. each repository or node in the system a) does some but not the whole job of preserving content, b) uses relatively inexpensive hardware, and c) needs relatively little technical expertise to maintain the hardware and software. the content at each repository is in constant use, under constant scrutiny, and undergoing continual repair. in a decentralized system, publishers take little or no action to preserve the content they publish; librarians take action to preserve access for their local communities. while libraries bear the costs of digital preservation, each participating library bears only a small fraction of the total cost, in proportion to its resources and priorities. the benefits accrue only to the participating libraries' communities. decentralized systems are focused on preserving access rather than just preserving the bits. these systems count on the redundancy inherent in distributed systems to keep the bits safe. overview of lockss lockss is a digital preservation internet appliance, not an archive. archives exist to maintain hard-to-replicate materials, and access is sacrificed to ensure preservation. lockss is more akin to a global library system. libraries hold fairly common materials in "general collections" with access as the primary goal. a key difference between lockss and "general library collections" is that the action of preserving material in the collection is intertwined with the provision of access to the end user. librarians retain paper publications to ensure long-term accessibility. one could visualize all the libraries in the world as parts of a system - a very informal, highly decentralized, highly replicated system. the primary goal of this system is to provide access to material, but providing access in this way also ensures that documents are not lost as a result of publisher takeovers, malicious actions, natural disasters, or official edicts. generally, someone at a local library will find it easy to access the paper copy of a particular book or journal. once a book or journal has been published, it is hard to "unpublish" it by finding and destroying all copies. lockss is modeled on this paper system. with the lockss model, libraries run persistent web caches. their readers can use these caches to access the journals to which the library subscribes, whether or not the journals are still available from the publisher. the system makes it easy to find a copy of an article but hard to find all the copies of the article, thus making it hard to "unpublish" it. very slowly, the lockss caches "talk to each other" to detect missing or damaged content. if a cache needs to repair content, it can get a replacement from the publisher or from one of the other caches via "inter-library loan".
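to make the access behaviour just described concrete, here is a minimal sketch in python of the fall-back rule: serve content from the publisher when it is reachable, and from the never-flushed local cache when it is not. this is an illustration only, not the lockss implementation; the in-memory dictionary stands in for the cache's persistent store.

import urllib.error
import urllib.request

# illustrative stand-in for the cache's persistent, never-flushed store
local_cache = {}  # url -> previously collected page content (bytes)

def fetch(url, timeout=10):
    """serve from the publisher when possible, otherwise from the local cache."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            content = response.read()
            local_cache[url] = content  # collected content is kept, never flushed
            return content
    except (urllib.error.URLError, OSError):
        # publisher unavailable: transparently supply the preserved copy, if we have one
        if url in local_cache:
            return local_cache[url]
        raise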
how the data flows figure 1: in this example, each lockss cache (oval) collects journal content from the publisher's web site as it is published. readers (circles) can get content from the publisher site. when the publisher's web site is not available (gray) to a local community, readers from that community get content from their local institution's cache. the caches "talk" to each other to maintain the content's integrity over time.   reader's perspective lockss' key goal is to preserve a reader's access to content published on the web. readers expect two kinds of access. they expect that: when they click on a link to it, or type in a url, the relevant page will be delivered with minimal delay and no further interaction. when they enter terms into a search engine that should match the relevant page, it will be among the returned matches. readers who use the web are learning that if a link doesn't resolve to a page, or a search engine can't find a page, further attempts to find the information the page carries are unlikely to be worth the effort. this poses problems for those who use preservation techniques that concentrate on preserving bits; the bits may be preserved yet the reader may not know how to access them, or even that the preserved bits exist. in contrast, lockss focuses on preserving the service of having links resolve to, or searches find, the relevant content. an institution using lockss to preserve access to a journal in effect runs a web cache devoted to that journal. readers use the cache as a proxy in the normal way. at intervals, the cache crawls the journal publisher's web site and pre-loads itself with newly published (but not yet read) content. just as other types of caches are invisible to their users, so is lockss. the lockss cache transparently supplies pages it is preserving even if those pages are no longer available from the original publisher's web site. an institution can include the contents of the cache among the pages indexed by its local search engine, and provide its readers with searching across all the journals to which it subscribes. at present, readers typically have to search individual collections of journals separately. librarian's perspective librarians subscribe to journals on behalf of their readers in order to provide both immediate and long-term access. with the advent of the web, for the most part libraries are forced to lease rather than own the web-based content. leasing provides immediate access but carries no guarantee of long-term access. some journals provide their peer-reviewed content through offline storage media (tape, cd-rom, paper), but then links don't resolve and searching is harder to accomplish. a major flaw with web publishing is that there has been no mechanism to implement the traditional purchase-and-own library model. lockss is changing this by demonstrating that it is both easy and affordable to operate a purchase model for web journals. the subscribing library bears costs analogous to the costs of putting paper copies on shelves, keeping track of them and lending or copying them as needed. a library using lockss to preserve access to a collection of journals pays for the equipment and staff time to run and manage a cache containing the full content of the journals. unlike normal caches, the lockss cache is never flushed and, over the long term, the full content remains accessible.
because individual libraries must pay for the preservation of the content to which they subscribe, it is essential that the price they pay be as low as possible. lockss is free, open-source software designed to run on inexpensive hardware. the machines the lockss team is using for the beta test cost less than $800 each, and each machine is capable of storing the content of five years' worth of a major journal's issues. running lockss requires so little staff time that one alpha test site complained they learned nothing about the system over the course of 10 months while running it. lockss' low cost and democratic structure (each copy is as valuable as any other) empower smaller institutions to take part in the process of digital preservation. in normal operation, an ordinary cache will only act as a proxy for, and thus supply content to, the host institution's own readers. but in a rough analog of inter-library loan, lockss caches cooperate to detect and repair damage. if damage to a page is detected, the lockss cache fetches a copy of the page from the publisher or from another cache. a lockss cache will only supply a page to another lockss cache if the requesting cache at some time in the past proved that it had the requested page. in this way, lockss prevents freeloading. those who contribute to the preservation of the journal are rewarded with continued access; those who do not contribute to the journal's preservation are not provided with replacement pages. publisher's perspective publishers want to maintain journal brand and image. they want material available for future society members and other subscribers. most publishers will save money and serve their readers better if the transition to electronic-only journals can be completed. they want to encourage libraries to purchase and/or activate online versions of journals. one major obstacle to libraries purchasing online journals is resistance to the rental model with its lack of credible assurance of long-term access. publishers are unhappy with a purchase model for electronic journals because: they fear the journal content will be illegally replicated, or leaked, on a massive scale once copies are in the custody of others; they want their access control methods enforced; and they want to retain access to reader usage data and to the record of the reader's interactions with their site. lockss solves the reader's and the librarian's problems. it enables librarians to collaborate to preserve readers' access to the content to which they subscribe, but it also addresses the publisher's concerns. because content is provided to other caches only to repair damage to content they previously held, no new leakage paths are introduced. because the reader is supplied preferentially from the publisher, with the cache only as a fallback, the publisher sees the same interactions they would have seen without lockss. lockss has other advantages from the publisher's perspective: it returns the responsibility for long-term preservation, and the corresponding costs, to the librarians. although publishers have an interest in long-term preservation, they cannot do a credible job of it themselves. failures or changes in policy by publishers are the events librarians are most interested in surviving. publishers could run lockss caches for their own journals and, by doing so, over time could audit the other caches of their journals. a non-subscriber cache would eventually reveal itself by taking part in the damage detection and repair protocol.
the mere possibility of detection should deter non-subscribers from taking part in lockss. just as the publisher cannot be sure he has found all the caches, the caches cannot be sure none of the other caches belongs to the publisher. how does lockss preserve content? lockss has two tasks in preserving content: it needs to detect, and if possible repair, any damage that occurs through hardware failure, carelessness or hostile action. it must also detect, and if possible render ineffective, any attacks. detecting and repairing damage damage to contents is a normal part of the long-term operation of a digital storage system. disks fail, software has bugs, and humans make mistakes. to detect damage, the caches holding a given part of a journal's site vote at intervals on its content. they do so by calculating digital hashes of the content and running polls on the values of these hashes: in the absence of damage, the hashes will agree. if they disagree, one of the losers calls a sequence of polls to walk down the tree of directories to locate the damaged files. when a damaged file is located, a new copy is fetched to replace it. if the file is not available from the publisher, it will be requested from one of the winning caches. if a cache receives a request for a page from another cache, it examines its memory of agreeing votes to see if the requester once agreed with it about the page in question. if the requester did, a new copy will be supplied. lockss polls are not like the elections of conventional fault-tolerant systems in which voting is mandatory. they are like opinion polls, in which only a sample of the potential electors takes part. only a sample of the lockss caches holding given content vote in any one poll. note that even caches that don't vote hear the votes of those that do. they can decide whether they agree or disagree with the majority in the poll. normally there will be no damage and each of the polls will be unanimous. if there is a small amount of random, uncoordinated damage, each poll will be a landslide, with only a few disagreeing votes. in normal landslide polls, the majority of systems will be reassured that their copy is good without voting. details of this system were presented to the 2000 usenix conference (http://lockss.stanford.edu/freenix2000/freenix2000.html). hampering the "bad guy" if there is ever a poll whose result is close, it is highly likely that a "bad guy" is trying to subvert the system. attempts at subversion are endemic in peer-to-peer systems like lockss. (see http://www.wired.com/news/infostructure/0,1377,41838,00.html.) a "bad guy's" goal might be to change the consensus about some content in the system without being detected. the changes made by a "bad guy" who infiltrated only a few caches and altered each of their contents to match would appear to be random damage. the other caches would not change their contents to match. if the "bad guy" infiltrated a substantial number of caches, even a small majority, he would cause polls whose results were close. in the event that the material was no longer available from the publisher, the caches which had not been subverted might replace their good copies with the "bad guy's" modified ones. however, the close results of the polls would alert the system's operators that something was wrong. only if the "bad guy" infiltrated the overwhelming majority of caches would his change be both effective and undetected. the remaining "good" caches would appear to have been damaged and would fetch the "bad guy's" versions.
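the two mechanisms described above - sampled hash polls that distinguish landslides from suspiciously close results, and the rule that a repair copy is supplied only to a peer that previously proved it held the same content - can be illustrated with a toy sketch in python. the thresholds and names here are arbitrary illustrations, not lockss internals.

import hashlib
import random
from collections import Counter

agreement_history = set()  # (peer_id, url) pairs: peers that once agreed with us about a page

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def run_poll(copies_by_cache: dict, sample_size: int = 5):
    """copies_by_cache maps a cache id to that cache's copy of the content."""
    voters = random.sample(list(copies_by_cache), min(sample_size, len(copies_by_cache)))
    votes = Counter(content_hash(copies_by_cache[v]) for v in voters)
    winner, winner_count = votes.most_common(1)[0]
    if winner_count == len(voters):
        return winner, 'unanimous'
    if winner_count >= 0.8 * len(voters):
        return winner, 'landslide'   # a little uncoordinated damage: the losers repair themselves
    return winner, 'close'           # suspicious: raise an alarm for the operators

def supply_repair(peer_id: str, url: str, local_copy: bytes):
    """hand out our copy only to a peer that previously proved it held this page."""
    if (peer_id, url) in agreement_history:
        return local_copy
    return None  # non-contributors get nothing, which discourages freeloading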
another obstacle for the "bad guy" is that lockss is designed to run extremely slowly. a single poll may take days. it takes many polls to have an effect. by preventing the system from doing anything quickly, we prevent it from doing what the "bad guy" wants. because it is difficult for the "bad guy" to operate without raising an alarm, it is likely that the administrators of the system would notice and react to attempts at subversion while those attempts were underway. the inter-cache communication can run very slowly because it does not delay readers' accesses to the journals, which are purely local operations. the difficulty in infiltrating an overwhelming majority of caches lies in being sure you have found enough of them. if there are a lot of caches, it will be a long time between votes from any one of them. the set of caches holding the material to be attacked will be changing over a somewhat slower time-scale as caches die and are born. because the underlying communication protocol is inherently unreliable, some caches will not hear some votes. these uncertainties work against the "bad guy". reputation system to make life even harder for the "bad guy", each lockss cache maintains a record of the behavior of the other caches in the form of a reputation. as caches are observed taking good actions, their reputation is enhanced; as they are observed taking bad actions, their reputation is degraded. when a cache tallies a poll, it will take action only if the average reputation of those voting on the winning side is high enough, as sketched below. this has three effects: a "bad guy" can only have an impact by behaving well for long enough to build up a good reputation. if the "bad guy" has to spend most of his time acting as a "good guy", the system may get sufficient benefit from the good actions to outweigh the bad ones. the "bad guy" must achieve his nefarious aims quickly, before his reputation is eroded far enough to render him ineffective. but the system is designed to run very slowly. the "bad guy" might find ways to skew the sample to include his co-conspirators and exclude the "good guys". however, unless the conspiracy overwhelms the "good guys" very quickly, the appearance of "bad guys" acting in concert will damage their reputations even faster than if they were acting alone. more detail on the struggle between good and evil in the lockss context can be found at http://lockss.stanford.edu/locksssecurity.html. running lockss the current lockss version runs on generic pcs. at current prices, a suitable machine with a 60gb disk in a 1u rack-mount case should cost about $750. the system is distributed as a bootable floppy disk. the system boots and runs linux from this floppy; there is no operating system installed on the hard disk. the first time the system boots, it asks a few questions, then writes the resulting configuration to the floppy, which is then write-locked. at any time, the system can be returned to a known-good state by rebooting it from this write-locked disk. each time the system is booted, it downloads, verifies and installs the necessary application software, including the daemon that manages the lockss cache and the java virtual machine needed to run it. the system then runs the daemon and starts the http servers that provide the user interface web pages. the cache's administrator can use these pages to specify the journal volumes to cache and monitor the system's behavior. project status up-to-date project status is available at http://lockss.stanford.edu/projectstatus.htm.
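the reputation check mentioned above might look something like the following sketch. the starting score, step size and threshold are arbitrary values chosen only to show the shape of the idea, not parameters of the real system.

reputations = {}  # cache id -> score between 0.0 and 1.0

def update_reputation(cache_id, behaved_well, step=0.05):
    """nudge a peer's reputation up for good actions and down for bad ones."""
    score = reputations.get(cache_id, 0.5) + (step if behaved_well else -step)
    reputations[cache_id] = min(1.0, max(0.0, score))

def should_act_on(winning_voters, threshold=0.6):
    """act on a poll result only if the winning side is, on average, trusted."""
    if not winning_voters:
        return False
    average = sum(reputations.get(v, 0.5) for v in winning_voters) / len(winning_voters)
    return average >= threshold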
alpha test design and development of lockss started in 1999. testing started on 6 old 100mhz pcs late that year. an "alpha" version of the software, without a user interface or any precautions against the "bad guy", ran from may 2000 through march 2001 with around 15 caches. the test content was about 160mb representing three months of aaas science online. alpha sites were stanford university, the university of california, berkeley, the los alamos national laboratory (lanl), the university of tennessee, harvard university, and columbia university. this test established that the basic mechanisms worked. the system was able to collect the test content and repair both deliberate and accidental damage to it. the system survived a fire at lanl, network disruptions at stanford, relocation of the machine at berkeley, and flaky hardware at columbia. beta test the worldwide "beta" test began in april 2001, using an almost complete implementation of the system. approximately 35 publishers are endorsing the test. over 40 libraries, with about 60 widely distributed and varyingly configured caches, have signed on to the project. they include major institutions, such as the library of congress and the british library, and smaller institutions, such as the university of otago in new zealand. beta will test the usability and performance of lockss, measure its impact on network and web-server traffic, and provide some estimates of the costs of running an individual cache and the system as a whole. the test content is a total of about 15 gb from bmj, jbc, pnas and science online. the test content is provided to the caches by shadow servers, which partially mirror the publishers' web sites. they isolate the lockss data streams and allow us to simulate journal failures. we hope in the later stages of beta to add other publishers' journals, and other types of content such as government documents. acknowledgements the stanford university libraries lockss team members are: vicky reich, tom robertson (highwire press), david rosenthal (sun microsystems), and mark seiden (consultant). the national science foundation, sun microsystems laboratories, and stanford university libraries funded development and alpha testing of lockss. the worldwide "beta" test in 2001 is made possible through a grant from the andrew w. mellon foundation, equipment donated by and support from sun microsystems laboratories, and support from stanford university libraries. we are grateful to the contributors at our alpha sites: dale flecker and stephen abrams, rick luce and mariella digiacomo, david millman and ariel glenn, bernie hurley and janet garey, chris hodges and hal clyde king, and jerry persons. special thanks are due to michael lesk, michael keller, bob sproull, and neil wilhelm. copyright 2001 vicky reich and david s. h. rosenthal. doi: 10.1045/june2001-reich
snapshots of the national science digital library annual meeting: october 18-20, 2006, washington, dc d-lib magazine december 2006 volume 12 number 12 issn 1082-9873 brad edmondson epodunk.com carol minton morris cornell university nsdl annual meeting 2006 the 2006 meeting of the national science digital library (nsdl) officially convened on october 19 at the american association for the advancement of science (aaas) building in washington, dc. a related poster session was held october 18 (please see appendix for more information about the poster session). the nsdl annual meeting location was just one block away from the metro center subway stop and four blocks from the mall, which made it easy to find a good restaurant or take an interesting walk – for example, ford's theater was just around the corner. nearly 200 people were on hand for the opening session in aaas's beautifully designed auditorium. opening session: gift to the nation "nsdl is about the dream of education shared by all of us in this room," said kaye howe, executive director of the nsdl core integration project, in her opening remarks. "it pulls together the tools we all create so that someone teaching the third grade in north dakota can do a good job for his students. and this year, we are at a point of convergence." howe's message was celebratory; she said that the road nsdl has followed since it launched in 2000 has really been a path to glory. by way of elaboration she quoted the third stanza of wallace stevens' "poems of our climate" where the poet writes: there would still remain the never-resting mind, so that one would want to escape, come back to what had been so long composed. the imperfect is our paradise. note that, in this bitterness, delight, since the imperfect is so hot in us, lies in flawed words and stubborn sounds. "all of us who are involved in nsdl know that it is coming together," she said. "our next stage will be a gift to the nation. it is a gift that you all have given." keynote speaker: dr. daniel atkins dr. atkins is the head of the national science foundation's new office of cyberinfrastructure. the office is promoting "science gateways" such as linked environments for atmospheric discovery (lead), a web portal that gives high school students, undergraduates, and the public direct access to a network of the world's fastest computers. it takes a very large, very fast computer to crunch all the variables that make an accurate forecast of a hurricane's path, for example. teragrid, the computer network behind lead, has more than 102 teraflops of computing capability and more than 15 petabytes (quadrillions of bytes) of online and archival data storage. its headquarters at the university of chicago provides rapid access and retrieval over high-performance networks linking computers on nine campuses. one of lead's main goals is to improve storm forecasting; another is to improve the public's understanding of storms. dr. daniel atkins, director, the national science foundation office of cyberinfrastructure. photograph credit: university of michigan.
the vision of atkins's office is to promote four broad outcomes, which he calls a "vision framework." these are high performance computing; data, data analysis and visualization; virtual organizations; and learning and workforce development. organizations already exist to promote these things, of course, but atkins' office is uniquely positioned to facilitate the "multi-discipline, multi-investigator, and multi-institution approach that is driving the frontier of science," he says. lead and other science gateways are research fronts that rely on massive amounts of data that must be shared and stored. they also lower the barrier for entry into these disciplines. atkins says his office's mission includes finding ways to train the people who will run the gateways and do the research. atkins uses an ancient scientific analogy to describe the activities on which his office must focus. a borromean ring is a structure of three circles that are inseparable; remove one and the other two will fall apart. the three activities are "provisioning," or the creation, deployment, and operation of advanced cyberinfrastructure (ci) networks; research and development to enhance the social and technical effectiveness of future ci environments; and sponsoring "transformative applications" that will enhance discovery and learning. several people in the audience nodded when atkins said the words "transformative application," because that is what nsdl strives to become. atkins said that in his opinion, nsdl is also like a borromean ring. one of its accomplishments is creating "a diverse and complementary community linking people in work who have often not physically met." he also said that nsdl has made important technical contributions to the field of digital libraries and is "broadening the notion of library content." atkins identified several trends his office is watching closely. one is the open education/learning resources movement. another is the reuse, remixing, and "mashup" of existing digital resources. the office is keenly interested in efforts to improve the integration of learning and discovery based on cyberinfrastructure platforms. these efforts can come from just about anywhere – and as an example, atkins also cited secondlife.com, a 3-d virtual world that is entirely built and owned by its residents. three-year-old secondlife is already "inhabited" by more than a million participants. it is increasingly used as an online location for communication, commerce and education. "people are making money in the real world by creating and selling virtual pieces of real estate on secondlife.com. these lots have great views of the virtual sea," said atkins. "so one particularly clever resident bought a piece of imaginary land right along the shore that was very narrow and infinitely long, and erected mile-high billboards on it. he then sold the land to the people with the sea views." sites like these are examples of enterprising and creative directions for the web, he said. panel discussion: "meeting web kids on their own turf" social media is turning education upside down. "kids are coming to the columbia university library now mostly to use the coffee bar," says kate wittenberg, director of columbia's electronic publishing initiative (epic). "this upsets librarians. yet the columbia library page on facebook has over 1,000 friends. they are coming to the library in new ways." (l-r) kate wittenberg, julie evans, karon weber, eric rosenbaum, and jennifer groff of the nsdl "webkids" panel.
photograph by carol minton morris. wittenberg, who is also a principal investigator (pi) for nsdl core integration at columbia, has been urging academic publishers and librarians to expand into "social media," which is a general term used to describe the places students visit to talk and interact on-line. the vast and rapidly expanding world of text messages, instant messages, e-mail, multiplayer games, facebook, myspace and others is too important for academics to ignore, she says. at the thursday morning discussion wittenberg moderated, "meeting web kids on their own turf," four experts shared their thoughts about how the nsdl community might plug in. "students no longer believe that the teacher is the font of all knowledge and students are the vessels to be filled," said julie evans, the ceo of netday/project tomorrow. evans' organization conducts large national surveys with k-12 students and teachers to hear and report on emerging ideas about educational technology. here are some highlights from those surveys: the three most popular online applications for students are games, music, and communications. sixth grade is the point when more than half of girls and boys report having weekly access to e-mail and instant messaging. middle school students are more sophisticated users of social media than high school students are. students say that instant messaging is their preferred means of communication with peers. they say that e-mail is a "storage medium." cell phones and text messaging are even more popular with students than are computer-based instant message programs. the most technically advanced students are the least likely to spend time updating a myspace page. myspace is for beginners. evans drew several conclusions. first, grown-ups who send e-mail to students are signaling that they are not within the student's circle of friends. second, students view online research as more accurate, efficient, and fun than listening to teachers. they feel they can't rely on school libraries because books are likely to be outdated. "fifth and sixth graders told us they share web pages with each other at lunchtime," said evans. "they don't have access to computers, so they write the urls on their lunch bags." students and teachers view social media differently, says evans, and this difference reduces students' access to information during school hours. when asked to name the biggest barrier to online communication in school, teachers said they needed faster internet connections. students said there are too many rules restricting access. the bottom line, said evans, is that students are native dwellers in the digital landscape. while students grew up online, their teachers have to learn new habits. this means that many students are ahead of their teachers. they are shaping the digital environment outside of school while teachers struggle to learn and argue over the rules. "i see a huge uptick in rules and regulations on use of the internet at school, and it worries me," said evans. "someone needs to show how these rules are handcuffing the technology. i was in an affluent high school that had a great laptop program, with ninth graders bringing their laptops from home and logging into a wireless network. but the network code changed daily and teachers had to waste enormous amounts of time putting the code into the computer every day. they were forced to do this because the principal wanted to keep kids from accessing the code outside of school.
other schools have filters that cut out all commercial .com sites. "as a result of these rules, the relevance of the school as a learning environment is fading away. there are pockets of great things happening, but in most schools the power of this new technology is being squashed by the principal's fear of the unknown." the second panelist explained how her company is helping social media expand. karon weber recently joined the yahoo! youth and education research group after two decades of developing digital animation software. she said that social media operates at the intersection of people, media, and technology. yahoo! runs three of the biggest social media projects on the web. their site del.icio.us allows users to store the addresses of favorite websites online. this allows users to access these bookmarks from any computer, add new bookmarks from anywhere, and share favorites with others. since everything on del.icio.us is someone's favorite, the site also serves as a user-rated web crawler. the second site, flickr, allows users to share photos on a public or a private website. it has five million registered users, and eight out of ten photos are public and searchable. weber did a search on the day of the panel and found that flickr offered 47,000 free photos of giant pandas. many of the photos are tagged with information such as the latitude and longitude where they were taken, so a teacher who uses a panda photo from flickr can show where the photo was taken on a map. the most social of the projects is yahoo! answers, which allows users to ask and answer any question. people who answer questions are ranked with points as if they were sellers on ebay, with a similar feedback system. this means that people who give good answers gain status and become popular on the site. yahoo! answers has dealt with 65 million questions so far, and each question hatches a story. for example, the question, "are there trips to see giant pandas?" yielded the answer, "i am in chengdu now and will check out conditions at the zoo today and at the base camp research center tomorrow." learners on these sites create and share content as they are consuming it, said weber. handheld technologies are emerging as powerful educational tools for six reasons, said the third and fourth panelists, jennifer s. groff, program manager, and eric rosenbaum, research manager, from the teacher education program at the massachusetts institute of technology (mit). game boy game consoles, cell phones, playstations, tamagotchi toys, palm pilots, and pocket pc devices work for education because they are portable, ubiquitous, have connectivity, offer social interaction, are context-sensitive, and can be customized for the individual user. groff and rosenbaum work with the mit education arcade, which develops educational games. a game called "big fish, little fish" teaches ecology: you can either take on the role of a big fish searching for food, or a school of small fish working together for protection. another game, "sugar and spice," teaches the concepts of microeconomics. and a third game uses pocket pcs linked to a server to teach ecology and evolution, based on the journals charles darwin kept on the behavior of finches. users can breed the birds, feed them pollen, and otherwise interact with the virtual environment. the game runs constantly, so that students can manage their birds and flowers outside of class and analyze the data to present to their teachers.
rosenbaum described "augmented reality games" that connect a pocket pc to a global positioning system (gps) to put a virtual overlay on a physical space. "environmental detectives" presents gamers with a satellite map of the mit campus with dots highlighted. a toxic waste spill has put chemicals in the groundwater. students walk around the campus with their pocket pc devices and conduct virtual interviews with characters who are professors and construction managers to find clues. they can also drill a virtual well to get the components of the spill. students share their data and make inferences to ultimately write a remediation plan. another game, "outbreak@mit," simulates a disease outbreak. students are connected to a network, so if one of them should pick up the "vaccine," it will disappear from the other players' screens. students get "sick" and collaborate as they try to stop the outbreak. julie evans pointed out that teachers will not use games in the classroom unless the games include ways to measure learning outcomes. the inquiry-based learning that takes place in computer games is harder to measure, but it is compelling to children. groff said that children who are used to pursuing answers through games are apt to disengage from the old lecture-based learning models. "when i taught school, i dealt with a girl whose parents fought to get her into an inquiry-based learning model," she said. "i think we're going to see more of that." true to its subject, the nsdl web kids panel bred a lively discussion that continues on nsdl's expert voices blog page at: (http://expertvoices.nsdl.org/webkids/). please add your comments. panel discussion: compadre panel high school physics and math teachers aren't an endangered species yet, but they are threatened. about two-thirds of chemistry and physics teachers in u.s. high schools do not have a major or a certificate in their subjects. only half of math teachers in grades 5 through 12 do. two physics communities have joined to attack the problem with new strategies for sharing information, and they described their efforts at nsdl's annual meeting. the physics teacher education coalition (http://www.ptec.org) supports the professional development of physics and physical sciences teachers. this summer, the ptec website was re-designed to link to the collections of compadre, the nsdl pathways partner for physics and astronomy education. the new site also includes a database of opportunities for professional development. "the professional societies are taking up the slack," said ted hodapp, director of education and outreach for the american physical society, one of the funders. physics teachers often drop out because they feel isolated, said hodapp. the ptec/compadre initiative combats this with geographic tools, including links to professional development workshops around the country and links to regionally available grants. faculty use of digital materials: survey design and preliminary results you might think that college and university faculty who have the scholarly advantage of access to excellent research libraries would use digital collections and services provided by their institutions to prepare lectures rather than just going to google, right? but that isn't always true, reports alan wolf, university of wisconsin, and joshua morrill, morrill solutions research. 
they presented an interim report on a nationwide survey they are conducting to learn more about how digital collections align with the practices of college-level instructors as they prepare for teaching. they conducted a study in 2005-2006 with instructors in 12 focus groups in 8 locations at a variety of types of institutions of higher education to find out how they used digital collections as well as which characteristics of digital collections they found to be helpful. the themes that emerged along with specific responses formed the basis for design of the national survey. early observations: personal definitions of what a digital library is vary widely from "a web page with links" to "a collection of items with metadata and associated services." there is a general feeling that "too much information" is available. instructors are concerned about copyright on digital materials. instructors need specific information and are discriminating about their selections. preservation is a concern. will a resource be available next time it is requested? based on study results and early analysis, there appears to be a desire for low-complexity, high-granularity digital collections coupled with a need for "simple pictures" among college and university faculty. a full report on the survey will be released in 2007. closing panel discussion: sustainability "digital media is now the medium for communication into the millennia," said paul berkman, chair of the nsdl sustainability committee. berkman's group has spent several years exploring strategies that could sustain nsdl projects. he told the group that the nsdl community is like an ecosystem, and that each project should define its niche in that ecosystem including how it might pursue symbiotic relationships with other projects. in the closing session of the annual meeting, three digital library visionaries shared their views and experiences related to sustaining digital library programs and projects. (l-r) paul berkman, chair, nsdl sustainability standing committee, kevin guthrie, ithaka, laura campbell, library of congress, chris greer, national science foundation office of cyberinfrastructure and kaye howe, executive director nsdl core integration. photograph by brad edmondson. kevin guthrie helped develop journal storage (jstor.org), which offers digital storage of the archives of many scholarly journals. jstor was originally funded by the andrew w. mellon foundation and is now supported largely by licensing agreements. guthrie has moved on to become president of ithaka, a not-for-profit organization seeking to accelerate the productive uses of information technologies for the benefit of higher education. jstor was driven by the vision of technology changing the way people use information, said guthrie. one key to its success was transparency: when it made its index available to google, use increased threefold. this was a big step forward but also created problems, he said, because costs were associated with increased use, and someone had to pay. "the costs were not large, but a small number multiplied by a very large number is still a large number," he said. he recommended that nsdl sites carefully understand their costs and consider how they might change as usage increases. he also stressed that a digital library artifact has to have economic value coupled with ongoing access to succeed. ithaka assists new organizations by researching sustainability for their communities and providing strategic information.
"i would encourage all of you to understand where your value really is, and to share your value whenever you can," said guthrie. "this may go against your instincts, but it is important. find out how you add value and stick to doing that." laura campbell, associate librarian for strategic initiatives at the library of congress (loc), remembered the day in 2000 when she found out that senator ted stevens (r-ak) had secured $100 million for the loc to develop digital resources for education. "we were like the dog running after a car, and we caught a bus," she said. campbell has since become responsible for strategic planning at the loc, which includes cooperating with other institutions on the national digital information infrastructure and preservation program (ndiip). the goal of ndiip is to create a "universe of libraries to preserve and protect information over time," said campbell. she imagines a day when all types of educational information will be openly available and provided by a public utility, as sewer and water are provided today. the library of congress focuses on multiple humanities initiatives such as the american memory project, but they see the need to partner with digital library projects in science and mathematics, said campbell. she is required to report back to congress in 2010 with successful outcomes. "you are going to need to do the same thing," she told the audience. "and if we're both going to be telling our stories to congress, we ought to leverage our efforts so we can tell the best story possible." for example, the loc might choose to conserve digital records that tell the story of the birth of the dot-com industry, she said, and many members of the ndsl community might be able to share sources and records that pertain. she also pointed out that conservation of digital records in the film, gaming, and cartooning industries is almost nonexistent because the major players are competing with each other. nsdl and the library of congress could provide a safe neutral ground where industry leaders could talk about their shared need for permanent archives, she said. chris greer is the program director at the national science foundation's office of cyberinfrastructure. greer is responsible for digital data activities. he said that berkman's analogy of an ecosystem is a good one, and he is committed to finding common threads in a chaotic environment where "a flood of digital information and products are both an end and a beginning for multiple groups and organizations." greer put the problem in context by pointing out that a human brain contains about 200 megabytes of information. all of the words ever spoken by human beings would amount to about five exabytes of data, which is 5000 bytes to the sixth power, he said. the total amount of electronic information available will probably exceed 10 exabytes in 2007, and another exabyte is being added every ten months. greer said that only broad partnerships would be able to create a "digital data universe that will be understood both by specialists and non-specialists." coda: the annual zia haikus lee zia, the nsf program officer for nsdl, continued his tradition of closing the annual meeting by sharing haiku he had composed about digital libraries. nsf's nsdl program director, dr. lee l. zia photograph by carol minton morris. "zia's haikus" were delivered with the usual disclaimer that the "views expressed are author's own and are not official nsf policy." 
here are this year's offerings: web two-point-zero; veni, vidi, wiki, blog; two-way trumps one-way! sacrifice control, but gain mind share. through usage, determine value. reuse, repackage, repurpose; it's a mindset. but recognize too! for custom service; privacy as currency? a new world beckons. teachable moments. formal blurs to informal, and then back again. embrace the spectrum from sandbox to reviewed gem, tap the social mind! for more information about nsdl annual meeting presentations please visit the annual meeting web site (http://nsdl.comm.nsdl.org/). a podcast of selected nsdl annual meeting sessions is also available for download. appendix: poster session a marketplace of ideas the day before the opening session, representatives from 56 nsdl projects gathered to mingle and discuss their projects at the marriott hotel across the street. these are a few examples from the lively and well-attended poster session exchange: the ben collaborative (http://www.bioscied.net) was awarded a four-year, $2.8 million nsf grant last year that will allow it to increase its collection to more than 27,000 papers, images, lab exercises, and other materials. founded in 1999 by aaas and a few other groups as a portal for teachers in the biological sciences, ben now includes 25 collaborators. linda akli, the aaas liaison to ben, reports that the site's home page will have a new design that includes an improved search engine and separate options for k-12 and college teachers. the nsdl registry (http://metadataregistry.org) provides support for human and machine use and re-use of metadata vocabularies, connecting vocabulary owners with vocabulary users in a dynamic feedback loop. developers diane hillmann, cornell university library, stuart sutton, university of washington, and jon phipps, cornell university library, were on hand to encourage nsdl projects to register their vocabularies, or to use the "sandbox" to test drive the service. the macaulay library (http://www.birds.cornell.edu/macaulaylibrary) is working with the board of cooperative educational services (boces) in wayne county, new york to integrate animal sounds and video into classroom materials that will be distributed to upstate teachers. macaulay, a section of cornell university's lab of ornithology, is the world's largest archive of these materials. its website allows the public to access the entire collection online. colleen mclinn, content information specialist, joined macaulay earlier this year and is leading the classroom program. the math gateway (http://mathgateway.maa.org/) is an expansion of an earlier project called mathdl, sponsored by the mathematical association of america. it is a pathway project, which means that it receives nsdl funds to manage web content for a specific field. lang moore, principal investigator (pi) of the project, reports that it features online materials from 16 partners, some of which are still in development. the chemed digital library is getting ready to go live. it will combine materials from three sources: the journal of chemical education's digital library (http://jchemed.chem.wisc.edu), the american chemical society's education division, and chemcollective (http://www.chemcollective.org), a site that offers virtual versions of chemistry labs. the site aims to develop "communities of content" for different educational levels, different branches of chemistry, and different learning styles, reports john moore of chemed. 
three side-by-side nsdl core integration posters demonstrated how nsdl 2.0 tools powered by fedora (http://fedora.info), such as the expert voices (http://expertvoices.nsdl.org) blogging system and the on ramp content and communications system, integrate different kinds of information, from workshop materials to rich media and commentary, into the framework of the library. dean krafft, pi of nsdl core integration at cornell, explained that these and other services allow users to classify, organize, annotate, and create library resources. copyright © 2006 brad edmondson and carol minton morris doi:10.1045/december2006-morris   archives described at collection level d-lib magazine september 2000 volume 6 number 9 issn 1082-9873 meg sweet and david thomas public record office, united kingdom [email protected] [email protected] what makes archives different? at the public record office, the uk national archives, three keywords sum up our role: selection; preservation; access. access applies not only directly to the records in whatever form but also via an intermediary, the finding aid or catalogue. archives have their own very particular features which are explicitly recognised in their description. first and foremost, they are hierarchical: one person's/family's/corporate body's archives are broken down into components which can themselves be further broken down, layer after potential layer, into the smallest describable component. archival description is multi-level. it is also based firmly on the concept of provenance. what is being described has been created and accumulated by an identifiable body (or bodies). the general international standard for archival description, isad(g), published in 1994, with a second edition due out in september 2000, is widely recognised and increasingly adhered to. its popularity stems from its being firmly rooted in tradition and in its being a fairly permissive compilation of best practice. isad(g) incorporates a wide range of data elements for archival description, a small number of which it prescribes as mandatory for international data exchange. it embodies the principle of multi-level description and takes as its underlying premise four rules of multi-level description: go from the general to the specific; provide information relevant to the level of description; link descriptions; avoid redundancy of information. these rules have, to date, served archivists well. traditional use of collection level description unlike books, which are stand-alone products, archival documents can only be understood in the context in which they were created. t.s. eliot's the four quartets by itself is readily accessible to users. a file of miscellaneous correspondence, on the other hand, can only be understood if it is known who wrote it and when it was written. even then, the information needs to be qualified: a treasury file of correspondence would have a different significance if it were produced by that part of the treasury which dealt with public expenditure rather than by the part which dealt with the management of the economy. 
for this reason, archives have traditionally been described in terms of the organisation (usually a public body or private company) that created and accumulated them. the fashion for describing archives has gradually changed over time. the public record office (pro), london, was founded in 1838 and moved into its first building in chancery lane in the early 1850s. at the same time, f.s. thomas produced the first guide to the pro. he used what would now be seen as a curious hybrid system. the records were described by their creating department (largely the medieval courts), but within those courts they were described by subject. so, the records of the exchequer were described at collection level as exchequer records and a short administrative history was given. the records were then described by subject: 'abbeys, accounts, acquittances', etc. this pattern of describing records by collection and then by subject was the norm at the pro for the next 70 years. it was not until 1923 that m.v. giuseppi produced what was the first modern guide to the public record office. this gave the administrative history of each department and then went on to describe the divisions into which it was organised and then individual series of documents, ranging from medieval rolls to modern files. for the exchequer, there was a detailed administrative history that focused on the records it generated, followed by an administrative history of each of the exchequer's divisions. finally, all the exchequer's file series were described. giuseppi's model continued up to the 1960s, when the last printed guide to the pro was published. the great value of printed, high-level guides to holdings was, before the advent of the world wide web, their provision of remote access (however limited) to the holdings of record offices. by the 1960s, however, the volume of modern records flowing into the pro was so large that conventional print media could not keep up. the new-style current guide of the 1970s onwards was produced annually, but only for a strictly limited circulation at the pro and a couple of other london institutions. a microfiche version was published at irregular intervals. for its first 150 years, the pro, like most archives, had a dual approach to describing its holdings: like most other archives, it had separate, paper-based systems for providing access to collection-level descriptions and to individual items. why go beyond collection level description? although collection level descriptions are of enormous value to some researchers, particularly those who are conducting research into the history of individuals or of institutions, they are of limited value on their own to the broad range of researchers. this is because what researchers wish to see are individual files, and it is only possible to identify detailed files from multi-level catalogues. some academic and other researchers would be interested to learn from the pro's collection level catalogue entry that the pro has the records of service of officers in world war one; the pro's series level scope and content note indicates that this series (wo 339) contains records and correspondence for regular army and emergency reserve officers who served in the first world war. the content of the files varies enormously, from a note supplying date of death, to a file of several parts containing attestation papers, record of service, personal correspondence and various other information. 
records of british reserve officers who were commissioned into the indian army were originally held separately, but later added to this series. for the majority of the series there is no correspondence date range, and the nominal description has been abbreviated to surname and initial. however, our experience is that most of our family historian users are interested in the records of individual officers, whether it is because they are tracing their ancestors or pursuing an interest in world war one poets. consequently, they are far better served by having access both to the series entry and to an online list of all those officers whose records survive. similar problems are posed by very large series. for example, the main series of home office papers at the pro has 26,000 files. the series level scope and content note is relatively comprehensive: this is the main series of home office papers. the subject matter of the files reflects the diversity of domestic matters dealt with by the home office. these have included aliens, betting and gaming, borstals, building societies, burials and cremations, bye-laws, changes of name, the channel islands, charities, children, civil juries, drugs, ecclesiastical matters, elections, explosives, extradition, factories, fire services, firearms, honours, ireland and subsequently northern ireland, the isle of man, magistrates, markets and fairs, lords lieutenant, mental patients and criminal lunatics, naturalization, pardons, petitions of right, poisons, police, prisons, prostitution, public order, use of royal title by institutions and companies, universities, vivisection, wartime measures, warrants and wild birds. further home office papers are in a supplementary series, ho 144: these are files on criminal and certain other subjects, separated from the main file series in ho 45 because of their sensitivity at the time of transfer. both series are described clearly and accurately in conformance to standards and would be of great value to any researcher who is interested in broad issues of policy. they are of less use to people interested in particular individuals: there is nothing to tell the researcher that ho 45 contains the papers relating to the conviction of timothy evans for a murder he did not commit nor that ho 144 contains the home office's files on dr. crippen and the jack the ripper murders in whitechapel. in practice many archive users require clear, accurate and searchable descriptions of individual files (or their equivalents). they then move 'bottom upwards' to see the context in which the documents were created and used. a researcher into the welsh aberfan tip disaster may start with a keyword search for all references to the disaster but will very quickly see the value, if not the necessity, of knowing exactly which of all the bodies connected to the disaster and subsequent inquiry produced the records in question. technology in the late twentieth century made it possible to unite description of the whole with description of the component layers making up that whole. the emergence of the world wide web made remote access to full, up to date catalogues a possibility, and a ready-made means of providing multi-level description in an automated environment was established with the advent of encoded archival description (ead) in the mid 1990s. ead, a document type definition of standard generalized markup language (sgml), was specifically designed for multi-level archival description. 
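to make the multi-level principle concrete, here is a small illustrative sketch in python (not ead/sgml itself, and with invented field values apart from the wo 339 example used above): each level carries only the information relevant to it, descriptions are linked from the specific back to the general, and a 'hit' at any level can be returned together with its ancestor context.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Description:
    reference: str                        # e.g. a department, series or piece reference
    level: str                            # e.g. fonds / series / file / item
    scope: str                            # information relevant to this level only
    parent: Optional["Description"] = None
    children: List["Description"] = field(default_factory=list)

    def add(self, child: "Description") -> "Description":
        # link descriptions: each child knows its parent, each parent its children
        child.parent = self
        self.children.append(child)
        return child

    def context(self) -> str:
        # return this description prefixed by all ancestors, general to specific
        chain, node = [], self
        while node is not None:
            chain.append(f"{node.level} {node.reference}: {node.scope}")
            node = node.parent
        return " > ".join(reversed(chain))

war_office = Description("wo", "fonds", "records of the war office")   # invented fonds-level entry
wo339 = war_office.add(Description("wo 339", "series", "officers' services, first world war"))
print(wo339.context())   # fonds wo: ... > series wo 339: officers' services, first world war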
largely because it maps so well to isad(g), there has been a large international take-up of ead. many institutions have mapped existing finding aids to ead; many have developed ead templates for new cataloguing. even where database solutions have been adopted for archival catalogues, a requirement has often been that the system should be capable of ead import and export. in a very short space of time ead has taken on the role of chief means of data exchange for archival descriptions. it has proven popular because it can provide one-stop access to the description of the whole archive together with its parts. why stay with collection level description? real life for all too many archival institutions consists of running to try to stay still. too few resources, in terms of money and skilled staff, combine with increased pressure for service. very often the immediate has to take precedence over the important or, at the very least, the demands of the immediate preclude a fully worked-through strategy of balancing priorities. cataloguing backlogs build up. in almost every case, a decent collection level description of each of an institution's holdings is prepared as a necessary control in the accessioning process. very often, though, more detailed cataloguing has to take its turn in the queue. priority there may be determined strictly chronologically, or may be decided by such factors as importance, user demand or the particular staff skills available. even when the whole archive has been fully catalogued, there may be great dissatisfaction with the results. with legacy finding aids, the standard of archival description may be judged too poor or too idiosyncratic for a modern global audience. too much may have been dependent on the human eye's interpretation of layout on a printed page or on an understanding of particular typing conventions. data elements may have been mixed together to form one mass of 'description', or levels jumbled in a way that a machine will all too quickly expose but which a person browsing through page after page may not pick up. when thinking in terms of retroconversion from paper-based finding aids to electronic form, and especially for presentation on the web, difficult choices about priorities sometimes have to be made. one such choice can be between depth and breadth: whether in the short term it is more beneficial to end users to have multi-level catalogues of some archives available remotely, or whether it is more useful to view across-the-board collection level descriptions that indicate the locations of the archives (together with more detailed listings when these exist). for one or more of the reasons noted above, a significant number of individual record offices, or of bodies working together to form a network, have decided to focus, at least in the short term, on providing collection level descriptions of their holdings on the web. a further consideration may be the applicability of isad(g) to cataloguing in an automated environment. isad(g) emerged from the world of paper-based finding aids. avoiding redundancy of information, avoiding repetition of information from one level to another, made total sense. but what of the world of automation? what of the possibility of an isolated 'hit' in response to a particular search? how does the end user make sense of the 'hit' without adequate contextual information being returned as part of, or with, the hit? 
do we need brand new rules of archival description for an automated environment, with a global, and largely unknown, audience? is it safer for now to stick with collection level description for any context other than the local one, where human intervention is to hand? why multi-level description? we may well have to rethink or readjust some of the standards we follow as we take on fully the promises and challenges of online delivery of our archival catalogues. what we do know is that existing users want full catalogues on the web and that new users (previously hampered by geography, limited mobility, lack of resources or simply unaware of the possibilities offered by archival research) will benefit most from remote access to the whole together with its parts. copyright © 2000 meg sweet and david thomas doi: 10.1045/september2000-sweet   digital library research in action: supporting information retrieval in sowiport d-lib magazine march/april 2015 volume 21, number 3/4 daniel hienert, frank sawitzki and philipp mayr gesis, leibniz institute for the social sciences, germany {daniel.hienert, philipp.mayr}@gesis.org doi: 10.1045/march2015-hienert   abstract sowiport is the information portal for the social sciences that contains over 8 million literature references, research projects and full texts from 18 databases, including six english-language databases from proquest/csa. since its foundation seven years ago, sowiport has been re-launched on the basis of the popular discovery framework vufind. in this article we present how heterogeneous databases from different data providers can be integrated to provide the user one point of access to social science information. further, we describe several value-added services that assist the user at different stages of the information seeking process.   1 introduction besides multidisciplinary commercial and non-commercial bibliographic search engines such as google scholar (jacsó, 2009), ms academic search (jacsó, 2011) or base (pieper & summann, 2006), there exist a number of search engines for domain-specific bibliographic information, such as dblp (ley, 2009) for the computer sciences, pubmed (lu, 2011) or gopubmed (doms & schroeder, 2005) for biomedical literature, or the arxiv repository for physics, mathematics and computer science, along with a large number of smaller initiatives with digital libraries (dls), opacs and search engines for domain-specific content. focusing on one domain can improve the search experience by yielding more relevant results due to a limited scope, and most notably, domain knowledge such as taxonomies, thesauri, structures, relationships and information behavior can be utilized to improve search quality and search usability (compare e.g. battelle, 2005, p.274). research information in the social sciences, like in any other domain, is widely distributed across different databases, systems, search tools, digital libraries and information portals. the challenge is to collect these information sources and to make them easily accessible for the end user. 
heterogeneous metadata and data make it difficult to integrate different databases, and often prevent an information search from returning high-quality results. providing different information types (like literature references, research projects, and full texts) in one search makes it even more difficult to integrate and interlink this information. sowiport uses open-source technology, established metadata schemes and terminology mappings between different thesauri to overcome some of the typical retrieval problems. supporting services, as described in the next section, can assist the user at each stage of the information seeking process (for an overview, see kriewel, et al., 2004; mutschke, et al., 2011). in the query formulation phase the user is supported by a search term recommender that combines terms from a domain-specific thesaurus and highly associated terms from a co-word analysis. this way, the user is supported in the (re-)formulation of queries by controlled vocabulary and alternative concepts. at the next stage, in the result list, users can apply re-rankings by citation count or by journal or author productivity, which provide different views of the result set. next, in the document full view, we provide several possibilities to continue the search process based on exploratory search via metadata facets, references, citations and links. in the following section we give a brief overview of the tools we use to build a knowledge basis for further developments.   2 supporting information search in sowiport   2.1 overview the social science information portal sowiport integrates quality research information from national and international data providers and makes it available in one place. sowiport integrates literature references, research projects and full texts. it currently contains about 8 million literature references and research projects from 18 databases, including six english-language databases from proquest/csa which are available under a national license funded by the german research foundation. intelligent technologies such as the automatic mapping of search terms between different thesauri support the user in simultaneous searches across heterogeneous databases. furthermore, sowiport offers value-added services like advanced term recommendations, different novel ranking techniques and exploratory search facilities based on metadata attributes, references, citations and links to support the retrieval process. sowiport began in 2007 and was re-launched in april 2014 on the basis of the vufind framework. the main target groups of sowiport are users interested in the social sciences and adjacent fields. the portal serves about 20,000 unique users per week, mainly from german-speaking countries.   2.2 sowiport architecture figure 1 gives an overview of the sowiport system architecture. the vufind discovery software provides the basis for sowiport. it offers basic search and browsing functionality, including features like faceted search on the basis of apache solr, simple and extended search, search history and favorites, persistent urls, apis like opensearch and oai, and other features that make it an ideal basis for a modern discovery infrastructure. figure 1: overview of the sowiport architecture. the basic vufind system was adapted on the frontend with a modern corporate design and several modifications and extensions. 
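as a rough illustration of the kind of faceted solr request a vufind-based frontend issues under the hood, the following python sketch queries a local solr index with two facet fields. this is illustrative only: the core name "biblio" and field names such as "topic_facet", "format" and "publishDate" are assumptions, not sowiport's actual schema.

import requests

params = {
    "q": "religion",                                  # the user's keyword query
    "fq": "format:Article",                           # a filter query, e.g. from a selected facet
    "facet": "true",
    "facet.field": ["topic_facet", "publishDate"],    # facets rendered next to the result list
    "rows": 10,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/biblio/select", params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("title"), doc.get("publishDate"))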
the backend (and correspondingly the frontend) was extended with several services that support the user in the search process: (1) the term recommender supports the user in choosing appropriate terms for his information need, (2) the hts service expands the query with synonym terms to find results from several heterogeneous databases, (3) the possibility to re-rank the result list allows different views and (4) links based on metadata attributes, references and citations allow browsing between documents and other result sets. in the following sections we explain these supporting services in more detail. table 1 gives a detailed overview of the databases that are currently included. sowiport does not use a federated search approach or harvesting via oai-pmh; instead, most data providers deliver their data by uploading it to an ftp server (because, e.g., they do not offer an oai-pmh interface or the data is not publicly available). therefore, a first step towards integration and integrated access is the conversion from diverse input formats into the one overall metadata schema applied in sowiport. the schema has to be compatible with different information types and many different metadata fields to avoid the loss of essential information. even though standards such as marc, dublin core, mets and mods exist, in practice data providers often use their own proprietary formats or variations of these standards. for sowiport we tried to automate the process of conversion and indexation. to this end, we used a script-based workflow which first copies raw input files from the ftp server to the conversion server. then, for each input format, we defined templates which describe the conversion from the input format to the output format. in a last step, these files are indexed by apache solr into the document index. the overall schema for all input databases is a first step towards letting the user easily search across heterogeneous databases. 
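a heavily simplified sketch of such a template-driven conversion-and-indexing step is shown below. it is illustrative only: the field mappings, provider names, core name and sample record are assumptions, and the real workflow additionally handles xml/csv parsing, many more fields and error handling.

import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/biblio/update?commit=true"  # assumed core name

# one "template" per input format: unified schema field -> provider-specific field
TEMPLATES = {
    "provider_a": {"title": "TI", "author": "AU", "publishDate": "PY", "topic": "KW"},
    "provider_b": {"title": "article_title", "author": "creator", "publishDate": "pubyear", "topic": "keywords"},
}

def convert(raw_record: dict, input_format: str) -> dict:
    """map a raw provider record into the overall metadata schema."""
    template = TEMPLATES[input_format]
    return {target: raw_record[source] for target, source in template.items() if source in raw_record}

def index(documents: list) -> None:
    """post converted documents to the solr document index."""
    response = requests.post(SOLR_UPDATE_URL, json=documents, timeout=30)
    response.raise_for_status()

raw = [{"TI": "trust and social capital", "AU": "doe, j.", "PY": "2013", "KW": "trust"}]
index([convert(record, "provider_a") for record in raw])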
table 1: databases included in sowiport (name; description; update frequency; quantity)
german-language literature references:
gesis-solis; social science literature information system; daily (diff) / monthly (full); 465,094
gesis-bib; gesis library; quarterly; 127,822
usb köln; university library cologne; weekly; 273,707
dipf — fis-bildung; fis education literature; quarterly; 831,294
fes-katalog; library of the friedrich ebert foundation; quarterly; 636,012
iab — litdokab; literature for employment research; monthly; 134,213
dza — gerolit; german centre of gerontology library; monthly; 162,135
dza — solit; literature for social work and social education; quarterly; 203,205
wzb — library catalogue; berlin social science center library; monthly; 106,640*
english-language literature references:
proquest — csa / sa; sociological abstracts; monthly; 1,057,186
proquest — csa / ssa; social services abstracts; monthly; 176,827
proquest — csa / assia; applied social sciences index and abstracts; monthly; 602,269
proquest — csa / pais; public affairs information service; monthly; 1,956,945
proquest — csa / pei; physical education index; monthly; 409,325
proquest — csa / wpsa; worldwide political science abstracts; monthly; 805,274
proquest — pao; periodicals archive online; onetime; 193,334
full texts:
gesis — ssoar; social science open access repository; daily; 31,386
research projects:
gesis — sofis; social science research information system; daily (diff) / monthly (full); 53,156
*available in april 2015
total: 8,225,824
  2.3 integrating heterogeneous databases with cross-concordances another core challenge in integrating heterogeneous databases from different providers in one digital library is to keep precision and recall high when users perform simple keyword searches. sowiport contains databases which use different thesauri to index their literature references. thus, without any system-side support, there is only a small chance that users will get precise results from all databases. a first step was the syntactical integration into one overall metadata schema; the next step is the semantic integration with cross-concordances. in (mayr & petras, 2008) we defined cross-concordances as intellectually (manually) created crosswalks that determine equivalence, hierarchy, and association relations between terms from two controlled vocabularies. for example, the term 'computer' in system a is mapped to the term 'information system' in system b. semantic integration in our definition seeks to connect different information systems through their indexing languages (thesauri, classifications, etc.), ensuring that search over several heterogeneous collections in sowiport can still use the advanced subject access tools provided with the individual databases. cross-concordances can support search in several ways. first and foremost, they enable seamless search in databases with different indexing languages. additionally, cross-concordances serve as tools for vocabulary expansion in general, since they present a vocabulary network of equivalent, broader, narrower and related term relationships. as we have shown in sowiport, this vocabulary network of semantic mappings can also be used for automatic query expansion and reformulation. to search and retrieve terminology data from the database, a web service (called the heterogeneity service, hts) was built to support cross-concordance searches for individual start terms, mapped terms, start and destination vocabularies as well as different types of relations. 
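a minimal sketch of how such exact-match relations can be used for query expansion is given below, with an in-memory mapping table standing in for the hts web service; the mapped terms follow the "marketing" example discussed in the next paragraph and are illustrative, not the full crosswalk data.

# stand-in for the hts cross-concordance data: (source vocabulary, term) -> equivalent terms
CROSSWALK = {
    ("thesoz", "marketing"): ["public relations", "advertising", "social management", "market"],
}

def expand_query(term: str, source_vocabulary: str = "thesoz") -> str:
    """build an or-query from a controlled term and its exact-match equivalents."""
    equivalents = CROSSWALK.get((source_vocabulary, term), [])
    return " OR ".join(f'"{t}"' for t in [term] + equivalents)

print(expand_query("marketing"))
# "marketing" OR "public relations" OR "advertising" OR "social management" OR "market"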
the cross-concordances database contains controlled terms from 25 different thesauri with about 513,000 entries. we use the hts service in sowiport to expand the user's search query with equivalence relations from all cross-concordances where the thesaurus for the social sciences (thesoz) is the source or target vocabulary. figure 2 shows the user interface to the hts integrated in the thesoz browser. the controlled term "marketing" in the example is expanded with terms such as (translated) "public relations", "advertising", "social management" or "market". mayr & petras (2008) evaluated the effect of using these cross-concordances for intra- and interdisciplinary search questions in a controlled information retrieval scenario. the expansion with "exact match" crosswalks shows a very positive effect in terms of retrieval precision and recall, especially for topics which are searched in two databases and are situated in different research domains, like psychology and medicine. figure 2: query expansion in sowiport: the search term marketing is expanded with several equivalents from other thesauri.   2.4 term recommender a qualitative indexing of documents with controlled terms and the integration of heterogeneous databases with cross-concordances improve the search only if the user is aware of the correct vocabulary and uses it to formulate his search query — the so-called vocabulary problem (furnas, et al., 1987). to assist the user in search query formulation and in finding alternative concepts, we have implemented a term recommender in sowiport. similar to solutions on commercial platforms like google, ebay or amazon, it proposes possible keywords while the user enters letters and words into the search form. in addition to simple autocomplete functionality, it utilizes vocabularies from (1) the thesaurus for the social sciences and (2) the search term recommender (str) to provide descriptors for high-precision results. the thesaurus for the social sciences is an instrument to index and retrieve subject-specific information in sowiport. the list of keywords contains about 11,600 entries, of which more than 7,750 are descriptors and about 3,850 are non-descriptors (see zapilko, et al., 2013). topics in all of the social science disciplines are covered. thesaurus terms are linked with each other by semantic relations such as "broader", "narrower" or "related". the search term recommender (lüke, et al., 2012) maps arbitrary input terms to terms of a controlled vocabulary. all documents of a collection are processed by performing a co-occurrence analysis from free terms in titles and abstracts to subject-specific descriptor terms. the logarithmically modified jaccard similarity measure is used to rank term suggestions from the controlled vocabulary. two language-specific services have been created for sowiport. first, for the german language, all documents from the databases solis (literature references) and sofis (research projects) have been processed and mapped to thesoz terms. second, for the english language, all documents from proquest/csa have been processed and mapped to collection-specific indexing terms. to enable dl operators to create their own individual search term recommender based on oai-harvested metadata, we have built a framework (lüke, et al., 2013). figure 3 shows the current implementation of the term recommender. in the upper part it shows autocomplete descriptors from the thesoz that fit the current user input "religion", such as 
(translated) "religious freedom", "religious association", "religious history" and "religious criticism". additionally, underneath each descriptor it shows related terms in a lighter font color, e.g. "criticism" as a broader term for "religious criticism". this can help the user to find broader, narrower and related keywords, to identify the context of descriptors and to get suggestions for further search. in the lower section, beginning from three entered letters, it shows suggestions from the str. here, topically near suggestions for the input term "religion" are shown, e.g. (translated) "religiosity", "islam", "christianity", "church" and "secularization". figure 3: search term recommendations for the entered term religion. thesaurus terms in the upper part [translated: religious freedom, religious association, religious history, religious criticism] and statistically near terms in the lower section [translated: religiosity, islam, christianity, church, secularization]. narrower terms for religion are [translated]: buddhism, christianity, hinduism, judaism and more. in a first evaluation (hienert, et al., 2011) we conducted a user study with over 4,000 unique visitors and four different design approaches. we used three variants with one single vocabulary: (1) user search terms, (2) terms from the heterogeneity terminology service, (3) thesaurus terms and (4) a combined recommender with thesaurus terms and terms from the str. the different recommenders were successively activated in the live system and how often a search term suggestion was selected in relation to the number of conducted search queries was measured. the combined approach performed best with 14% usage, followed by thesaurus terms (9%), user search terms (7%) and terms from the heterogeneity service (3%). in an ongoing evaluation with sowiport users we try to identify how users utilize the term recommender. we especially want to identify patterns of using thesaurus structures within a search session.   2.5 re-ranking techniques after submitting a search query, a result page is presented that lets the user explore document metadata. the vufind framework provides standard functionality like modifying the search query, facets for filtering, saving and exporting individual documents or results sets. the standard ordering of the document list is a tf-idf measure provided by solr. another supporting facility that we integrated is the re-ranking of the documents in the result set to journal/author productivity or citation count. the initial ranking can be re-arranged by metrics derived from the models described below. traditional information retrieval has reached a high level in terms of measures like precision and recall, but scientists and scholars still face challenges present since the early days of dl: mismatches between search terms and indexing terms, overload from result sets that are too large and complex, and the drawbacks of text-based relevance rankings. therefore, we focus on statistical modelling of the evolving science system. such analyses have revealed not only the fundamental laws of bradford (1948) and lotka (1926) (see also garfield, 1980), but also network structures and dynamic mechanisms in scientific production. in our approach these fundamental laws serve as structuring facilities (see e.g. bradfordizing below) which have a strong empirical evidence and utility for the search process, especially for scholarly searchers. 
the overall aim here is to help the user grasp the size and structure of the information space, rather than forcing him to precisely define the search space (see in detail mutschke, et al., 2011). bradfordizing is a simple re-ranking model which we have called "journal productivity" in sowiport. fundamentally, bradford's law states that the literature on any scientific field or subject-specific topic scatters in a typical way: a core or nucleus with the highest concentration of papers — normally situated in a set of a few so-called core journals — is followed by zones with looser concentrations of paper frequencies. the last zone covers the so-called periphery journals, which in the model are located far from the core subject and normally contribute just one or two topically relevant papers in a defined period. bradford's law, as a general law in informetrics, can be successfully applied to most scientific disciplines, and especially in multidisciplinary scenarios (mayr, 2013). bradfordizing, originally described by white (1981), is a simple utilization of bradford's law of scattering which sorts/re-ranks a result set according to the rank a journal gets in a bradford distribution. the journals in a search result are ranked by the frequency of their listing in the result set, i.e. the number of articles in a certain journal. if a search result is "bradfordized", articles from core journals are ranked ahead of those from journals which contain only an average number (zone 2) or just a few articles (zone 3) on a topic. the calculation of the individual document scores of a "bradfordized" article list is explained in (schaer, 2011). in (mutschke, et al., 2011) we showed empirically that different bibliometric-enhanced re-ranking models can be used to improve retrieval quality. bradfordizing as a re-ranking ir service has been evaluated in (mayr, 2013). "author productivity", based on lotka's law, is technically implemented exactly like "journal productivity": the author name with the highest productivity is ranked highest. re-ranking by "citation count" sorts documents according to their citation count in sowiport (see section 2.6). this is a standard sorting option in all systems with citation counts (like e.g. web of science). in sowiport users can re-rank results directly on the result page by choosing one of the techniques from the "sort results by" menu (see figure 4). figure 4: alternative re-ranking techniques: citation count, journal productivity and author productivity.   2.6 metadata elements, references, citations and links for exploratory search typically, the last step in a basic search is choosing a document from the result list and checking its details in the document full view. at this stage we provide several possibilities to continue the search process in an exploratory way. documents within a corpus like sowiport (with its several databases) are not isolated elements, but form a network of interconnected entities. connections between documents, and to other result sets, are given by attributes such as authors, publishers, keywords, journals, subjects, references, citations and many more. metadata elements like author names, keywords or journal titles are marked as links and can easily be used to trigger new searches in sowiport. we also provide links to full texts on the web or to services like google scholar, google books or local copies where available. 
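before turning to these linking features in detail, here is a rough python sketch of the journal-productivity ("bradfordizing") re-ranking described in section 2.5 above: documents are re-ordered by how many articles their journal contributes to the current result set, with ties keeping the original order. this is a simplified illustration; the exact document scoring used in sowiport (see schaer, 2011) is not reproduced here, and the field name "journal" and the sample records are assumptions.

from collections import Counter

def bradfordize(result_set: list) -> list:
    """re-rank a result set so that articles from 'core' journals come first."""
    journal_counts = Counter(doc.get("journal") for doc in result_set if doc.get("journal"))
    ranked = sorted(
        enumerate(result_set),
        # primary key: journal frequency (descending); secondary key: original position
        key=lambda pair: (-journal_counts.get(pair[1].get("journal"), 0), pair[0]),
    )
    return [doc for _, doc in ranked]

docs = [{"title": "a", "journal": "j1"}, {"title": "b", "journal": "j2"}, {"title": "c", "journal": "j2"}]
print([d["title"] for d in bradfordize(docs)])   # ['b', 'c', 'a']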
a special feature here is the linkage of research publications in sowiport with research data in portals that contain this kind of data. publications that cover issues such as the design, methodology, implementation and results of an empirical study, or literature which interprets or discusses this data, are linked to the original data set and its metadata so that researchers have direct access. in the past, these links were added manually to the metadata by domain experts. in an ongoing project we identified workflows and algorithms to extract these links automatically from full texts (boland, et al., 2012). at this time, about 2,706 bibliographic entries in sowiport are linked to research data and can be followed from the full view of a record. references and citations are harder to extract, compute and process, but if they are available, they offer the user the possibility of identifying and browsing citation networks or of re-ranking the result list by citation counts. figure 5 shows the full view of a document in sowiport with the capability to search for authors, journals, classifications and topics, and also to browse to referenced documents or to documents which cite the document. figure 5: full view of a record in sowiport which makes it possible for the user to browse to referenced documents or to records which cite it. smaller data providers for sowiport only very rarely have access to reference information for their collections and can provide it. today, only proquest provides approximately 8 million references for its english-language databases csa-wpsa, csa-sa and csa-ssa. references are directly embedded in the raw input files and contain reference information with metadata fields such as title, author and publication year. for our own german-language full text repository ssoar, we built an experimental workflow which automatically extracts references from pdfs based on the reference extraction tool parscit. in the future we will be able to extract reference information from other full text collections with very different citation styles. to match these references with existing documents in sowiport we apply a multi-stage process (sawitzki, et al., 2013). first, all reference information is stored in a separate reference index based on solr. then, for the matching step, we search the document index for the normalized title and publication year of every reference. if this step fails, a fuzzy search with title, publication year and issn is carried out. if a document is found, its id is added to the reference entry. with this matching process about 30% of the references can be assigned to an existing record in sowiport, which amounts in total to about 2.5 million reference and citation links. the precision of the algorithm was 95% in a sample of 400 randomly chosen documents. in a final step, the reference index can be used to add reference and citation information to the metadata of each record in the document index. metadata entries are then used to build browsable links as shown in figure 5. in addition, computed citation counts can also be used for re-ranking the result set (compare section 2.5, 're-ranking techniques').   3 analyzing user behavior for further developments to build the basis for further value-added services which support the user in his information search, we first have to understand why and how users search in our domain, and especially in sowiport. 
we try to obtain this knowledge through two measures: (1) usability studies and (2) an analysis tool which enables us to explore and examine whole retrieval sessions. with the relaunch of sowiport in 2014 we began a series of task-based usability tests. initial tests made basic usability weaknesses visible which could easily be resolved with minor updates. in future tests with domain professionals we will try to understand the different search tasks that exist and how experts try to solve them (see also the initiative of an online access panel for iir systems in kern, et al., 2014). to get further insight into why and how users are searching in sowiport, we have very recently developed a tool for the analysis of whole user sessions. this framework can easily be integrated into different digital libraries with only a few lines of software code. furthermore, the examination of existing log files is possible if they are transformed into a simple schema. the purpose of this framework is to understand how users behave within and beyond sessions. figure 6 shows a screenshot of the user interface with user session data from august 2014. the dl operator can get an overview of how users behave in a set of search sessions. to this end, the sankey diagram shows which actions are performed at each search step in the session and which actions will most probably follow. in the session list, user sessions can be examined in detail, including several sessions by the same user. the data set can be filtered to specific situations, e.g. to sessions in which users viewed a document for more than thirty seconds. the tool can help to answer questions such as "how has the search process evolved for a certain topic?", "which documents have finally been viewed?", "how has a search process evolved over several sessions?". the session-based analysis can be the basis for a future set of value-added services that allow personalization, recommendation and awareness. for example, we can generate term suggestions based on the personal history of a user, or recommend documents viewed by other users who used the same search query. figure 6: screenshot of the tool for the analysis of user sessions.   4 conclusion sowiport currently supports the user's information seeking process at different stages. social science information from different data providers is syntactically integrated with an overall metadata schema and semantically integrated with cross-concordances. the user is aided by a term recommender in the query formulation and reformulation phases; re-ranking mechanisms offer alternative views on the result set; links based on metadata, references and citations allow exploratory search in the document graph and related result sets. to build the basis for future developments we began a series of usability studies and have created a tool that helps to analyze user behavior based on log data. this will help us not only to identify usability problems, but also to discover at which stage in the information seeking process the user can be further supported.   acknowledgements we thank all our partners, and active and former members of the sowiport team at gesis. we acknowledge the long-term support of the deutsche forschungsgemeinschaft (dfg), which is financing the national licenses (nationallizenzen). the following research grants supported our development of value-added services for sowiport: the hts service was funded by bmbf under grant no. 01c5953; the re-ranking services and the irsa framework were funded by dfg under grants no. 
inst 658/6-1 and su 647/5-2.   references [1] battelle, j. (2005) the search: how google and its rivals rewrote the rules of business and transformed our culture. portfolio. [2] boland, k., ritze, d., eckert, k. & mathiak, b. (2012) identifying references to datasets in publications. in: tpdl. pp.150—161. http://doi.org/10.1007/978-3-642-33290-6_17 [3] bradford, s.c. (1948) documentation. london, lockwood. [4] doms, a. & schroeder, m. (2005) gopubmed: exploring pubmed with the gene ontology. nucleic acids research, 33 (suppl 2), pp.w783—w786. [5] furnas, g.w., landauer, t.k., gomez, l.m. & dumais, s.t. (1987) the vocabulary problem in human-system communication. commun. acm, 30 (11), pp.964—971. http://doi.org/10.1145/32206.32212 [6] garfield, e. (1980) bradford's law and related statistical patterns. current contents, 4 (19), pp.476—483. [7] hienert, d., schaer, p., schaible, j. & mayr, p. (2011) a novel combined term suggestion service for domain-specific digital libraries. in: proceedings of the 15th international conference on theory and practice of digital libraries (tpdl). [8] jacsó, p. (2009) google scholar revisited. online information review, 32 (1), pp.102—114. http://doi.org/10.1108/14684520810866010 [9] jacsó, p. (2011) the pros and cons of microsoft academic search from a bibliometric perspective. online information review, 35 (6), pp.983—997. http://doi.org/10.1108/14684521111210788 [10] kern, d., mutschke, p. & mayr, p. (2014) establishing an online access panel for interactive information retrieval research. in: ieee/acm joint conference on digital libraries. london, uk, ieee, pp.473—474. [11] kriewel, s., klas, c.-p., schaefer, a. & fuhr, n. (2004) daffodil — strategic support for user-oriented access to heterogeneous digital libraries. d-lib magazine, 10 (6). http://doi.org/10.1045/june2004-kriewel [12] ley, m. (2009) dblp: some lessons learned. proceedings of the vldb endowment, 2 (2), pp.1493—1500. [13] lotka, a. (1926) the frequency distribution of scientific productivity. journal of the washington academy of sciences, 16 (12), pp.317—323. [14] lüke, t., schaer, p. & mayr, p. (2013) a framework for specific term recommendation systems. in: proceedings of the 36th international acm sigir conference on research and development in information retrieval. new york, ny, usa, acm, pp.1093—1094. [15] lüke, t., schaer, p. & mayr, p. (2012) improving retrieval results with discipline-specific query expansion. in: p. zaphiris, g. buchanan, e. rasmussen, & f. loizides eds. tpdl. lecture notes in computer science. springer, pp.408—413. [16] lu, z. (2011) pubmed and beyond: a survey of web tools for searching biomedical literature. database, 2011, p.baq036. [17] mayr, p. (2013) relevance distributions across bradford zones: can bradfordizing improve search? in: j. gorraiz, e. schiebel, c. gumpenberger, m. hörlesberger, & h. moed eds. 14th international society of scientometrics and informetrics conference. vienna, austria, pp.1493—1505. [18] mayr, p. & petras, v. (2008) cross-concordances — terminology mapping and its effectiveness for information retrieval: crosskonkordanzen — terminologie mapping und deren effektivität für das information retrieval. in: world library and information congress. québec. [19] mutschke, p., mayr, p., schaer, p. & sure, y. (2011) science models as value-added services for scholarly information systems. scientometrics, 89 (1), pp.349—364. http://doi.org/10.1007/s11192-011-0430-x [20] pieper, d. & summann, f. 
(2006) bielefeld academic search engine (base): an end-user oriented institutional repository search service. library hi tech, 24 (4), pp.614—619. http://hdl.handle.net/10760/9207 [21] sawitzki, f., zens, m. & mayr, p. (2013) references and citations to support information retrieval in sowiport. in: h. c. hobohm ed. 13th international symposium of information science (isi 2013). potsdam, verlag werner hülsbusch, pp.267—272. [22] schaer, p. (2011) using lotkaian informetrics for ranking in digital libraries. in: c. hoare & a. o'riordan eds. proceedings of the asis&t european workshop 2011 (aew 2011). cork, ireland, asis&t. [23] white, h.d. (1981) 'bradfordizing' search output: how it would help online users. online review, 5 (1), pp.47—54. http://doi.org/10.1108/eb024050 [24] zapilko, b., schaible, j., mayr, p. & mathiak, b. (2013) thesoz: a skos representation of the thesaurus for the social sciences. semantic web journal (swj), 4 (3), pp.257—263. http://doi.org/10.3233/sw-2012-0081   about the authors daniel hienert is a postdoctoral researcher at the gesis department "knowledge technologies for the social sciences". he joined gesis in 2007 after his graduation in computer science at the university of koblenz and further studies of italian and business studies at the humboldt university berlin. during his first year at gesis he worked in the project vascoda-tb5, and later in the projects ireon and sowiport. in 2013 he finished his phd on the integration of interactive visualizations in the search and linking process of heterogeneous information on the web. from 10/2013 to 03/2014 he was acting lead of the team gesis architecture. his research interests are information systems, information visualization and information retrieval.   frank sawitzki graduated in 2007 in computer science at the university of koblenz-landau. his diploma thesis dealt with data visualization software under ergonomic aspects. after his studies he was employed as a software developer in the field of web development and network communication. since august 2007 he has been with gesis as a research assistant in the department "knowledge technologies for the social sciences". since 2014 he has worked as a software developer for an insurance company.   philipp mayr is a postdoctoral researcher and team lead at the gesis department "knowledge technologies for the social sciences" (wts). since the winter semester of 2012, he has taught as a senior lecturer at cologne university of applied sciences, faculty 03 of information science. from october 2009 until august 2011, he was a visiting professor for knowledge representation at the university of applied sciences in darmstadt, department of information science & engineering. philipp mayr is a graduate of the berlin school of library and information science at the humboldt university berlin, where he finished his doctoral research in 2009. he studied lis, computer science and sociology. since november 2004 he has been working in the internationally recognized projects "competence center modeling and treatment of heterogeneity" (komohe) and "value-added services for information retrieval" (irm) as researcher and principal investigator. philipp mayr has published in the areas of informetrics, information retrieval and digital libraries and is a member of the networked knowledge organization systems/services (nkos) network. he is a member of the editorial board of the journals scientometrics and information wissenschaft & praxis. 
he serves frequently as a reviewer for various journals and international programme committees.   copyright © 2015 daniel hienert, frank sawitzki and philipp mayr   jcdl workshop report: studying digital library users in the wild d-lib magazine july/august 2005 volume 11 number 7/8 issn 1082-9873 michael khoo dlese program center, university corp. for atmospheric research, boulder, colorado, usa david ribes sociology and science studies, university of california-san diego, san diego, california, usa introduction as digital libraries (dls) continue to mature into complex networked information and knowledge tools, and become accessible to ever more diverse groups of users, library implementation and use are emerging as important research problems. how might we understand the ways in which the current generation of dls have been designed and are used in real-life contexts today, and how might we apply our understandings to future generations of dl design? workshop theme and structure the jcdl workshop "studying digital libraries in the wild" was convened with the intent of supporting constructive dialogue amongst dl researchers involved in the qualitative analysis of dl use. the workshop call for participation resulted in the submission of a number of high-quality position papers that addressed a range of theoretical and methodological issues, and the papers were all highly relevant to the qualitative, naturalistic, and longitudinal study of dls 'in the wild.' a total of eleven position papers were accepted, and these are available in this issue of d-lib magazine along with this workshop report. workshop participants were asked to treat dls as sociotechnical artifacts (bishop et al., 2003). these dls represent complex mixtures of people, practices and technologies whose design, implementation and use bring together a wide range of sponsors, developers, and users, in a range of institutional and social contexts (bijker, 1995). the development of sociotechnical artifacts can be unpredictably influenced by social and technological contingencies (bijker et al., 1987). if dl development is affected in the same way, then the identification and analysis of the dynamics of these contingencies and of how they affect the behaviors of dl users 'in the wild' (hutchins, 1995) becomes a crucial component of overall dl design. the workshop consisted of a series of panels and short presentations that focused on research methodologies addressing the issue of sociotechnical complexity. in talks reflecting their position papers, participants described 'works in progress' and outlined case studies – the latter in particular providing excellent fodder for stimulating discussions regarding theoretical and methodological approaches, practical issues such as relationships with research subjects and sponsors, and general 'tales of the field' (van maanen, 1988). common themes and questions about qualitative user evaluation emerged over the period of the workshop, including: flexibility of approach; archiving and sharing data; communication and intervention. flexibility of approach perhaps the most obvious theme of the workshops was the diverse range of research methods that participants had applied in their own work. the main methods introduced are summarized here. 
(the brief sketches in this report are, of course, incomplete descriptions of each participant's research; see the archived position papers for more detailed accounts). adams, discussing tool implementation, described the use of grounded theory in studies of user authentication and privacy, and the introduction of dl technologies into clinical settings. she described how grounded theory had identified a number of unintended social consequences following the introduction of these technologies, especially in the social and institutional orders of the clinical settings. notess, who also focused on the process of inductive model generation, provided a detailed description of the application of contextual design methods to the study of music dls, and emphasized the need for such studies to become both more widespread and also more systematic within the dl community (for instance in the form of data and model sharing). the papers by buchanan et al. and hinze (both presented by buchanan) discussed methodological and practical issues (as well as the unexpected findings) that can arise when human-computer interaction (hci) research analyzes the fine grain of everyday user interactions. these researchers have analyzed news alert services and the search needs of humanities scholars, and in the process, they have questioned some common hci assumptions regarding the ability of hci methods to uncover what the user 'really' knows about, or wants from, a particular technology. special care is needed here both in framing the research question(s) and in applying these to the subjects of the evaluation. wolf, describing the early stages of research with several educational dls, noted a particular problem with user testing and evaluation that we, as researchers, sometimes forget: many research subjects may not even have considered dls as technologies. speaking directly to ethnographic experience, cunningham illustrated the richness and complexity of real world practices, describing how ethnographic observations conducted outside the laboratory (for instance, in music stores and people's homes) revealed how people used the containers of digital media (cd cases) to personalize their social spaces and also to communicate with one another. weedman provided a wide-ranging review of a number of approaches that she has brought to bear on her field data, including the sociology of science, the sociology of art, sts, practice theory, and design theory, and she described how these have been applied to a series of different field sites, including image digitization projects and metadata design. ribes, drawing on actor network theory, described the complex sociotechnical ecology of ethnographic study itself, recounting several reflexive moments from his own ethnographic observations in a number of field sites when he realized that his presence within those sites had triggered a series of unintended social consequences. the conclusion to be drawn here is that researchers cannot avoid having an influence on their study sites. a number of the workshop participants described the use of computational tools to archive and analyze data. coding tools such as nvivo (http://www.qsrinternational.com) and atlas.ti (http://www.atlasti.de) were popular choices for the inductive coding and analysis of texts (such as transcripts of conversations, e-mail, threaded discussion, documents, etc.).
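as a minimal illustration of the kind of coding-and-counting workflow such tools support (a hypothetical python sketch; the code book and transcript segments are invented, and neither nvivo nor atlas.ti exposes this interface):

from collections import Counter

# hypothetical code book, built up inductively while reading transcripts
code_book = {
    "workaround":       ["instead we just", "we ended up using"],
    "trust":            ["not sure i believe", "can we rely on"],
    "local_adaptation": ["we changed it so", "our own version"],
}

# invented interview fragments standing in for real field data
segments = [
    "instead we just email the files around, the upload kept failing",
    "i'm not sure i believe the metadata is always current",
    "we changed it so the grade levels match our own curriculum",
]

def assign_codes(segment, code_book):
    """return the set of codes whose indicator phrases occur in a segment."""
    text = segment.lower()
    return {code for code, phrases in code_book.items()
            if any(phrase in text for phrase in phrases)}

# tally how often each code was applied across the corpus
counts = Counter(code for seg in segments for code in assign_codes(seg, code_book))
print(counts.most_common())

real coding tools of course add far more (memoing, code hierarchies, inter-coder comparison), but the basic cycle of assigning and tallying codes is the same.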
several researchers also described how they used tools to parse (as well as to code) texts. shumar argued for considering online technology users as being bound together in symbolic communities, constructed and mediated through discourse, an analysis that opens up dl research to a range of anthropological theory; he studies the boundaries of such communities with netscan, a tool that analyzes discussion threads and spaces. khoo described the use of computational text analysis tools to identify the presence of regular patterns in spoken and written discourse, patterns that could point to the existence of differences in tacit understanding amongst different dl groups such as developers and users. notess (see above) shared how he has used contextual design to generate models and diagrams of user behaviors that can then be used to inform design and also be represented back to users. in discussion, the participants moved freely between their own theoretical and methodological 'home territory' and that of other participants, and they were quick to point out that each method had advantages and disadvantages; no one method is perfect for all locales and research questions. as discussions about the juxtaposition of methods progressed, some common issues surfaced, including questions of scale, resolution and granularity, and the need to balance (where appropriate) macro and micro, and inductive and deductive, approaches. here, using pragmatic 'tool box' approaches that combined elements of different methods was often seen as an appropriate response to the granularity question – indeed, methods such as contextual design can be seen as attempts to formalize the tool box approach – although it was also noted that the use of individualized suites of methodologies could reduce the ability to generalize research findings and share data. another significant issue that emerged during the discussion was the importance of the notion of 'practice' to many theoretical and methodological approaches (for instance, in the form of practice theory, communities of practice, etc.). an interest in practice and the hands-on activities and know-how of actors in particular circumstances often transcended (or perhaps united) disparate theoretical and methodological frameworks, and permitted an understanding of behaviors reflecting 'pre-digitalization' attitudes to new technologies, local adaptations of technologies, work-arounds, rejections of technologies, and so on. the notion of a 'community of practice' was mentioned in several presentations and discussions, pointing to the consideration of collectives of actors as defined by the things they do rather than by their 'official' organizational positions.

archiving and sharing data

as was mentioned in the previous section, qualitative data analysis is increasingly supported by software tools such as nvivo and atlas.ti, which allow for the inductive creation of codes by which data may be marked up. these methods augment traditional qualitative data methods with digital technologies, and it was noted that this digital functionality is reproducing, in the context of dl qualitative research, the more general concern for data sharing and interoperability in scientific circles raised by the atkins report (atkins, 2003). that is, while dl researchers may be collecting, archiving, and analyzing qualitative data in ever-increasing quantities, little is being done to enable data-sharing across research communities.
this prompts the question of how we may archive and exchange our digital data (for example, for historical record, for comparative and meta-analysis, or for future re-use). while software packages such as nvivo and atlas.ti can support collaborative and distributed projects – allowing the merging of coding categories – as it stands, there are no wide-scale efforts for the preservation and exchange of dl qualitative data. further, the nascent state of our endeavors means that many aspects of such work are presently under-thought (issues of data accessibility, data security, and data privacy, for instance, remain to be worked on). in this regard, the cyberinfrastructure initiative has recently turned its attention to the social sciences, and it is hoped that research aimed at developing such tools will be conducted in the future (see, for instance, http://vis.sdsc.edu/sbe/reports/sbe-cise-final.pdf).

communication and intervention

while qualitative dl research can provide rich descriptions of cultural practices and uses of technical artifacts, the complexity of these findings can make communicating them to diverse audiences such as dl developers, policy makers and end-users difficult. additional framing is often necessary in order to present results. two lines of discussion emerged regarding the question of communication: first, developing strategies for communicating with those not familiar with qualitative research, including the possibility of novel representational forms; and second, establishing forums to support communication amongst qualitative dl researchers, and between researchers and the outside world. with regard to developing communication strategies, qualitative research needs to emulate quantitative research, which has traditionally used visual expression, such as graphs or charts. visual representations have the advantage of quickly communicating large amounts of data in synoptic fashion, and a number of visualizations have become standardized across fields, facilitating inter-disciplinary decoding (e.g., a bar chart). there is room to develop similar representations for sociotechnical research. here, researchers should not be limited to traditional representational mechanisms, but should also consider such forms as concept maps, social networks, organizational charts, etc. with regard to establishing communication forums, there was a general feeling that qualitative dl research is often seen as peripheral by the wider dl research community. to strengthen the position of such research, there is a need for a digital space for the archiving of qualitative data and for the discussion of methodological and theoretical issues and case studies. such a space would support the development of 'toolbox' approaches to dl research, permitting comparison of research sites and findings, and supporting the emergence of more general theory and understanding, as well as presenting rich theoretical and methodological accounts, backed up by 'thick' case studies. such a forum need not necessarily be limited to dl research, but could include examples of research into other information, communication, and knowledge technologies. it could also pay attention and make contributions to sociotechnical and sts studies outside of dl research. following on from the workshop, several initiatives will be undertaken to establish such a space and support the qualitative dl research community. the first step is the archiving of this report and the workshop position papers here in d-lib magazine.
we will also mirror these documents on a plone server at the digital library for earth system education (dlese: http://www.dlese.org/cms/qdl/). this mirror is expandable and editable, and researchers will be able to upload and archive papers, workshop reports, etc. in the future. the plone site at dlese will function as a repository for qualitative dl research 'gray literature,' and it is hoped that this function will prove particularly useful for researchers who do not have access to the resources necessary to generate formal peer-reviewed publications for dl conferences and publications. the plone site at dlese will also have wiki-like capabilities to support the posting of announcements, etc. establishing a qualitative dl research list is also planned, again on a server at dlese, for announcements and the archiving of relevant documents. finally, a number of participants expressed enthusiasm for holding another workshop on user evaluation. the upcoming european conference on digital libraries (ecdl 2006) is a possible venue, which would afford european colleagues greater opportunities to attend and present their research. a 'traveling' qualitative dl workshop is also a possibility.

conclusion

the workshop position papers and discussions spanned a wide range of evaluation approaches. rather than pointing up any weakness in the sociotechnical approach, this variety of approaches underlined the complexity of the research question(s) being addressed. the workshop also illustrated the creative possibilities inherent in dialogs between those using these approaches, as was evidenced in the lively exchanges that occurred during the various presentations. judging by the enthusiasm of the attendees, the outlook for qualitative dl research is favorable. hopefully, as dl research in general matures and shifts from technological development to deployment and use, dl stakeholders will become progressively more amenable to complex and situated analyses of use. dls are community efforts, involving an enormous diversity of domains, as well as traditions of knowledge acquisition and management, and in the face of this diversity, drawing on a full range of qualitative research approaches seems necessary and productive.

workshop position papers

adams, anne. "grounded theory: case studies and methodological issues," (http://www.dlib.org/dlib/july05/khoo/01_adams.pdf).
buchanan, george, ann blandford, jonathan rimmer, claire warwick. "usability challenges in digital libraries for the humanities," (http://www.dlib.org/dlib/july05/khoo/02_buchanan.pdf).
cunningham, sally jo. "people and their music," (http://www.dlib.org/dlib/july05/khoo/03_cunningham.pdf).
gazan, rich. "digital library evaluation: a longer view," (http://www.dlib.org/dlib/july05/khoo/04_gazan.pdf).
hinze, annika. "issues in usability studies for alerting in digital libraries," (http://www.dlib.org/dlib/july05/khoo/05_hinze.pdf).
khoo, michael. "technological frames and language-games: understanding tacit knowing," (http://www.dlib.org/dlib/july05/khoo/06_khoo.pdf).
notess, mark. "using contextual design for digital library field studies," (http://www.dlib.org/dlib/july05/khoo/07_notess.pdf).
ribes, david. "the positions of the social scientist: social and technical acts of intervention," (http://www.dlib.org/dlib/july05/khoo/08_ribes.pdf).
shumar, wesley. "ethnography at the math forum," (http://www.dlib.org/dlib/july05/khoo/09_shumar.pdf).
weedman, judith.
"designers, clients, and knowledge production," (http://www.dlib.org/dlib/july05/khoo/10_weedman.pdf).
wolf, alan j., ellen iverson, flora mcmartin, cathy manduca, glenda morgan. "understanding faculty needs of digital resources and libraries," (http://www.dlib.org/dlib/july05/khoo/11_wolf.pdf).

references

atkins, d. e. c. (2003). revolutionizing science and engineering through cyberinfrastructure: report of the national science foundation blue-ribbon advisory panel on cyberinfrastructure. national science foundation.
bijker, w. (1995). of bicycles, bakelites and bulbs: toward a theory of sociotechnical change. cambridge, ma: the mit press.
bijker, w., t. hughes, and t. pinch (1987). the social construction of technological systems. cambridge, ma: the mit press.
bishop, a., n. van house, and b. buttenfield (2003). dl use: social practice in design and evaluation. cambridge, ma: the mit press.
hutchins, e. (1995). cognition in the wild. cambridge, ma: the mit press.
van maanen, j. (1988). tales of the field: on writing ethnography. chicago: chicago university press.

copyright © 2005 michael khoo and david ribes
doi:10.1045/july2005-khoo

networked knowledge organization systems/services (nkos): ecdl 2008 workshop report

d-lib magazine november/december 2008 volume 14 number 11/12 issn 1082-9873

networked knowledge organization systems/services (nkos) ecdl 2008 workshop report

marianne lykke nielsen, royal school of library and information science, denmark

the 7th european networked knowledge organization systems/services (nkos) workshop, organized by marianne lykke nielsen in collaboration with doug tudhope and traugott koch as co-organizers, took place on september 19, 2008, in århus, denmark as part of the european conference on digital libraries (ecdl) 2008. the full day workshop was attended by 24 people. knowledge organization systems/services (kos), such as classification systems, thesauri, taxonomies, and ontologies, attempt to model the underlying semantic structure of a domain. they serve a variety of functions: a tool for the representation and indexing of information and documents, knowledge-based support for information searchers, semantic road maps to domains and disciplines, a communication tool providing a conceptual framework, and a conceptual basis for knowledge-based systems, e.g., automated classification systems. modern digital information systems afford more options for mapping and presenting alternative orders of information than traditional physical libraries. thus, the challenge is as much intellectual as technical when we want to develop and map knowledge organization systems that are useful and meaningful for end-users operating in complex, interdisciplinary knowledge domains. the nkos workshop addressed some of the challenges involved in leveraging the full potential of kos for advanced digital library applications. the first morning session concerned the use of social tagging to improve online access to art collections. jennifer trant and david bearman, archives & museum informatics, canada, presented results from a large research project investigating how social tagging can contribute to the organization and retrieval of museum objects.
based on results from a study in the steve museum project,1 they concluded that tagging does contribute. tags assigned by users are different from the established public vocabulary used by art professionals, and cannot be mined from other sources: 85% of the user tags are not found in the museum documentation, and 60% do not match established art vocabularies. marianne lykke nielsen, royal school of library and information science, denmark, followed up with a general overview of trends and findings in social tagging research. the aim of the presentation was to open a plenary discussion of social tagging and its relation to kos. the topic of the second morning session was networked kos in general. marjorie m.k. hlava, access innovations / data harmony, usa, reported findings from a study at the u.s. weather channel investigating the 'return on investment' (roi) of using automatic term suggestions in retrieval. the term suggestion feature expanded the original search query with a large set of synonyms, providing an roi in the form of a 10% reduction in search time. magnus pfeffer, university library, university of mannheim, germany, presented a thesaurus-based system for automatic indexing. a study of the indexing quality showed that synonyms are essential for term disambiguation and to avoid assignment of overly broad concepts. in the session of short communications, felix boteram and jessica hubrich, fachhochschule köln, germany, proposed a model for a comprehensive international knowledge organization system offering interoperability between kos and meeting challenges such as language, structural and typological differences. christoph wieser, salzburg research, germany, presented the damped clustering concept method, which uses a thesaurus model to assign index terms automatically to newspaper articles. the method favours the occurrence of narrower terms in the document text. marcia zeng, school of library and information science, kent state university, usa, reported on the us nkos/cendi workshop, held in washington, d.c., september 11, 2008. the first afternoon presentations concerned terminological registries. doug tudhope, glamorgan university, uk, and kora golub, ukoln, uk, reported on the trss terminology registry scoping study, which described the scope and potential use of terminology registries. they analysed requirements for registry services with the overall aim of helping stakeholders understand the need for this component of shared infrastructures and of pointing to future research. marcia zeng continued by putting forward the questions: what do we need to know about terminology resources, do we need metadata descriptions at all, and is there a standardised set of metadata for terminology resources? the last session concerned domain-specific, faceted kos. claudio gnoli, university of pavia, italy, discussed the potential of freely faceted classification for knowledge retrieval and browsing, and he showed specifically such classification's potential for systematic browsing. marianne lykke nielsen presented a faceted indexing approach to marking up semantic units or components of document texts. the method seeks to improve precision in domain-specific information retrieval. findings from a comparative study evaluating retrieval effectiveness showed an average improvement of 25% as measured by ndcg (normalized discounted cumulative gain). the nkos presentations resulted in a wide range of comments and discussion throughout the workshop, which was an important workshop goal.
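for readers unfamiliar with the measure, ndcg at a cut-off rank $p$ is commonly defined as follows (a standard formulation; the report does not state which variant the study used):

$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p},$$

where $rel_i$ is the graded relevance of the result at position $i$ and $\mathrm{IDCG}_p$ is the dcg of the ideal ranking of the same results.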
participants identified various issues and themes for continuing work:
use cases and business models for shared infrastructures and registries
need for evaluations and validation of kos
investigations of the role of folksonomies in retrieval
need for empirical studies of end-user interaction with kos
issues and problems related to vocabulary mapping and interoperability across domain-specific kos
studies of end-user vocabulary vs. professional jargon
use of structure and suggestions in social tagging

further information on the workshop and links to the presentations can be found on the workshop website. nkos is a community of researchers, developers and practitioners seeking to enable kos as networked interactive information services via the internet. this was the seventh european nkos workshop. in line with previous workshops, authors presenting at the workshop are encouraged to submit papers for consideration in future issues of the journal of digital information (jodi) and new review of hypermedia and multimedia (nrhm).

note: 1. the project team responsible for the steve museum deliberately chose to use the lower-case "s" because "steve" 'is neither a person nor an acronym...the team wanted something simple, friendly, and easy to remember.'

copyright © 2008 marianne lykke nielsen
doi:10.1045/november2008-nielsen

meta-design of a community digital library

d-lib magazine may 2002 volume 8 number 5 issn 1082-9873

meta-design of a community digital library

michael wright and mary marlino, digital library for earth system education program center, university corporation for atmospheric research, boulder, colorado
tamara sumner, center for lifelong learning and design, department of computer science, institute of cognitive science, university of colorado at boulder

abstract

the community digital library has emerged as a recent evolution in the development of digital libraries. a community digital library is distinct in having a community of potential users define and guide the development of the library. in this article, we present how the geoscience community has worked to develop the digital library for earth system education (dlese) in the light of recent work in the area of meta-design, the design of the design process. the development of dlese is described using a conceptual framework developed from the analysis of a variety of open source projects.

introduction

community libraries, more than any other civic institution, are the hallmark of an informed society. libraries carry powerful symbolic meaning and inference. a library can serve as the communal repository of knowledge, the town meeting place, a touchstone for intellectual discussion and community activism. a community library is a free and accessible place, hospitable to both the learned scholar and the struggling student. a library has the potential to transform the way a community thinks about itself; it provides the conditions and possibilities for intellectual growth, social mobility, and community development and cohesion. community libraries are evolving with the information age.
the past decade has witnessed the increasing ubiquity of the world wide web in homes and schools, the emergence of new kinds of "electronic communities" (preece 2002), and the widespread creation and distribution of digital educational materials. digital libraries have emerged as a means for disciplinary communities to share, organize, and assess their intellectual holdings. to date, there are many digital library efforts underway aimed at improving k-12 and undergraduate science education (e.g., needs, ilumina, teacherlib, dlese). one prominent example is the nsdl (national science, technology, engineering, and mathematics digital library), a program initiated by the national science foundation (nsf) to integrate multiple digital libraries and to serve the needs of the community of science educators and learners at the broadest level (fulker and janée 2002). over the past three years, the earth system science community has come together to articulate a vision for, and begin construction of, the digital library for earth system education (dlese). dlese is being designed to support the specific educational needs of the geoscience community within this larger national library network. in the tradition of community libraries, the mission of dlese is to fundamentally change the way students learn, instructors teach, and researchers interact, by providing new ways of sharing information, tools, and services. dlese is a grassroots, community-led project that, among a range of services, provides searchable access to high-quality, online educational resources for k-12, undergraduate, and informal earth system science education (marlino, sumner et al. 2001). these resources include objects such as maps, lesson plans, lab exercises, data sets, virtual field trips, and interactive demonstrations. the holdings of dlese are created by a wide variety of individual faculty members, agencies, and institutions. these resources are held (stored) on local servers and are accessed through the library via a database of searchable metadata records that describe them. to date, significant progress has been made on many aspects of the library: the community has been organized, a governance structure has been established, a strategic plan has been developed, a useful collection is available, and a working version of the library is now in use (figure 1). this operational library was released in august 2001, and contains approximately 2000 carefully selected educational resources. version 1.0 provides educational discovery features that enable users to search by grade level, educational resource type, and keyword. this version of the library also contains a web-based resource cataloger enabling community members to contribute resources to the library. additional community oriented services are provided such as discussion forums for working groups, and posting tools for geoscience education opportunities and announcements. future versions of dlese will offer peer-reviewed teaching and learning resources, interfaces and tools to allow exploration of earth data, and services to help users effectively create and use educational resources.   figure 1. version 1.0 of dlese (www.dlese.org) was released in august 2001. it enables educators and students to search and browse for educational resources by grade level, keyword, and educational resource type. every month an educational resource created by a community member is featured.   
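to make the discovery features just described concrete, a small sketch follows (hypothetical python; the record fields and sample resources are invented and do not reflect dlese's actual metadata schema or code): each resource is described by a searchable metadata record, and discovery filters on grade level, resource type, and keyword.

from dataclasses import dataclass, field

@dataclass
class ResourceRecord:
    """simplified stand-in for a catalog metadata record (fields are illustrative)."""
    title: str
    url: str
    grade_levels: set = field(default_factory=set)
    resource_type: str = ""
    keywords: set = field(default_factory=set)

catalog = [
    ResourceRecord("plate tectonics lab", "http://example.edu/tectonics",
                   {"9-12"}, "lab exercise", {"plate tectonics", "earthquakes"}),
    ResourceRecord("virtual field trip: glaciers", "http://example.edu/glaciers",
                   {"6-8", "9-12"}, "virtual field trip", {"glaciers", "climate"}),
]

def discover(catalog, grade_level=None, resource_type=None, keyword=None):
    """filter the catalog on the three axes the version 1.0 interface is described as offering."""
    hits = catalog
    if grade_level:
        hits = [r for r in hits if grade_level in r.grade_levels]
    if resource_type:
        hits = [r for r in hits if r.resource_type == resource_type]
    if keyword:
        hits = [r for r in hits if keyword.lower() in {k.lower() for k in r.keywords}]
    return hits

print([r.title for r in discover(catalog, grade_level="9-12", keyword="glaciers")])

the point of the sketch is only that the resources themselves stay on local servers; what the library holds and searches are the metadata records that describe them.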
designing the design process

the uniqueness of community libraries lies not in whom they are being built for, but in how they are designed, constructed, and managed. a library such as dlese cannot thrive without the distributed effort and energies of a broadly engaged community. our experiences to date demonstrate the benefits of adopting a design process where community-building and technical development are tightly integrated: through workshops, working groups, and funded projects, community members are actively participating in the development of library policies, technologies, collections, and peer reviewing services. dlese's unique approach to distributed library design and construction is informed by the emerging theoretical perspective of meta-design (fischer 2001; fischer 2001). meta-design is a useful perspective for analyzing projects where 'designing the design process' is a first-class activity, i.e., creating the technical and social conditions for broad participation in design activities is as important as creating the artifact itself. meta-design theory draws on existing design traditions such as user-centered design and participatory design. methodologies such as user-centered design (norman 1986) and task-centered design (gould, boies et al. 1991; lewis and rieman 1993) emphasize getting systematic user feedback on the evolving design throughout the design process; however, it is largely designers who generate possible solutions, while users are mainly in a reactive role. participatory design (pd) methods (greenbaum and kyung 1991; schuler and namioka 1993) seek to involve users more deeply in the process as co-designers; i.e., users also propose and generate design alternatives. an important ideal of pd is industrial democracy (ehn 1989), which implies that users should be politically empowered with a broad design remit that includes: co-designing the technology, co-determining the nature of the design process itself (schedules, processes, outcomes), and co-designing the social conditions of technology use (i.e., work practices, organizational structures and responsibilities). meta-design is largely an empirically inspired theory, derived from examining success models of open source projects, where the design and development of complex software systems is carried out by highly distributed, potentially large groups of people communicating and coordinating activities via the internet (raymond 2001). while meta-design shares the industrial democracy ideals of participatory design, the social and technical conditions under which design is performed, as well as the anticipated outcomes, are quite different: design participants are largely volunteers, and the distinctions between user and developer roles are blurred. people step forward to design and develop pieces of interest or importance to them. there can be large numbers of these individual sub-projects proceeding in parallel, in some cases even hundreds or thousands. these diverse activities are coordinated by a smaller team of project leaders, or even a single individual (fielding 1999; nakakoji, yamamoto et al. 2002). participants are distributed, with most project communication and coordination activities taking place over the internet. design outcomes can have an emergent flavor, in that it is difficult to predict who will step forward to do what, or what new lines of inquiry are being pursued in the broader community of participants.
a detailed analytic framework for meta-design based on an analysis of open source projects was recently developed by scharff (scharff 2002). the framework illustrates the collaborative process of developing an object (in scharff's case, a piece of software) where the participants are a group of software developers (figure 2). a core principle of the framework is the intertwining of social processes and technical artifacts. the participants use various computer-mediated communication technologies for social interaction (email, news, web forums) and software development tools that support collaboration and coordination (e.g., cvs, or concurrent versions system, a tool used to manage changes to software code). the participants use these technologies to support their collaborative process to contribute to the development of the final object, or software. we should note this framework also captures the iterative nature of development: participants can contribute to the object creation process after using interim versions, with the state of the design usually accessible by means of a public artifact or standard distribution.

figure 2. analytic framework developed by scharff, derived from examining a variety of successful open source projects.

collaborative library development in dlese

in the previous section, we describe how meta-design theory has been derived from the examination of open source projects. some common traits across the projects are that participants are largely volunteers, distinctions between developer and user roles are blurred, participants are numerous and highly distributed, and various sub-projects can emerge and proceed in parallel, coordinated by a small team or an individual. these traits are also evident in dlese, with its highly distributed community, community members being users and developers, and various dlese sub-projects being developed in the community with policy oversight and coordination through a community-based governance mechanism. this governance mechanism formally adopted open source principles for dlese in december 2000. because of these similarities, among others, we believe that scharff's framework is highly relevant for analyzing the distributed library construction process.

participants

in the case of dlese, the participants are a broad collection of engaged individuals from the geoscience community (researchers, educators, and students), k-12 educators, library science specialists, and digital library and information sciences specialists. the participants are distributed across the usa. these participants can be arranged into three primary groupings: governance, the dlese program center (dpc) and the community. governance includes the steering committee responsible for overall policy guidance for the dlese project, and four standing committees aligned to core facets of the library (services, users, collections and technology). the dpc is responsible for developing and operating the core infrastructure of the library, and provides a coordinating role for the dlese project. this coordinating role is similar to the coordinating roles of teams in other large open source projects (e.g., the apache coordinating team on the apache project (fielding 1999), or squeak central (kaehler 2002; scharff 2002)).
the community includes the individuals and institutions that have an interest in seeing dlese develop, and they are involved through individual action and through structured arrangements such as working groups (reporting to standing committees), discussion groups, and events such as the dlese annual meeting. as noted earlier, individuals contribute as project collaborators or leaders based on their interests and expertise.

collaborative process

the collaborative process in dlese has allowed the community to interact and participate in the design and development of the library. the framework highlights two facilitation mechanisms to support the collaborative process: a social facilitation of participants, and a technical facilitation of communication infrastructure and technical artifacts. the overall process toward developing the library (the object) has involved both these paths. an important aspect of the dlese process has been the social facilitation of participants through face-to-face meetings and workshops, as well as the technical facilitation using mechanisms such as computer-mediated interaction, e.g., email, web-based forums, and document sharing. the social facilitation has been very important to the dlese development in that it helps community members from different backgrounds understand and share a common vision of what dlese should be. in addition, explicit use of user-centered and participatory design methodologies to support the collaboration process has proved to be very useful. this has included the development of use cases derived from envisioned use scenarios written by educators and from interviews with prospective users (davis and dawe 2001; sumner and dawe 2001). as a check on library development progress, formative usability testing is done to ensure that the library development is meeting the goals (as expressed through the use cases), and to provide further contributions to the development. the development of use cases and the testing of the library against them is a level of formalization of the contribution and feedback mechanisms beyond that found in the open source projects analyzed by scharff.

collaborative technology

dlese has taken advantage of various computer-mediated communication mechanisms (email, web forums, document versioning, development and discussion of use cases) and human-mediated mechanisms (face-to-face meetings, telephone conferencing and workshops). for any given component project within dlese, a number of these collaborative technology mechanisms have been used, often starting with work group meetings followed by extensive online discussions around shared documents. we have found that support for document versioning is crucial and needs to be available and accessible to all participants. we have found that the versioning systems commonly used in open source projects (e.g., cvs) do not lend themselves to use by our broader, non-software-developer community; a sketch of the kind of lightweight versioning this calls for appears below.

object produced

the development of dlese is ongoing, but services are already being provided to the geoscience community. services include the core technical mechanisms of collection building and discovery, plus community-building support such as online groups and human interaction through meetings such as the annual meeting.
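the lightweight versioning sketch referred to above might look something like this (hypothetical python only, not dlese's actual tooling; it simply keeps every revision of a shared document together with who/when/why metadata, hiding anything resembling cvs from the contributor):

import datetime

class SharedDocument:
    """keep every revision of a working-group document, with who/when/why metadata."""

    def __init__(self, title, text, author):
        self.title = title
        self.revisions = []
        self.save(text, author, comment="initial draft")

    def save(self, text, author, comment=""):
        # each save appends a new revision rather than overwriting the last one
        self.revisions.append({
            "version": len(self.revisions) + 1,
            "text": text,
            "author": author,
            "comment": comment,
            "saved_at": datetime.datetime.now().isoformat(timespec="seconds"),
        })

    def current(self):
        return self.revisions[-1]

    def history(self):
        return [(r["version"], r["author"], r["comment"]) for r in self.revisions]

doc = SharedDocument("collections policy", "draft scope statement ...", "collections committee")
doc.save("draft scope statement, plus review criteria ...", "crs working group",
         comment="added review criteria from working group discussion")
print(doc.history())

a design of this shape trades cvs's merge machinery for simplicity: non-developer contributors only ever "save a new version with a comment".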
there are projects now under development in the wider dlese community that are building on this core, e.g., the community review system (crs) under development at columbia (kastens 2001) and the community issues and groups area at carleton college (manduca 2002). these projects utilize underlying technical structures developed by the dpc to explicitly support distributed library building. each of these developments has evolved from the early stages of the ongoing dlese collaborative process. the leaders of these developments came forward from the larger dlese community to lead the development of an identified library component and to coordinate their work with other projects. the subsequent development of the components has itself involved a collaborative process, through work with specific working groups and committees (e.g., the crs with the collections committee, the metadata working group and the dpc), so the conceptual framework has applicability at different granularities of our library development process.

conclusion

dlese is a partnership between the national science foundation (nsf), the dlese community, the steering committee, and the dlese program center. the success so far in developing dlese has been based on the effective collaboration of the dlese participants to date. the framework developed by scharff has given us a tool with which to describe dlese's development and to illustrate how important the collaborative process has been to that development. while scharff's work looked at distributed software developers on open source software development projects, we have found that the framework is useful in the context of a broader participant pool such as that found in dlese. one major factor that other digital library developers heading down the community library path should consider is the large amount of work required to support the social facilitation of a diverse set of participants beyond core digital library technologists, so that all participants can share a vision and participate fully in the design and development. in dlese, the structure and use of meetings, committees, governance, and the dpc have provided a framework (the meta-design) to support the distributed library building process.

acknowledgements

this article has benefited from discussions with eric scharff, gerhard fischer and other colleagues at the center for lifelong learning and design. the research described here was partially funded under cooperative agreement #atm-9732665 between the national science foundation (nsf) and the university corporation for atmospheric research (ucar), and grants #9978338 and 0085600. the views expressed herein are those of the author(s) and do not necessarily reflect the views of nsf, its sub-agencies, or ucar.

references

davis, l. and m. dawe (2001). collaborative design with use case scenarios. acm/ieee joint conference on digital libraries (jcdl '01), roanoke, va (june 24-28), pp: 146-147.
ehn, p. (1989). work-oriented design of computer artifacts. stockholm, arbetslivscentrum, 496 pages.
fielding, r. (1999). "shared leadership in the apache project." communications of the acm 42(4): 42-43.
fischer, g. (2001). communities of interest: learning through the interaction of multiple knowledge systems. 24th iris conference, bergen, norway (august), pp: 1-14.
fischer, g. (2001). external and sharable artifacts as sources for social creativity in communities of interest.
fifth international roundtable conference on computational and cognitive models of creative design, heron island, australia (december 9-13), in press.
fulker, d. and g. janée (2002). components of an nsdl architecture: technical scope and functional model. to appear in: proceedings of the acm/ieee joint conference on digital libraries (jcdl '02), portland, oregon (july 14-18), in press.
gould, j. d., s. j. boies, et al. (1991). "making usable, useful, productivity-enhancing computer applications." communications of the acm 34(1): 74-85.
greenbaum, j. and m. kyung (1991). design at work: cooperative design of computer systems. hillsdale, nj, lawrence erlbaum associates, 294 pages.
kaehler, t. (2002). squeak, [http://www.squeak.org]. last access: april 29, 2002.
kastens, k. a. (2001). "how to identify the "best" resources for the reviewed collection of the digital library for earth system education in column "another node on the internet"." computers & geosciences 27(3): 375-378.
lewis, c. h. and j. rieman (1993). task-centered user interface design: a practical guide, [http://home.att.net/~jrieman/jrtcdbk.html].
manduca, c. (2002). dlese community issues and groups, carleton college, [http://www.dlesecommunity.carleton.edu]. last access: april 29, 2002.
marlino, m., t. r. sumner, et al. (2001). "the digital library for earth system education: building community, building the library." communications of the acm special issue on digital libraries (may): 80-81.
nakakoji, k., y. yamamoto, et al. (2002). evolution patterns of open-source software systems and communities. international workshop on principles of software evolution (iwpse 2002), orlando, florida (may 19-20), in press.
norman, d. (1986). user centered system design, new perspectives on human-computer interaction. hillsdale, nj, lawrence erlbaum associates, pp. 31-62.
preece, j. (2002). "special issue: supporting community and building social capital." communications of the acm 45(4): 37-39.
raymond, e. (2001). the cathedral and the bazaar: musings on linux and open source by an accidental revolutionary. sebastopol, ca, o'reilly & associates, 255 pages.
scharff, e. (2002). open source software: a conceptual framework for collaborative artifact and knowledge construction. ph.d. thesis, department of computer science, university of colorado, boulder: 181.
schuler, d. and a. namioka (1993). participatory design: principles and practices. hillsdale, nj, lawrence erlbaum associates, 319 pages.
sumner, t. r. and m. dawe (2001). looking at digital library usability from a reuse perspective. acm/ieee joint conference on digital libraries (jcdl '01), roanoke, virginia (june 24-28), pp: 416-425.

copyright © michael wright, tamara sumner and mary marlino
doi: 10.1045/may2002-wright

context and its role in the digital preservation of cultural objects

d-lib magazine november/december 2012 volume 18, number 11/12

context and its role in the digital preservation of cultural objects

joan e.
beaudoin, wayne state university, joan.beaudoin@wayne.edu
doi:10.1045/november2012-beaudoin1

abstract

in discussions surrounding digital preservation, context — those properties of an object related to its creation and preservation that make the object's origins, composition, and purpose clear — has been identified as a critical aspect of preservation metadata. understanding a cultural object's context, in as much detail as possible, is necessary to the successful future use of that object, regardless of its form. the necessity of capturing data about the creation of digital resources and the technical details of the preservation process has generally been agreed upon. capturing many other contextual aspects — such as utility, history, curation, authenticity — that would certainly contribute to successful retrieval, assessment, management, access, and use of preserved digital content has not been adequately addressed or codified. recording these aspects of contextual information is especially important for physical objects that are digitally preserved, and thereby removed from their original setting. this paper investigates the various discussions in the literature surrounding contextual information, and then presents a framework which makes explicit the various dimensions of context that have been identified as useful for digital preservation efforts, and offers a way to ensure the capture of those aspects of an object's context that are often missed.

introduction

"the context of a digital object to be preserved over time comprises the representation of all known properties associated with it and of all operations that have been carried out on it." (brocks et al., 2009, p. 197)

this paper seeks to examine and clarify contextual information recorded for the preservation of digital cultural objects. an overview of the published literature written on the topic of contextual information recorded for digital preservation is provided here to illustrate the multifarious nature of the topic. the various approaches to the topic of context revealed through the literature are then used to develop a multidimensional framework within which to capture contextual information regarding cultural objects. this framework acknowledges the rich information about context that can be captured to provide more effective means of search, retrieval, examination, use, management, and preservation for cultural objects in a digital form. digital preservation, according to conway (1996), is the "acquisition, organization, and distribution of resources to prevent further deterioration or renew the usability of selected groups of materials." this definition provides an indication of the various efforts involved in preserving digital materials so that they find extended use, but it leaves a key piece of the preservation process unacknowledged. the importance of preserving the descriptive and explanatory information that accompanies digitized materials fails to appear in this definition, except perhaps through intimation. this situation is not surprising given that preserving digital content is the principal goal of digital preservation. the literature surrounding digital preservation reflects this aim, and so it has primarily focused on those technical issues that need to be addressed in order to extend the life of digital materials beyond their period of creation. however, this focus means that the important contextual data concerning digital content generally go unrecognized.
this situation exacerbates the contextual break that occurs in the information available about an item beyond the time of its creation. the further removed the period of creation of an object (digital or otherwise) is from the period of its later examination, the less likely it is that its full significance will be appreciated. knowledge about the context of cultural objects is nearly mandatory for our understanding, use, care, and preservation of them. an acknowledgement of this situation can be seen in the investigations of several researchers who have considered issues of contextual information for digital preservation. many authors have discussed the general problems encountered when there is a lack of contextual information. one of the earliest authors to address this problem in the literature felt that the predominantly technical metadata recorded at the time of a digital object's creation was of limited usefulness since it lacked information concerning the historical context, or broader contextual information beyond that of the current system (duranti, 1995). even at this early date in the discussion of digital preservation, the limitations of information recorded during the digitization phase were recognized. this focus on the technical details has remained a common topic in the literature in the intervening years. chowdhury (2010) noted that the primary topics addressed in the digital preservation literature are those which focus on technological and semantic information surrounding digital content. while technical details are useful in their own right for the preservation record of digital objects, this does little to aid our broader understanding of the item. the difficulties resulting from a restricted view of context in digital preservation metadata appear in more recent discussions of the topic, with several authors expanding the discussion to include very different kinds of metadata (lavoie & gartner, 2005; watry, 2007; lee, 2011). several authors have discussed the need and reason for recording contextual metadata. conway (1996) notes the difficulties encountered with a lack of contextual information for digital materials, stating that this creates a situation where "... we find ourselves confronting a dilemma such as the one faced by howard carson, macaulay's amateur digger [in motel of the mysteries (1979)]: a vast void of knowledge filled by myth and speculation." for conway, preservation is primarily concerned with evidence that is a part of the physical object and the intellectual content represented by it. digital materials for him, since they are divorced from the physical world, are seen as fragile objects in perpetual danger of loss or damage without the information needed to contextualize them. lee (2011) also uses an archaeological analogy in his paper examining the topic of contextual information within digital preservation, noting that the difference between an archaeologist and a looter is that the latter does not record contextual information before removing objects from their find spot. removing an object from its surrounding stratigraphy without recording those details often means that interpretive clues and the object's full significance are lost. while most authors would now recognize that there are multiple levels of contextual information useful to digital preservation, the problem may be the lack of resources available to the task. 
watry (2007), in fact, questions whether sufficient capture and management of contextual metadata are achievable for meeting the needs of the archivist and, i would add, the ultimate users of preserved digital content. owing to the relatively youthful nature of the discipline of digital preservation, with its limited exploration and tentative practices, a marked tendency toward addressing fundamental principles has appeared in the literature. this can be seen in bearman's (2007) discussion of digital preservation where he notes there is little consensus about fundamental issues of what should be saved or how to save it. this idea of worthiness is mirrored by vogt-o'connor (2000) when she suggests criteria to be used in choosing materials for digitization projects. the evaluative questions she asks concerning selection indicate the critical nature of context in the digitization process. she asks "[d]oes the candidate material require substantial research and a sophisticated and expensive context in order to be useful?" (vogt-o'connor, 2000, p. 68). indicating just how critical this information can be for their use, she goes on to state that if context for the materials being digitized cannot be provided, other materials should be chosen. expanding upon these selection rules for the digitization process itself, it seems likely that these criteria should also be employed in decisions concerning digital preservation efforts. one of the most difficult problems encountered in the discussion of context as it relates to digital material is the variable nature of the term. vogt-o'connor used the term in the discussion above to express possible technical limitations of the digital materials themselves (or their systems) which would interfere with the reception of key characteristics of the physical objects. however, the meaning of the term context in the passage above could just as easily be applied to discussions about social, historical, physical, or a whole host of other aspects. it was only through a reading of the text surrounding the above passage that the specific meaning of context was discovered. the text served as the "contextualizer" for the term in this instance. this discussion concerning vogt-o'connor's passage offers a brief, but clear example of how important context is for the reception of information. the problems of context can be exacerbated in the case of non-textual media, such as visual or audio materials, as they often do not include text to provide contextual clues. context is especially important in discussions of digital preservation since in most instances the digital materials have been separated from their original format and context in the processes of digitization and preservation. digital materials pose a "... risk of decontextualization —the possibility that the digital surrogate will become detached from some context that is important to understanding what it is, and will be received and understood in the absence of that context", (unsworth, 2004). in other words, since digital materials are typically not situated within their original context they are prone to being experienced and interpreted in ways that were unintended. while there is value in using materials in decontextualized ways, for example, as a sort of creative springboard, it is critical that the original and intended meaning and/or experience be preserved whenever possible. contextual information surrounding digital content is varied. 
what follows is a discussion of eight major preservation topic areas that were identified during a review of the digital preservation literature that addresses the concept of context.

technological aspect

by far the most thoroughly investigated form of context in the literature surrounding digital preservation is that concerned with technology. as was mentioned earlier, this is hardly surprising given the centrality of this topic to the discipline of digital preservation. issues of hardware and software, emulation and migration, formatting, and translation all fall under this general rubric and are issues that continue to receive much research interest. day (1997) is among the earliest authors to discuss the importance of recording technological context for digital preservation. he suggested that dublin core elements could be used to preserve details (e.g., migration, encoding) about the technical context of digital materials. furthermore, day (1997) suggests that the metadata recorded for each instance would make it possible to discover how to accurately manipulate and display digital materials. discussions of the issues surrounding technical context can be found in the work of levy (1998), bullock (1999), besser (2000), and chen (2001). beyond the technical dependencies of digital materials on hardware and software, these authors address technological issues such as emulation, file formats, migration, storage, obsolete hardware maintenance, compression, and encryption, and how these have important implications for the future reuse of preserved digital content. bullock (1999), levy (1998) and chen (2001) discuss the difficulties facing any preservation effort due to the history of rapid obsolescence and lack of backward compatibility found in the digital arena. chen (2001) suggests there are diametrically opposed needs in the area of digital materials. this is seen in the need to maintain digital materials intact as they were created, while at the same time wanting to use ever more advanced tools and techniques. levy (1998), too, argues that there is a division between the technical requirements of digital preservation and the users of those materials, and so he states that "[t]he challenge ahead is to bring our best technical skills to bear on the problem of digital preservation without losing sight of the ultimate human purposes these efforts serve, purposes which cannot be found within machines" (p. 161). for chen (2001) the disparity between how digital content was created and how it was used represents a major research challenge, as well as requiring increasing amounts of metadata. the importance of metadata to record technical information for digital preservation, mentioned by day (1997) and chen (2001), is more completely addressed by waibel (2003), brocks et al. (2009), and faniel & yakel (2011). waibel (2003) discusses the topic of technical context through three interlocking metadata standards: the open archival information system (oais), the metadata encoding and transmission standard (mets), and the niso data dictionary — technical metadata for digital still images (z39.87). using these, waibel attempts to capture the full spectrum of information surrounding the preservation of digital materials. technical aspects of context were similarly the focus of brocks et al. (2009) in their paper which developed an extended oais model for digital preservation. digital preservation is not just a technical problem, however, as chen and levy observed.
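to make the technological dimension concrete, a minimal sketch of the kind of technical-context record day (1997) had in mind is given below (hypothetical python; the element names loosely echo dublin core, and the object, dates, and migration history are invented rather than drawn from oais, mets, or z39.87):

# illustrative technical-context record for one digitized image (all values invented)
technical_context = {
    "identifier": "example:object/042",
    "format": "image/tiff",
    "created": "2012-03-14",
    "source": "35mm colour slide, scanned at 4000 dpi",
    "migrated_from": "image/x-photo-cd",   # prior encoding, kept for provenance
    "provenance": [
        "2012-03-14 scanned by imaging lab",
        "2015-06-01 migrated from photo cd to uncompressed tiff",
    ],
    "rendering_environment": "any viewer supporting baseline tiff 6.0",
}

# preservation actions append to, rather than overwrite, the record
technical_context["provenance"].append("2020-01-10 fixity check passed (sha-256)")
print(technical_context["provenance"])

the broader point of the literature surveyed above is precisely that a record like this, however complete, still says nothing about who the object was for or how it was meant to be used.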
for digital preservation to be successful additional aspects beyond technical details need to be recorded for digital content. a broadening of the kinds of information to be recorded is evident in the paper by faniel & yakel (2011) where they state that "[c]ontextual metadata hasn't garnered a great deal of attention, but there is an acknowledgement that it is key to long-term renderability and meaningfulness in reuse", (p. 156). these authors go on to state there are currently two separate research camps, that of digital curation and that of reuse, and that these two camps focus on different aspects of preservation metadata. the digital curation camp focuses its attention on metadata for technical aspects in digital preservation, while the reuse camp examines meaning making through metadata. recording multiple kinds of context about digital content is also a topic addressed by mayer & rauber (2009) in their paper which introduces semi-automatic methods to capture information critical to the interpretation, authenticity and use of large data sets. using the dimensions of time, object type, contributors and content these authors examine how contextual information can be detected and extracted from digital objects embedded in an information space. while technical details have been a primary focus of discussions surrounding digital preservation, the future utility of the preserved items is an often identified reason for including contextual data and so this topic is what we turn to next.   utilization aspect context in this case clarifies aspects about who the audience is and what their requirements are when they seek out and use digital materials. the importance of use context is seen in hedstrom's (1998) definition of digital preservation "... as those methods and technologies necessary to ensure digital information of continuing value remains accessible", (p. 190). in order for digital materials to remain accessible, preservation efforts must ensure that the requirements of users, present and future, are met. wallis et al. (2008), in their study of escience data archiving and reuse, discuss how the quality and value of digital content are tied to a user's ability to understand its origins, provenance, and context. particularly important to these researchers was the documentation of decisions on what content was retained and how it had been processed (collected, cleaned, calibrated, reduced, etc.) prior to its original use and deposition in the digital archive. while these researchers examined escience data rather than cultural heritage objects, their study helps point out that digital content may pass through various stages of use and reuse. as circumstances of use have been recognized as crucial to a determination of what is to be preserved, recording contextual information about use would be helpful (levy, 1998). there is, however, some disagreement among researchers about how important users ultimately are in the digital setting and what aspects of use, including the needs of the users themselves and their specific tasks, required tools and social, political and/or organizational settings, should be considered. the degree to which potential users and uses of an object can be judged with any accuracy has been debated by lynch (2002), who states that "... perhaps we should avoid over-emphasizing pre-conceived notions about user communities when creating digital collection[s] at least in part because we are so bad at identifying or predicting these target communities." 
while it may be difficult to predict who the eventual users of digital objects may be, it is fairly clear that the impetus to digitize materials or provide access to born-digital content typically originates with some defined audience in mind. marchionini & maurer (1995) identify three basic types of users of digital materials in an online setting. while specifically written for an audience interested in digital materials for educational purposes, these authors outline the various types of "learning" experienced by users of digital libraries and offer a discussion of the levels of intermediation needed by each. they suggest that the creation of an intellectual infrastructure for the effective use of materials is dependent on the user type (formal, informal, or professional). a categorization of digital content users into types (expert, general, or casual) is also discussed by benoit (2011) in his study of how information systems which contained contextualizing information about items held in it were perceived by various groups. benoit's study is useful to note here since it offers support for the idea that contextual information about use plays an important role in information seeking. users without specialized subject knowledge, those falling in benoit's general user classification, "felt they could pose a broader range of (unusual) questions that are meaningful to their information needs", (p. 144). furthermore, benoit found that the "integration of user context-use data altered expectations of the role of information systems in general", (p. 144). in addition to the benefits suggested for the ultimate end-users of preserved digital content, copeland & barreau (2011) note that user-supplied contextualizing information may assist people in identifying, preserving and sharing their own digital content. aspects of use incorporated into retrieval systems ensure the future value and usefulness of digital materials and so they should be recorded. specific task-based needs of users can be all-important in the use of digital materials, as meyyappan et al. (2001) and mayer & rauber (2009) discuss. digital preservation must also consider the tools and techniques used to support users' analyses. for example, in a scholarly setting, tools to help with interpretive processes, note taking and collaboration have been noted as important aspects of use (palmer, 2002). mayer & rauber (2009) present several use scenarios where automatically generated contextual information is used to assist "in virtually any task where specific digital objects are concerned and where the context is not obvious to the user", (p. 8). while digital materials are dependent on the systems and tools developed for their presentation and usage, they can become separated from their mechanisms of presentation and usage and so some provision must be made to identify how the materials were intended to be used by their primary audience. a critical aspect of use to be discussed in the context of digital preservation is the original setting for the digital materials. social, political and/or organizational contexts have a broad impact upon the use of digital materials and these aspects should also be recorded in the preservation record. as adams & blandford (2004) discovered with their study of digital libraries within a medical setting, the use of digital materials cannot be divorced from a critical analysis of the social and organizational setting within which their users operate. 
these researchers found that inadequate consideration of these aspects can lead to negative perceptions of digital libraries, a lack of knowledge about, abilities with, and awareness of digital libraries, and can contribute to the hoarding of information and technology. as users are so important to the use and reuse of digital materials, aspects concerning the intended use and audience also need to be addressed through the metadata record for digital preservation.   physical aspect many of the difficulties experienced with digital preservation are simply due to the fact that digital materials are decontextualized from their original state in the digitization process. simple characteristics of the original are lost in the creation of a digital surrogate of that work. information about scale, surface, behavior, relationships, arrangement of parts, functionality and so on, is intimately tied to the perception of physical objects. digital materials, while they enable some analyses which are impossible with physical manifestations, provide very weak information concerning tangible aspects. bullock (1999) attributes the emphasis on documentation and description in the digital realm, in part, to the fact that digital objects tend not to carry visible evidence of their creation. clues to information concerning the original objects, such as those found in the materials and techniques used in their creation, tend not to be readily discernible in digital surrogates. while physical aspects are fundamental to the reception of the digital object in its use environment, they also guide preservation decisions. without information concerning the physical nature of the original it is difficult to make informed decisions about which digital items should be selected for preservation efforts. another aspect that has been discussed concerns how user experiences differ between the original and digital versions. as meirelles (2004) points out in her paper on the challenges of presenting artworks in the electronic environment, the way an item is experienced is mediated through hardware and software. visual displays, speakers, system speeds, interface design, mice and other devices used to interact with digital content transform how the original is received. that changes in an item's reception can occur due to hardware and software variations, even with objects created for the electronic environment, speaks to the basic problems inherent in the medium. issues with the physical-digital transformation are apparent in the discussion of decontextualized digital materials by unsworth (2004) and conway (2009). conway (2009) carefully recounts how the digitization of historical photographs "diminishes, masks, or even distorts visual cues that are potentially fundamental to the extraction of meaning", (p. 16). the relationships between representation, replacement, and superiority in the physical-digital transformation are complex and fraught with many challenges. due to these problematic relationships, menne-haritz and brübach (n.d.) feel that through the conversion process critical information about the contextual circumstances of documents/objects is lost, and so "[d]igital imaging is not suitable for permanent storage." these authors suggest that since digital materials are unable to accurately represent analog objects, there is little reason to be concerned with digital preservation.
unsworth's (2004) suggestion that each digital surrogate is "shaped by the perspective from which it was produced", also alludes to the limitations of digital materials to truthfully represent original objects. the result of the analog transition to digital media is multiple and varied versions of a single item. the question of how we choose the one that most closely reflects the original remains unanswered. conway (2009), in his discussion of ways to regulate or lessen the loss of information in the analog to digital transformation, points to the potential usefulness that standardized digitization guidelines and explicit processing statements could provide. a number of the problems experienced in the physical-digital transformation are due to the fact that, unlike physical materials, formats and principles for digital preservation are in the early stages of development. problems associated with the lack of persistency, how digital objects relate to one another, the behavior of digital objects, and so forth, could potentially be resolved in the long-term when fully developed methods and principles are available (besser, 2000). on the other hand, there may be viable reasons to represent materials in their original, historical format. without the ability to provide an object's original access and functionality, the experience of the user-viewer no longer reflects what was intended by the item's creator. in this case, the ability to record what is to be retained, perhaps through a statement of the creator's intentions, is of paramount importance in guiding preservation efforts (lusenet, 2002).   intangible aspect although typically not mentioned outside of discussions of the physical features lost in digitization of items, this dimension of context is concerned with recording those aspects which are the result of the intangible nature of digital materials, and so is an aspect believed to warrant its own entry. this aspect includes qualities such as indistinct object boundaries and impermanent linkages between digital materials. meirelles (2004) notes that interactions, links and connections made between data are important to the way a work is experienced. this suggests that the vague and sometimes shifting nature of digital items, as is discussed by besser (2000), bullock (1999) and lusenet (2002), has a powerful influence on how we receive digital content.   curatorial aspect although this aspect hasn't received much attention in the literature, several authors have mentioned issues surrounding the custodial tradition of the information record of digital materials (gilliand-swetland, 2000; lavoie & dempsey, 2004). this aspect is concerned with the care and protection of digital content, and the preservation of the information surrounding these objects. besser (2000) suggests that digital preservation efforts have been stymied due to the fact that issues of responsibility between librarians and technical staff have yet to be resolved. besser suggests that if neither group claims responsibility for this effort, it is likely that this work will never be carried out in any systematic way. while nesmith (2005) discusses context as it relates to the construction of records within the archive, he suggests that the custodial history, the use of archival materials, and the impact of records across time can be used to "... explain why the records exist, what they might be useful evidence of, and how they have been and might be used", (p. 271). 
thus, in providing information about the custodial history in the preservation record, future users will be privy to the reasons why the digital objects exist and the decisions that were made for their preservation.   authentication aspect authentication context, those issues of digital preservation surrounding evidence and verification, has garnered a great deal of attention in the literature surrounding archival records. hedstrom (1998) notes that the ability to judge and authenticate the integrity of a source is particularly problematic with digital materials since they are so "... easily altered, copied and removed from their original context", (p. 192). gilliland-swetland (2000) also notes the difficulties of amassing evidence with materials that show little chain of custody. one way to authenticate these materials is to "... require archives and libraries to preserve contextual and descriptive information", in addition to the original content (hedstrom, 1998, p. 192). more recently, duranti (2005) states, while writing on the topic of the long-term preservation of digital records, that in order to preserve the authenticity of the records, the identity and integrity of the content must be maintained. she suggests that the identity of digital content can readily be maintained through metadata directly attached to the material being described. integrity, however, presents several challenges. difficulties associated with verifying the integrity of digital content can result from the proprietary nature of specific environments within which the materials reside. according to duranti (2010), one way to alleviate this problematic situation is through the use of open source environments as they are able to satisfy the "legal requirements of objectivity, transparency, verifiability and repeatability for any process that is carried out in a digital environment", (p. 163). mayer & rauber (2009) state that advanced tools, such as automatically generated contextual analyses, can help to eliminate the difficulties encountered in the tasks associated with manually identifying and establishing the provenance of the digital content. although a high level of interest in the authentication of digital content has not been reflected in the literature surrounding cultural materials, archival investigations into issues such as provenance, tracking content changes, integrity, and versioning are likely to be equally applicable in the sphere of cultural heritage.   authorization aspect information concerning the intellectual property rights of original objects and their digital surrogates is another topic that was found in the literature. aspects which fall under this type of context include information concerning rights holder(s), rights management, and allowable legal use. surprisingly, discussions of intellectual property rights are uncommon in the digital preservation literature concerned with cultural heritage. the rights of original content producers are, however, addressed within the cultural heritage community and this topic also appears in studies that examine the importance of documentation of ownership of digital content (ormond-parker & sloggett, 2011). lavoie & dempsey (2004) offer a brief discussion of issues surrounding intellectual property rights in the realm of digital preservation.
these authors suggest that intellectual property rights for digital materials are ambiguous under the current law, and that there are two competing issues at play in the preservation of digital content: the need to intervene to preserve digital materials and the need to protect intellectual property rights. besek (2003) and hirtle et al. (2009) present overviews of the rights, exceptions and responsibilities associated with copyright and digital materials that are generally applicable for cultural objects. digital preservation is an item of concern to copyright holders since its processes require copying, and in some cases migration, of content in ways that change the original digital object. duranti (2010) discusses these issues in the context of digital preservation and states that the intellectual property rights of the copyright holder are coupled with the authentic version of the digital content. transformative migration is particularly important to preservation efforts, according to duranti (2010), as "additions or modifications to an existing work ... can trigger new copyright considerations", (p. 160 n. 3). as digital content has an additional level of complexity concerning intellectual property rights, metadata to record aspects to lessen future challenges would likely be welcomed.   intellectual aspect one category of context which has a strong tradition in the scholarship associated with the cultural heritage community is information surrounding the significance of cultural objects. this category of context includes aspects such as meaning, function, technique, historical importance, narratives and communication of ideas through cultural objects and, by proxy, their digital counterparts. understanding a digital object's original intellectual context is viewed as critical to the reception of a work by a number of authors writing in the service of archives, libraries and information science (bullock, 1999; besser, 2000; lusenet, 2002; lynch, 2002; dalbello, 2004; mayer & rauber, 2009; duff et al., 2011; wisser, 2011). these authors note that basic questions about meaning, function, presentation and orientation can be answered through information recorded to contextualize objects in meaningful ways. while recording this form of information is noted as being critical to the future interpretation and use of preserved digital content, a basic framework to capture contextual information to assist in the future understanding of the intellectual milieu of digital content has yet to be codified and adopted among the cultural heritage community. while there is a deeply rooted tradition of recording information concerning materials in the cultural heritage community, mccarthy (2007) notes that the management of this information has been difficult to put into practice. according to mccarthy, the inadequate preservation of digital content leads to an epistemic failure, a lack of information required for an understanding of the structure and meaning of the metadata. although speaking from a place where contextual information is envisioned more broadly than only that concerned with the intellectual aspects of digital content, mccarthy (2007) directly addresses the critical nature of this information by stating that "the present generation, with its knowledge of the resources, has a clear obligation to preserve that knowledge and pass it on to future curators so informed decisions on future management can be made", (p. 256). 
because it plays such a critical role in future understanding, contextual information surrounding digital content needs to be seen as an integral component and not merely optional data to be captured when time and funding allow for it. the importance of metadata to the future understanding of the intellectual aspects of preserved digital content has been an often-discussed topic in the digital preservation literature. the 2009 draft oais standard, produced by the consultative committee for space data systems, draws attention to the fact that descriptive information about digital content is needed to maximize future use and understanding of preserved objects. descriptive information about the digital content being preserved appears in several critical areas of the oais model, and in fact the model contains an area titled "preservation description information (pdi)", specifically to record information for preservation purposes. the pdi area focuses on "information that will support the trust in, the access to and context of the content information over an indefinite period of time", (ccsds, 2009, p. 4-28). aspects to be included in the pdi consist of information concerning reference resources, context of creation, origins and provenance, data integrity (fixity) and rights. as useful as the oais model is for identifying the kinds of information to be recorded, it is meant to be broadly useful in a variety of settings. thus, its coverage of descriptive information is general in nature and does not include a prescribed metadata schema for capturing this information. a framework for contextual information of a primarily intellectual nature for digital content is presented by lee in his 2011 article titled, "a framework for contextual information in digital collections." contextual information for his purpose falls into three specific areas: the formation of meaning, the situation of the object and the situation of the user. the first form has to do with the formation of meaning via the surrounding environment (e.g., meaning of a word embedded in a passage). the second form has to do with characteristics or conditions surrounding the object (e.g., location, social setting, or placement). the final form has to do with the situation or state of the user which influences interpretation or understanding (e.g., priming, situational relevance). using this as the basis for his later discussion, lee (2011) goes on to develop a framework with nine classes of contextual entities that he believes are particularly useful for capturing information relevant to the intellectual aspects surrounding digital content. these nine classes are identified as object, agent, occurrence, purpose, time, place, form of expression, concept or abstraction and relationship (lee, 2011, table i, p. 106). several current research interests in parallel disciplines could also contribute to the development of a metadata schema to record intellectual context. for example, interest in developing metadata schemas for contextual information about research data sets in the scientific disciplines could be examined for aspects that would also apply to cultural materials. cowan & lillico (2009) present a metadata framework for recording information about research projects where they include the project's title, individuals on the project team, funding organizations, account and file codes, dates, status, summary, publications, files, where data was gathered from, who gathered the data, and when the data was collected (table 6, p. 99-100).
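to ground the pdi categories listed above, the sketch below assembles an illustrative record for a single file covering the five kinds of preservation description information named in the oais model (reference, context, provenance, fixity and rights). the structure and field names are illustrative, not a schema prescribed by oais or by lee's framework, and the fixity value is simply a sha-256 checksum computed with python's standard library.

```python
import hashlib
import json
from datetime import date
from pathlib import Path

def sha256_of(path: Path) -> str:
    """compute a sha-256 checksum to serve as the fixity value for a file."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_pdi(path: Path) -> dict:
    """assemble an illustrative preservation description information record.
    the top-level keys mirror the five pdi categories named in the oais model;
    the values shown are placeholders that a curator would supply."""
    return {
        "reference": {"identifier": path.name},
        "context": {"reason_for_creation": "digitised for a cultural heritage collection"},
        "provenance": {
            "custodian": "example repository",                  # hypothetical custodian
            "events": [{"date": str(date.today()), "event": "ingest"}],
        },
        "fixity": {"algorithm": "sha-256", "value": sha256_of(path)},
        "rights": {"rights_holder": "unknown", "allowable_use": "research only"},
    }

if __name__ == "__main__":
    sample = Path("example.tif")                                # hypothetical file name
    if sample.exists():
        print(json.dumps(build_pdi(sample), indent=2))
    else:
        print("place a file named example.tif alongside this script to see a sample record")
```

recomputing and comparing the fixity value at a later date is one simple way of demonstrating the integrity that duranti and hedstrom identify as central to authenticating preserved content.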
also addressing issues of research data, wallis et al. (2008) discuss how critical the interpretative metadata is to researchers, since they often have little knowledge of who has acted upon the data or what has been done to it. these authors present a nine-stage life-cycle model which identifies the various processes the research data may pass through during each of these stages. these processes provide critical points to be highlighted in the documentation of each stage which would be useful to later understanding of the data. various methods of capturing contextual information are currently available. the most commonly encountered method for representing cultural objects within their intellectual contexts is the human-mediated descriptive account. providing this form of context is an important step which allows future users to experience or understand the item as it was originally intended. richer modes of documentation are available, however, as carrozzino et al. (2011) point out in their article, which examines a 3d virtual interactive platform designed to capture the long-held bronze casting skills, important to the culture and history of lucchesia, italy, that are in danger of being lost. other modes of capturing the intellectual context of digital content are the semi-automatic methods described by mayer & rauber (2009). these authors describe how visualizations of, and interactions with, large bodies of digital content can reduce the manual work involved in traditional methods of capturing intellectual context. although this captured information is limited to the context surrounding digital content rather than analog objects, semi-automatic methods were shown to capture information surrounding when an item was created and the individuals associated with the content. both of these are standard aspects of the intellectual context of cultural materials. while descriptive information about digital content isn't necessarily critical for its use, it does add important details to what has been recognized as an imperfect representation of analog content. the scholarship surrounding the documentation of intellectual context has been strongly influenced by postmodern theory. postmodern theory posits that all acts of description and interpretation are influenced by circumstances surrounding the author and this in turn creates a fragmentary and ever-shifting view of truth. thus, all descriptions and interpretations are limited in their ability to fully explain the truth about cultural objects. if we accept the postmodern stance about the permutable nature of truth, should information about context be recorded at all? a number of scholars suggest there is no such thing as a neutral interpretation of cultural materials, and yet they support efforts to continue recording information about materials (buckland, 1988; lynch, 2002; nesmith, 2005; duff et al., 2011). in fact, nesmith (2005) feels the contextualization of materials is an ongoing process and states that "... more context is always needed if we are to understand what is possible to know", (p. 260). for these authors the act of interpreting an object has value in that it adds an additional layer of information about a work, and interpretations should appear as a part of the work's intellectual record. a related development is the marked focus on the interpretation of materials in the literature surrounding digital libraries and preservation.
among the earliest of these discussions was bénel et al.'s (2001) article on interpretive description, which is based on an idea of truth situated firmly within social, historical, cultural and action-related contexts. according to these authors this approach supports the positive goals of communication, collaborative use of vocabulary and sense-building across a group. this interest in interpretation can also be seen as a call for developing interactions within a digital library setting which present a richer user experience than the typical functions found in current online collections. dalbello's (2004) study of digital libraries is also useful to consider here as she found a preponderance of presentation techniques for materials which offered "disengaged objects in search of narrative coherence", (p. 282). since digital materials are generally presented in systems providing a display-focused experience to the user, dalbello found a lack of comprehensiveness and closure. what was missing from the users' experiences with the digital libraries, according to dalbello, was "contextual processing." similarly, lynch (2002) finds that digitized collections of cultural materials are in need of additional work to package the content in ways that foster users' learning experiences, interpretations and analyses. because of these efforts, lynch (2002) notes that the historically separate roles of librarian, scholar, curator and teacher are blurring alongside the traditional distinctions between libraries, museums and archives. extending this idea a bit further, many authors on this subject note the importance of community interaction with, and interpretation of, digital materials (bénel et al., 2001; lynch, 2002; dalbello, 2004; unsworth, 2004; lagoze et al., 2005). lagoze et al. (2005) sum up the others' ideas, stating "[t]his added value consists of establishing context around those resources, enriching them with new information and relationships that express the usage patterns and knowledge of the library community. the digital library then becomes a context for information collaboration and accumulation — much more than just a place to find information and access it." this idea of accumulating layers of information around digital materials through interactions with and responses to content is one that echoes the words of brown & duguid (1996) in their seminal article, "the social life of documents". cultural materials, like text-based documents, acquire rich intellectual substance over time. unfortunately, unlike text-based conversations which can be traced through citation records, connections between the various intellectual exchanges surrounding cultural materials are more tenuous. this is a critical reason to support the documentation of contextual information, although not the only benefit to be had for the development of a framework to record this information. mccarthy (2007) discusses the various benefits of recording information about digital content and suggests that these include supporting knowledge transfer and decision-making processes, improving transparency (and thus building trust), providing a structured and visible system for knowledge sources, and "vastly improving discovery, accessibility, and comprehensibility of resources", (p. 254). it was with these benefits in mind that the current research was undertaken.
the dimensions of context this examination of the literature was conducted to identify the important dimensions of context and how they apply to the preservation of digital objects, and to aid in the development of a framework for recording contextual information. eight distinct dimensions of context, which make explicit the various forms of context identified as useful to digital preservation in the literature, are presented in table 1 below. each dimension has multiple characteristics which are further developed, along with the framework itself, in the second phase of this work described in a paper also published in d-lib magazine1.
technical: this dimension of context concerns digitization processes and techniques. this includes aspects such as file formats, hardware, software, operating systems, migration, emulation, storage, data loss, encapsulation of technical information, and compatibility.
utilization: this dimension of context speaks to the needs of users. it includes audience needs, task support, tools, accessibility, audience characteristics, and the types of analyses to be supported.
physical: this dimension of context speaks to those characteristics of a work that are dependent on a direct, tangible interaction. this includes features of analog and digital items which are sensory in nature, and so includes all issues relating to the object's physical presence (e.g., scale, materials, texture, arrangement, sound, brightness, smell, etc.).
intangible: this dimension of context concerns the intangible nature of digital materials. this includes qualities such as indistinct object boundaries, impermanent relationships and network linkages between digital items.
curatorial: this dimension of context is related to the standards and guidelines used in the preservation process. this includes facets such as the tradition of stewardship, and preservation purposes and strategies.
authentication: this dimension of context is connected to evidence and verification. this includes the provenance, tracking of content changes, integrity, and versioning that occurs with digital items.
authorization: this dimension of context concerns the intellectual property rights surrounding the original object and its digital surrogate(s). this includes aspects such as rights management, legal usage, and rights holder(s).
intellectual: this dimension of context is concerned with the significance of the original cultural object and, by proxy, its digital surrogate(s). this includes facets such as meaning, function, creative technique, historical import, cultural narratives, knowledge, and the communication of ideas.
table 1: dimensions of context.
conclusion if, as is generally accepted, context is truly an important part of our interaction with, and reception and understanding of, cultural materials, it is remarkable that so little discussion concerning the entire range of contextual metadata to be recorded is found in the digital preservation literature. the original objects, whether digital or physical, are usually witnessed and/or exhibited in a way that offers some contextualization for our reuse and understanding of them. however, when physical objects are digitally preserved, they tend to be divorced from their original setting. de-contextualization is a fairly commonplace situation with cultural objects. a mechanism for capturing context that could be utilized within the preservation process would assist in the re-contextualization of the material for future use.
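to connect table 1 back to practice before closing, the sketch below shows one way a single object's contextual record might be organised around the eight dimensions above. the dimension names come from table 1; the data-class structure, field contents and completeness check are illustrative only and are not the framework developed in the companion d-lib article cited in the notes.

```python
from dataclasses import dataclass, field

# dimension names taken from table 1; everything else here is illustrative.
DIMENSIONS = (
    "technical", "utilization", "physical", "intangible",
    "curatorial", "authentication", "authorization", "intellectual",
)

@dataclass
class ContextualRecord:
    """an illustrative container for contextual metadata about one object,
    organised by the eight dimensions of context in table 1."""
    identifier: str
    context: dict = field(default_factory=dict)   # dimension name -> recorded notes/fields

    def missing_dimensions(self) -> list:
        """report which of the eight dimensions have not yet been documented."""
        return [d for d in DIMENSIONS if not self.context.get(d)]

if __name__ == "__main__":
    record = ContextualRecord(
        identifier="object-0001",                                       # hypothetical
        context={
            "technical": {"file_format": "image/tiff", "capture_device": "flatbed scanner"},
            "physical": {"original_medium": "albumen print", "dimensions_cm": "20 x 25"},
            "authorization": {"rights_holder": "unknown", "allowable_use": "research only"},
        },
    )
    print("still to document:", ", ".join(record.missing_dimensions()))
```

a simple completeness check of this kind makes the gaps in an object's contextual record visible at the point of preservation, rather than at the point of future use when they are hardest to repair.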
although gathering and preserving information to contextualize digital materials requires human effort, this work provides those interpretive narratives that are critical to successful use of materials in digital form. because our world continues to embrace and depend on all things digital, ways to make sense of growing collections of preserved digital content is a difficult challenge that will need to be addressed. without context the potential future usefulness of preserved digital content within the cultural heritage sector is limited. digitization permits individuals to interact with cultural objects in ways that were impossible just a few decades ago. while this is a boon to users, it must be remembered that the stories these objects tell are often impacted by differences between their physical and digital manifestations, and the passage of time between the digital content's creation and its later interpretation and reuse. gaps in our knowledge of a cultural object's important attributes affect our understanding of its significance and its history. the work presented here identifies the various kinds of information that bridge these contextual gaps. future work is planned to test the metadata framework1. it is hoped that through this work methods can be found to support the effective preservation of contextual information surrounding digital materials. if these efforts are successful, our understanding and reuse of these objects and our past will be greatly enriched.   notes 1 for the development of the framework and examples of its application see: joan e. beaudoin. (2012). a framework for contextual metadata used in the digital preservation of cultural objects. d-lib magazine, november 2012, 18(11/12). http://dx.doi.org/10.1045/november2012-beaudoin2   references [1] adams, a. and blandford, a. 2004. the unseen and unacceptable face of digital libraries. international journal of digital libraries, 4, 71-81. [2] bearman, d. 2007. addressing selection and digital preservation as systemic problems. in y. de lusenet and v. wintermans (eds.) preserving the digital heritage: principles and policies, 26-44. (den haag: european commission for preservation and access). [3] bénel, a., egyed-zsigmond, e., prié, y., calabretto, s., mille, a., iacovella, a., and pinon, j.-m. 2001. truth in the digital library: from ontological to hermeneutical systems. lecture notes in computer science, 2163, 366-377. http://dx.doi.org/10.1007/3-540-44796-2_31 [4] benoit, g. 2011. integrating use history as a context for dynamically updated metadata, journal of library metadata, 11(3-4), 129-154. http://dx.doi.org/10.1080/19386389.2011.629958 [5] besek, j. 2003. copyright issues relevant to the creation of a digital archive: a preliminary assessment. (washington, dc: council on library and information resources and the library of congress). [6] besser, h. 2000. digital longevity. in maxine k. sitts (ed.) handbook for digital projects: a management tool for preservation and access, 165-176. [7] brocks, h., kranstedt, a., jäschke, g., and hemmje, m. 2009. modeling context for digital preservation. in e. szczerbicki & n.t. nguyen (eds.) smart information and knowledge management, studies in computational intelligence, 260, 197-226. http://dx.doi.org/10.1007/978-3-642-04584-4_9 [8] brown, j. s., and duguid, p. 1996. the social life of documents. first monday, 1(1). [9] buckland, m. 1988. library services in theory and context, 2nd ed. (new york: pergamon press). [10] bullock, a. 1999. 
preservation of digital information: issues and current status. national library of canada, network notes #60. [11] carrozzino, m., scuccess, a., leonardi, r., evangelista, c., and bergamasco, m. 2011. virtually preserving the intangible heritage of artistic handicraft. journal of cultural heritage, 12(1), 82-87. http://dx.doi.org/10.1016/j.culher.2010.10.002 [12] chen, s. 2001. the paradox of digital preservation. computer, 34(3), 24-28. http://dx.doi.org/10.1109/2.910890 [13] chowdhury, g. 2010. from digital libraries to digital preservation research: the importance of users and context. journal of documentation, 66(2), 207-223. http://dx.doi.org/10.1108/00220411011023625 [14] consultative committee for space data systems (ccsds). 2009. reference model for an open archival information system (oais), draft recommended standard. [15] conway, p. 2009. building meaning in digitized photographs. journal of the chicago colloquium on digital humanities and computer science (jdhcs), 1(1), 1-18. [16] conway, p. 1996. preservation in the digital world. (washington, d.c.: commission on preservation and access). [17] copeland, a. and barreau, d. 2011. helping people to manage and share their digital information: a role for public libraries. library trends, 59(4), 637-649. http://dx.doi.org/10.1353/lib.2011.0016 [18] cowan, r.a., and lillico, m. 2009. increasing the value of university research records by preserving context. in: m. pember & r.a. cowan (eds.) irma information and records management annual 2009. rmaa, st. helens, tasmania, 85-105. [19] dalbello, m. 2004. institutional shaping of cultural memory: digital library as environment for textual transmission. library quarterly, 74(3), 265-298. http://dx.doi.org/10.1086/422774 [20] day, m. 1997. extending metadata for digital preservation. ariadne, 9. [21] duff, w., monks-leeson, e., and galey, a. 2011. contexts built and found: a pilot study on the process of archival meaning-making. archival science, 12(1), 69-92. http://dx.doi.org/10.1007/s10502-011-9145-2 [22] duranti, l. 2010. the long-term preservation of the digital heritage: a case study of universities institutional repositories. italian journal of library and information science, 1(1), 157-168. [23] duranti, l. (ed.). 2005. the long-term preservation of authentic electronic records: findings of the interpares project. (san miniato, italy: achilab). [24] duranti, l. 1995. reliability and authenticity: the concepts and the implications. archivaria, 39, 5-10. [25] faniel, i.m. & yakel, e. 2011. significant properties as contextual metadata. journal of library metadata, 11(3-4), 155-165. http://dx.doi.org/10.1080/19386389.2011.629959 [26] gilliand-swetland, a. 2000. enduring paradigm, new opportunities: the value of the archival perspective in the digital environment. (washington, dc: council on library and information resources and the library of congress). [27] hedstrom, m. 1998. digital preservation: a time bomb for digital libraries. computers and the humanities, 31,189-202. http://dx.doi.org/10.1023/a:1000676723815 [28] hirtle, p., hudson, e., and kenyon, a. 2009. copyright and cultural institutions: guidelines for digitization for u.s. libraries, archives, and museums. (ithaca, ny: cornell university library). [29] lagoze, c., kraft, d., payette, s., and jesuroga, s. 2005. what is a digital library anymore, anyway? d-lib magazine. 11(11). http://dx.doi.org/10.1045/november2005-lagoze [30] lavoie, b. and gartner, r. 2005. technology watch report: preservation metadata. 
oclc; oxford university library services. [31] lavoie, b. and dempsey, l. 2004. thirteen ways of looking at ... digital preservation. d-lib magazine. 10(7/8). http://dx.doi.org/10.1045/july2004-lavoie [32] lee, c. 2011. a framework for contextual information in digital collections. journal of documentation, 67(1), 95-143. http://dx.doi.org/10.1108/00220411111105470 [33] levy, d. m. 1998. heroic measures: reflections on the possibility and purpose of digital preservation. digital libraries, 152-161. http://dx.doi.org/10.1145/276675.276692 [34] lusenet, y. de. 2002. preservation of digital heritage. draft discussion paper prepared for unesco. [35] lynch, c. 2002. digital collections, digital libraries and the digitization of cultural heritage information. first monday, 7(5). [36] marchionini, g. and maurer, h. 1995. the roles of digital libraries in teaching and learning. communications of the association of computing machinery, 38(4), 67-75. http://dx.doi.org/10.1145/205323.205345 [37] mayer, r. & rauber, a. 2009. establishing context of digital objects' creation, content and usage. indp'09, june 19, 2009, austin, tx, usa. [38] mccarthy, g. 2007. finding a future for digital cultural heritage resources using contextual information frameworks. in f. camercon and s. kenderdine, (eds.) theorizing digital cultural heritage: a critical discourse. (cambridge, ma: mit press), 245-260. [39] meirelles, m. i.. (2004). les cd-rom presence: the ephemeral in focus. proceedings of f@imp 2004 — international audiovisual festival on museums and heritage, taipei, taiwan. [40] menne-haritz, a. and brübach, n. n.d. the intrinsic value of archive and library material. [41] meyyappan, n., al-hawamdeh, s. and foo, s. 2001. digital work environment (dwe): using tasks to organize digital resources. lecture notes in computer science, 2163, 239-250. http://dx.doi.org/10.1007/3-540-44796-2_21 [42] nesmith, t. 2005. reopening archives: bringing new contextualities into archival theory and practice. archivaria, 60, 259-274. [43] ormond-parker, l. and sloggett, r. 2011. local archives and community collecting in the digital age. archival science, 12(2), 191-212. http://dx.doi.org/10.1007/s10502-011-9154-1 [44] palmer, c. 2002. thematic research collections. chapter in companion to digital humanities. [45] unsworth, j. 2004. the value of digitization for libraries and humanities scholarship. innodata isogen symposium. [46] vogt-o'connor, d. 2000. selection of materials for scanning. in maxine k. sitts (ed.) handbook for digital projects: a management tool for preservation and access, 45-72. [47] waibel, g. 2003. like russian dolls: nesting standards for digital preservation. rlg diginews, 7(3). [48] wallis, j.c., borgman, c.l., mayernik, m.s., & pepe, a. 2008. moving archival practices upstream: an exploration of the life cycle of ecological sensing data in collaborative field research. international journal of digital curation, 1(3), 114-126. http://dx.doi.org/10.2218/ijdc.v3i1.46 [49] watry, p. 2007. digital preservation theory and application: transcontinental persistent archives testbed activity. international journal of digital curation, 2(2), 41-68. http://dx.doi.org/10.2218/ijdc.v2i2.28 [50] wisser, k.m. 2011. describing entities and identities: the development and structure of encoded archival context—corporate bodies, persons, and families. journal of library metadata, 11(3-4), 166-175. 
http://dx.doi.org/10.1080/19386389.2011.629960   about the author joan beaudoin is an assistant professor in the school of library and information science at wayne state university where she teaches and performs research on metadata, information organization, digital libraries, digital preservation and visual information. prior to her position at wayne state university she was a laura bush 21st century librarian fellow at the school of information science and technology at drexel university. in addition to a doctor of philosophy in information studies at drexel university, she holds a master of science in library and information science degree in the management of digital information from drexel university, a master of arts in art history from temple university, and a bachelor of fine arts in art history from massachusetts college of art.   copyright © 2012 joan e. beaudoin
d-lib magazine november/december 2015 volume 21, number 11/12
collaborative construction of digital cultural heritage: a synthesis of research on online sociability determinants chern li liew victoria university of wellington, new zealand chernli.liew@vuw.ac.nz doi: 10.1045/november2015-liew   abstract the purpose of this scoping study is to answer the research question: what does the literature tell us about online sociability that could inform how participation in collaborative construction of digital cultural heritage (dch) can be supported, motivated and sustained? a scoping review was conducted with the aim of building on the recent advances in research on online sociability and participatory culture. an extensive literature survey was undertaken across various disciplinary fields to gain a broad snapshot of the factors that have been found and suggested as having an influence on online sociability in collaborative projects. relevant literature was analysed and woven together to map a pathway with motifs that could be useful as a guide for projects working towards collaborative construction of dch.   1 background the cultures and tools afforded by the social web have offered cultural heritage institutions (chis) (archives, libraries, galleries and museums) the opportunities to transform the services they provide as well as their relationships with their user communities and the wider public. chis are still a long way from reinventing themselves through the social web however. most chis still make restricted use of social media for basic functions like communicating general information and news, and promoting events, collections and services. some have gone a bit further — taking advantage of the participatory potential of the social web to actively engage users and stakeholders in conversations about their collections and services. even fewer have involved carefully coordinated efforts of soliciting and providing meaningful ways for collaborative construction and enrichment of cultural heritage knowledge. the success and sustainability of such endeavour are likely to depend considerably on the presence of a platform and system that supports sociability as a start — one that facilitates social interaction and connectivity — which could then potentially lead to sustained engagement and collaboration to achieve a shared goal.
the value and sustainability of online participatory projects depend on many factors. each participant's interaction with the program is facilitated by technology. hence, the usability of the project platform is undoubtedly important. however, a broader view recognises that participants aim to interact not only with the project system, but with each other and with the professionals who are guiding the projects. therefore, the sociability of the project as a whole must be taken into account, and this is the focus of this research. we define online sociability in the context of dch as the extent to which the information and communication environment mediated by social media is perceived and/or believed to facilitate social interaction, engagement and meaningful participation.   2 methodology a scoping study worked well to answer the research question: what does the literature tell us about online sociability that could inform how participation in collaborative construction of dch can be supported, motivated and sustained? a scoping review was conducted with the aim of building on the recent advances in research on online sociability and participatory culture. relevant literature was analysed and woven together to map a pathway with motifs that could be useful as a guide for projects working towards collaborative construction of dch. an extensive literature survey was undertaken across various disciplinary fields (see appendix 1) including digital cultural heritage crowdsourcing and citizen science, e-commerce, educational and organisational psychology, sociology, media studies, online participatory culture and human-computer interaction to gain a broad snapshot of the factors that have been found and suggested as having an influence on online sociability in collaborative projects. there are challenges involved in collating and interpreting this literature. one issue is that although online sociability is a concept which has received increasing recognition and attention within the field of computer science over the last decade, it is often studied in other fields without being labelled as such. for example, the past decade has seen extensive literature published examining the determinants of the overall success of participatory online cultural heritage programmes and online voluntary communities of purpose such as wikipedia or open source software development networks. however, there are limitations on the information available in this area. most published quantitative information on participation rates derives from those projects which were successful (noordegraaf et al., 2014). hence, there may be gaps in our knowledge of what hinders online sociability. the available information on less successful participatory initiatives is also often anecdotal in the sense that it is usually published in summaries of experts' deductions from observations in the field, as in ridge (2013), or as case studies of single projects, as in mcclean (2011).
furthermore, most qualitative information on participants' perceptions is gathered from the small proportion of active participants who answer surveys or comment in forums (eveleigh et al., 2014; raddick et al., 2010) and thus does not tell the full story of the majority of participants, who are also valuable members of any project, not only because of their combined production capacity but also because they act as "the essential audience" for others, and as ambassadors for the program and the institution in the wider community (groenewegen & moser, 2014; nonnecke & preece, 2000; shirky, 2010). studies on participation in cultural heritage work also rarely use control groups and manipulation of variables to clearly establish causal relationships between variables, or validate participants' self-reported motivations with examination of each individual's actual participation rates under various conditions (eveleigh et al., 2014; nov, arazy, & anderson, 2011, 2014). these are some of the reasons why a net was cast across various disciplines to identify as many factors and determinants as possible, without neglecting the context-specific insights available from within the dch field itself. the breadth of this literature review has created further challenges. participatory cultural heritage projects differ widely in their aims, content, structures and participant groups. determinants of online sociability vary likewise. what motivates and encourages participants in one context can hinder sociability in another (eveleigh et al., 2014; ling et al., 2005; nov et al., 2014; ridge, 2012). however, some commonalities have emerged consistently across the fields of practice and disciplines and these common determinants of online sociability are discussed in the next section. key concepts and motifs are highlighted in bold.   3 online sociability determinants in collaborative projects there are many ways to conceptualise the complex relationships between the myriad factors which affect social interaction, engagement and participation in collaborative projects. overviews of success factors have previously been arranged according to phases of project development (noordegraaf et al., 2014), phases of participants' connection with the project (rotman et al., 2012), or as simple checklists of considerations for planning (holley, 2009, 2010; romeo & blaser, 2011). determinants of online sociability according to preece (liao & chou, 2012; preece, 2001) fall under three areas:
purpose — factors which enhance a community's shared focus
people — factors which satisfy participants' individual, social and organizational needs and allow participants to fulfil desired roles
policies — factors such as protocols, norms, or informal and formal policies that support community governance
motifs and concepts in relation to participants' interactions with a project under each of these areas are organised according to: before and during initial contact; throughout participation and post-participation events. many of these could actually be placed under more than one of these areas, as these facets of online sociability intersect with each other (preece, 2001); and volunteer participation in collaborative programs is often cyclical rather than following a simple linear sequence of beginning, middle and end (rotman et al., 2012). however, this structure of reporting will provide some order for the wide array of determinants identified in the scoping study and review of the relevant literature.
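purely as an illustration of this organising scheme, and not a tool used in the study, the sketch below indexes a handful of the determinants discussed in the sections that follow by preece's three areas and by phase of participation; the sample entries paraphrase points made later in this article.

```python
# an illustrative index of sociability determinants, organised by preece's
# three areas and by phase of participation. the sample entries paraphrase
# points discussed in the sections below; the structure itself is only a sketch.

determinants = {
    "purpose": {
        "before participation": ["pre-existing social ties and social capital",
                                 "pre-existing interest in the topic"],
        "initial contact": ["clearly communicated goals, value and beneficiaries"],
        "throughout participation": ["a large, quantified challenge with visible progress"],
    },
    "people": {
        "initial contact": ["low entry barriers", "an encouragingly easy first task"],
        "throughout participation": ["a sense of autonomy and choice",
                                     "recognition of individual contributions"],
    },
    "policies": {
        "throughout participation": ["protocols and norms that support community governance"],
    },
}

def determinants_for(area: str, phase: str) -> list:
    """look up the determinants recorded for one area and phase (empty list if none)."""
    return determinants.get(area, {}).get(phase, [])

if __name__ == "__main__":
    for item in determinants_for("people", "initial contact"):
        print("-", item)
```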
3.1 purpose — community's shared focus before participation begins social ties and social capital which exist outside of a project have been identified as beneficial for online sociability of a collaborative project. successful projects often identify a community with pre-existing social ties (bak, 2012; gao et al., 2010; liao & chou, 2012). some institutions benefit from investing time in building social capital with a community before asking them to contribute (ransom, 2008). pre-existing interest in the topic or process of the enterprise is highly motivating for participants, as is indication of the public good that could result from the project. hence, it is important to connect with the right community of interest (causer, tonra, & wallace, 2012; holley, 2009b; mankowski, 2011; nabpp, 2009; noordegraaf et al., 2014; raddick et al., 2010; romeo & blaser, 2011; shirky, 2010). communities with niche or local interests may be small, but they may also be passionate and socially well-integrated. hence, they should not be neglected (ransom, 2008). as participants initially make contact with the project the goals and underlying purposes of a collaborative project must be clearly presented and communicated to prospective participants, indicating the value and impact of the work and who could benefit from the project (holley, 2009b; liao & chou, 2012; ling et al., 2005; locke & latham, 2002; nabpp, 2009; noordegraaf et al., 2014; nov et al., 2011, 2014; proctor, 2010; ridge, 2012, 2013; romeo & blaser, 2011; yan & davison, 2013). a key phrase or image for instance may be helpful to focus the attention of first-time visitors to the project site (bitgood, 2010). it must also be clear whether the information contributed will be added to the official record or an existing repository and in what way (liew, 2014; rotman et al., 2012). data should be shared openly as soon as possible and the project should inform volunteers when it has been made openly available. this has been found to increase participants' trust in the organisation and the project concerned. this has also been found to also appeal to volunteers' altruistic or collective motivations for contributing to a meaningful cause, as well as their desire to be recognised for their efforts (eveleigh et al., 2014; romeo & blaser, 2011). throughout participation provision of a large and quantified challenge has been found to be effective in maximising participants' motivation and contribution rates (ling et al., 2005; locke & latham, 2002; romeo & blaser, 2011). indications of progress towards this big goal are integral to the success of this strategy, both for project productivity and for participants' satisfaction and wellbeing (deci & ryan, 2000; holley, 2010; locke & latham, 2002; romeo & blaser, 2011). showing that the community is active builds volunteers' confidence that their efforts will be reciprocated (gao et al., 2010; liao & chou, 2012; nov et al., 2014). establishing a shared vocabulary and 'common language' builds a sense of belonging to the project community, and thus, encourages participation (liao & chou, 2012). supporting offline interaction and integration of the project with the wider community have also been found to reinforce altruistic motivations and participation rates by demonstrating the value of volunteers' work (liew, 2014; rotman et al., 2012). 
as the aims and purposes of the project evolve, the project must keep participants and potential participants informed, for instance of any changes in short-term goals and policies (chung & lampert, 2011).

3.2 people — individual, social and organizational needs and roles
before participation begins
engagement with cultural heritage exhibitions or projects has been conceptualised as a process in which a prospective volunteer balances the perceived costs of participation, such as effort and time, against perceived benefits, whether to self or others, such as learning, fulfilment of needs or alignment with values and principles (bitgood, 2010; haefliger et al., 2011; ye & nov, 2013). these benefits may be experienced during participation in a task, as intrinsic motivators such as fun or a sense of altruism which arise inherently from the task itself; or they may result indirectly from the completion of the task, as extrinsic motivators such as social status (ye & nov, 2013). this decision-making process occurs both at the moment of initial interaction with the project and cyclically throughout the volunteer's participation, most noticeably at certain common exit points, such as the end of a task (rotman et al., 2012). therefore, sustained involvement in a project relies on an environment in which participants generally experience low barriers and positive benefits from participation.
as participants initially make contact with the project
most authors agree that entry barriers should be as low as possible (phillips, 2013; proctor, 2010; ridge, 2012). accreditation through required training may increase some volunteers' commitment to the project (rotman et al., 2012), and initial registration may in some cases reassure first-time visitors that the community is trustworthy (virtanen & malinen, 2008). however, these procedures may be ineffective or even counterproductive in attracting and retaining the majority of potential participants, and should be used with care. the first task presented to a participant must be encouragingly easy (chung & lampert, 2011; eveleigh et al., 2014; noordegraaf et al., 2014; phillips, 2013; proctor, 2010; ridge, 2012, 2013). clear, detailed instructions should be provided (holley, 2009b; noordegraaf et al., 2014), but task completion should also be guided by scaffolding in the form of constraints and visual cues, as it is best to show as well as to tell (liew, 2013; ridge, 2013; proctor, 2010; locke & latham, 2002). in addition, participants who can see others' contributions as examples have reported feeling more confident, and this has been found to lead to higher quality work (lee, hsu, & chang, 2013; locke & latham, 2002). tasks should be divided into small chunks (eveleigh et al., 2014; mankowski, 2011; nov et al., 2011; rotman et al., 2012). these measures show a first-time participant that the cost of getting started is low. at this decisive initial contact phase, a prospective volunteer also needs to see that the potential benefits are high. feedback on successful task completion (eveleigh et al., 2014; noordegraaf et al., 2014; ridge, 2013) and a choice of clearly graduated levels of difficulty (eveleigh et al., 2014) demonstrate that even casual participants can quickly gain intrinsic motivation and confidence from their success. participants have been found to be highly motivated by learning opportunities at this stage (rotman et al., 2012).
it is therefore essential to show participants what they can learn from their input, and how (bitgood, 2010; chou, 2010; liao & chou, 2012; nov et al., 2014; verhagen et al., 2012). however, ling et al. (2005) and oum & han (2011) sound a word of warning here: volunteers have strong altruistic and intrinsic motivations, and may react negatively, asserting their autonomy against messages which they perceive as manipulative or insincere marketing. as in many areas of project planning, the best ways to strike the right balance are to show rather than tell (ling et al., 2005), and to test the project interface with a small group of pioneer volunteers from the target audience (ridge, 2012, 2013; romeo & blaser, 2011).
throughout participation
to maintain a participant's engagement with the project, the perceived cost of participation must remain low. an easy, quick, reliable interface with navigable information architecture has been found to limit friction and barriers (bonastre & granollers, 2014; ridge, 2012; romeo & blaser, 2011), and supports confidence building and experimentation with the technology, improving users' impressions of the project and increasing participation rates (chou, 2010). again, the perceived benefits must be high. in this regard, several types of factors have been found to contribute to participants' sustained engagement with projects. participants derive intrinsic motivation from a sense of autonomy; therefore, a choice of options and of patterns of participation is important (deci & ryan, 2000; eveleigh et al., 2014; oomen et al., 2013; romeo & blaser, 2011). ongoing learning and personal growth are also meaningful for participants and have been found to be important in sustaining their engagement with a project (deci & ryan, 2000). projects can maximise learning opportunities and make progress salient by continuing to supply new or increasing challenges (holley, 2009b; noordegraaf et al., 2014; nov et al., 2014), and by enabling individuals to monitor their own progress and set their own targets (eveleigh et al., 2014), especially within a session (locke & latham, 2002). connected with personal growth are issues of self-expression and identity building, which are important to members of many online communities. a program can allow self-expression through representation of identity, via the use of unique names and avatar images, and of opinions, via discussion forums (chou, 2010; zollers, 2007). recognition of individuals' actual contributions is vital (chou, 2010; eveleigh et al., 2014; gao et al., 2010; holley, 2009b; huang & benyoucef, 2013; liao & chou, 2012; romeo & blaser, 2011), and there are many ways in which this can be offered. the key for project designers is to "recognise what you want to reinforce" — for example, measure and reward quality as well as quantity if this is what is required (nov et al., 2014). allowing participants to view each other's contributions can help them to build status or reputation (groenewegen & moser, 2014). as has previously been noted, this is one of the reasons why low contributors are valuable to a project: they form the majority of the "essential audience" which motivates high contributors. therefore, all participants' input should be encouraged and appreciated (groenewegen & moser, 2014; nonnecke & preece, 2000; shirky, 2010).
the relationship between low and high contributors can be mutually beneficial when participants can see each other's work, as low contributors often feel more confident to attempt tasks if they can observe before acting. other features which may help to retain low contributors or tempt them to return include choices, easy, well-defined tasks and opportunities to set short-term goals (locke & latham, 2002). on the other hand, high contributors may be rewarded by competitive features such as leader tables, or social rewards such as opportunities to take leadership or moderator roles (eveleigh et al., 2014; holley, 2010; liao & chou, 2012). this type of reputation-based reward is usually classified as an extrinsic motivator, but such rewards have been found to have the potential to enhance volunteers' enjoyment of the task itself (nov et al., 2014). participation in dch construction can also enhance volunteers' sense of identity outside of the program. participants may be motivated by accreditation of the skills developed or training undertaken within the program, by attribution on publications or other public-facing literature resulting from their contributions, or by messages showing when their data was used (rotman et al., 2012). simply making a project's site or results easy to share via other social media can encourage volunteers to show their wider social networks what good work they are doing in their spare time, as well as potentially attracting more participants (romeo & blaser, 2011). conversation between the volunteers and professionals involved in a program, and between participants, is widely identified as essential to participants' perceptions that their contribution is meaningful and rewarding (bonastre & granollers, 2014; bray et al., 2011; gao et al., 2010; holley, 2009b; lee et al., 2013; oum & han, 2011; proctor, 2010; ridge, 2012; romeo & blaser, 2011). project staff should be clearly tasked with the responsibility of responding to participants' messages and actively inviting questions and suggestions from volunteers (liew, 2014; chung & lampert, 2011), as well as responding to participants as equals in the endeavour (mankowski, 2011; rotman et al., 2012). project organisers and participants alike can benefit especially from the insights of the most active volunteers, or those who make the highest quality contributions (gao et al., 2010; nov et al., 2014; proctor, 2010). productivity, enjoyment and intention to continue are heightened when participants feel deeply immersed in an activity. high engagement can be engendered by factors such as telepresence, which is the feeling that a user is immersed in a virtual environment and is communicating with others within that environment (kwon & wen, 2010; oum & han, 2011), or by a sense of playfulness (oum & han, 2011). this can be a challenge for designers, as what is fun and diverting for one individual may be annoying or distracting for another (organisciak, 2010), but it is worth aiming for (lee et al., 2013; nabpp, 2009; verhagen et al., 2012) by taking into consideration the suitability of activities for the context concerned. game-like activities suit some participant groups (nov et al., 2014; oomen & aroyo, 2011), while simply allowing for the possibility of unexpected discoveries, and for conversation about those unexpected finds, is preferable for others (mankowski, 2011; romeo & blaser, 2011).
interaction with and between participants in close to real time enhances a sense of sociable telepresence, especially when personal opinions and identities can be expressed (lee et al., 2013). flow states were originally described by csikszentmihalyi in 1975 as cases of maximal engagement in a task, and have received ongoing attention in the psychological literature (deci & ryan, 2000; ridge, 2013; yan & davison, 2013). individuals experiencing "flow" feel so highly focused and engrossed in a task that they can lose track of time, and flow states are intrinsically motivating, as the experience is so enjoyable. flow states require a level of challenge which is optimally matched to an individual's self-perceived ability, and clear immediate feedback on the success of the individual's attempts. to provide the potential for flow states, an interface should limit distractions and unnecessary difficulties (bitgood, 2010; bonastre & granollers, 2014; ridge, 2012; romeo & blaser, 2011), lead smoothly from one task or chunk to the next (eveleigh et al., 2014), provide feedback on successful task completion (ridge, 2013), and provide graduated levels of difficulty so that each task is challenging but not so difficult as to be discouraging.
post-participation, or between participation sessions
former participants may return to active involvement and could function as ambassadors for the program and the institution concerned. continued communications about the project, especially news of interesting new events or opportunities, are often welcome and have been found to support continued engagement and productivity and to maintain a sense of community (bonastre & granollers, 2014; liao & chou, 2012).

3.3 policies — protocols; norms; informal and formal policies; community governance
before participation begins
projects can identify themselves with their parent institutions by using the familiar visual language, web space and logos of the parent site (bonastre & granollers, 2014), but they will then rely on the goodwill and trust which has been created over time by the interactions of the organisation with its community (ransom, 2008). as has previously been noted, clearly stated aims, purposes and policies of the project must be easily accessible so that participants can see how their contributions will be managed and used, i.e., will their contributions be altered, merged with existing content, shared openly, archived and/or open for reuse? (liew, 2014).
as participants initially make contact with the project
at initial contact, participants need to know that the project, the institution, and the volunteer community are trustworthy, especially if they will be asked to contribute personal information, such as family stories or photographs (bonastre & granollers, 2014; chou, 2010). demonstrating the trustworthiness of the volunteer community is less directly controllable by the project organisers. requiring participants to register is one option, but it is no fail-safe guarantee of courteous behaviour in online communities (o'neil, 2010; reagle, 2012). a better strategy is to allow potential participants to view the community's recent activity, so they can judge for themselves whether the culture is one in which they can feel at home. some correspondents in online communities also feel more secure when they can see each other's demographic details, such as home town or gender (virtanen & malinen, 2008). clear statements of expectations can also help to set the scene (holley, 2010).
throughout participation
ongoing moderation has been found to help participants feel safe and allows them to spend their time on productive contributions or discussions, instead of on conflict resolution (o'neil, 2010; reagle, 2012). participants perceive that an online community is sociable and successful when they believe that they will receive supportive feedback when they ask questions, share unique opinions, or try new things (locke & latham, 2002; mankowski, 2011; mcclean, 2011; rotman et al., 2012; yu, lu, & liu, 2010). however, many volunteer communities are highly trustworthy, requiring minimal moderation, and respond well to respectful recognition of this fact (holley, 2009b). each project must weigh its unique circumstances and consider how, and how much, the organisers will attempt to influence the project community (haefliger et al., 2011). in large or ongoing dch projects, organisers can progressively support and empower participants to co-moderate the community themselves in accordance with agreed guidelines (liew, 2013; nov et al., 2011, 2014; ridge, 2012; proctor, 2010). experienced participants also derive satisfaction, motivation and a sense of community from opportunities to guide or mentor each other, and newer volunteers can gain confidence and improve their contributions through this process (jahnke, 2010; nov et al., 2011, 2014; proctor, 2010).

4 a 'pathway' towards collaborative construction of digital cultural heritage
this study is an attempt to gain a preliminary understanding of what factors could be important in facilitating and supporting online sociability in dch projects that involve participation and collaborative construction of cultural heritage knowledge. an inter- and cross-disciplinary literature review was conducted with the aim of scoping the literature to uncover the main factors and motifs that influence the online sociability of collaborative projects. these were discussed in the previous section; in this section a preliminary 'pathway' is presented (see figure 1), which aims to map and summarise the main motifs that have been found in previous studies to affect online sociability in collaborative and participatory projects. the pathway is expected to provide a reference for researchers, to serve as a research roadmap and to stimulate new ideas in future research on this topic. it is also expected to provide practitioners in dch projects with an overview of the key motifs that influence online sociability, which they could consider in their own projects.
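to suggest one way practitioners might work with the pathway, the sketch below (python) encodes a partial selection of the determinants summarised in figure 1 below as a simple checklist structure keyed by area and phase. the structure, names and groupings are illustrative assumptions drawn from section 3, not a complete or authoritative encoding of the figure.

```python
# a minimal, illustrative encoding of part of the 'pathway' (figure 1):
# determinants of online sociability grouped by area and participation phase.
# the names and groupings are a partial, hypothetical selection.

PATHWAY = {
    "purpose": {
        "before participation": ["social ties", "social capital", "communities of interest"],
        "initial contact": ["clearly communicated goals and purposes", "trust building"],
        "throughout participation": ["indications of progress", "sense of belonging"],
        "post-participation": ["continuous communications and engagement"],
    },
    "people": {
        "initial contact": ["low barriers", "clear, detailed instructions", "task scaffolding"],
        "throughout participation": ["sense of autonomy", "ongoing learning", "recognition"],
    },
    "policies": {
        "before participation": ["accessible statements of goals, protocols and norms"],
        "initial contact": ["trustworthiness of project and community"],
        "throughout participation": ["ongoing moderation", "participant empowerment"],
    },
}

def checklist(area: str, phase: str) -> list[str]:
    """return the determinants a project team could review for one area and phase."""
    return PATHWAY.get(area, {}).get(phase, [])

if __name__ == "__main__":
    # example: what to review when designing the first-contact experience
    for item in checklist("people", "initial contact"):
        print("-", item)
```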
figure 1: preliminary 'pathway'. the figure maps the determinants of online sociability by area against four participation phases (before participation; initial contact; throughout participation; post-participation).
purpose: social ties; social capital; public good; communities of interest; goals and underlying purpose(s); trust building; altruistic, collective motivations; indications of progress; sense of belonging; supporting offline interaction; integration with wider community; continuous communications and engagement; maintaining community identity.
people: low barriers; positive benefits; clear, detailed instructions; task scaffolding; feedback on progress and task completion; graduated levels of difficulty; learning opportunities; confidence building; support for experimentation; sense of autonomy; ongoing learning and sense of personal growth; self-expression and identity building; enjoyment, sense of playfulness; conversation; telepresence; flow states.
policies: accessible and clear statements of project goals and purposes, protocols, norms, informal and formal policies; trustworthiness of project and community; clear statements of expectations; ongoing moderation; participant empowerment.
the scoping review shows that although there are many existing projects to look to as models, there is no single recipe for planning participatory and collaborative programs. project managers and coordinators are encouraged to "plan for the audience you want" (ridge, 2012), as no two projects or volunteer groups are exactly alike (eveleigh et al., 2014; ling et al., 2005; nov et al., 2014). results and reactions can be surprising, and many initiatives have achieved success by starting small and designing iteratively in response to participant feedback (ridge, 2012; romeo & blaser, 2011). open communication with participants about the development process is critical in these cases, both to maintain trust and to empower participants to take ownership in guiding the growth of the project. further research is needed to empirically validate the relevance of these determinants of online sociability and to uncover other determinants that may be applicable to dch projects. it would be useful to develop an integrative research framework which can be used to inform institutional policy and to guide practice decisions when developing and nurturing sustainable cultures of participation in the collaborative construction of dch and other similar endeavours.

acknowledgements
i would like to acknowledge the victoria university of wellington — faculty research fund (fca) grant/project 115007 for supporting the undertaking of this research, and the research assistance provided by melissa bryant.

references
[1] bak, g. (2012). building the crowd: archives as social networks. presentation at the chs | high-tech heritage 2012 conference; slides accessed courtesy of greg bak, university of manitoba, canada. [2] bitgood, s. (2010). an attention-value model of museum visitors. the center for the advancement of informal science education. [3] bonastre, l., & granollers, t. (2014). a set of heuristics for user experience evaluation in e-commerce websites (pp. 27-34). presented at achi 2014, the seventh international conference on advances in computer-human interactions. [4] bray, p., chan, s., dalton, j., dietrich, d., kapsalis, e., springer, m., & zinkham, h. (2011). rethinking evaluation metrics in light of flickr commons. paper presented at museums and the web 2011, april 6-9, 2011, philadelphia, usa. [5] causer, t., tonra, j., & wallace, v. (2012). transcription maximized; expense minimized?
crowdsourcing and editing the collected works of jeremy bentham. literary & linguistic computing, 27(2), 119-137. [6] chou, s.-w. (2010). why do members contribute knowledge to online communities? online information review, 34(6), 829-854. http://doi.org/10.1108/14684521011099360 [7] chung, s. k., & lampert, c. (2011). strategic planning for sustaining user-generated content in digital collections. journal of library innovation, 2(2), 74+. [8] deci, e. l., & ryan, r. m. (2000). the "what" and "why" of goal pursuits: human needs and the self-determination of behavior. psychological inquiry, 11(4), 227-268. [9] eveleigh, a., jennett, c., blandford, a., brohan, p., & cox, a. l. (2014). designing for dabblers and deterring drop-outs in citizen science. in proceedings of the sigchi conference on human factors in computing systems (pp. 2985-2994). new york, ny, usa: acm. http://doi.org/10.1145/2556288.2557262 [10] gao, q., dai, y., fan, z., & kang, r. (2010). understanding factors affecting perceived sociability of social software. computers in human behavior, 26(6), 1846-1861. http://doi.org/10.1016/j.chb.2010.07.022 [11] groenewegen, p., & moser, c. (2014). online communities: challenges and opportunities for social network research. research in the sociology of organizations, 40, 463-477. http://doi.org/10.1108/s0733-558x(2014)0000040023 [12] haefliger, s., monteiro, e., foray, d., & von krogh, g. (2011). social software and strategy. long range planning, 44(5-6), 297-316. http://10.1016/j.lrp.2011.08.001 [13] holley, r. (2009). crowdsourcing and social engagement: potential, power and freedom for libraries and users. (report). [14] holley, r. (2009b). many hands make light work: public collaborative ocr text correction in australian historic newspapers. australia: national library of australia. [15] holley, r. (2010). crowdsourcing: how and why should libraries do it? d-lib magazine, 16(3/4). http://doi.org/10.1045/march2010-holley [16] huang, z., & benyoucef, m. (2013). user-centered investigation of social commerce design. in a. a. ozok & p. zaphiris (eds.), online communities and social computing (pp. 287-295). springer berlin heidelberg. [17] jahnke, i. (2010). dynamics of social roles in a knowledge management community. computers in human behavior, 26(4). http://doi.org/10.1016/j.chb.2009.08.010 [18] kwon, o., & wen, y. (2010). an empirical study of the factors affecting social network service use. computers in human behavior, 26(2), 254-263. http://doi.org/10.1016/j.chb.2009.04.011 [19] lee, c.-c., hsu, s. h., & chang, j.-w. (2013). factors influencing sociability in educational mmorpgs — a fuzzy ahp approach. internet research, 23(3), 298-315. http://doi.org/10.1108/10662241311331745 [20] liao, s., & chou, e. (2012). intention to adopt knowledge through virtual communities: posters vs lurkers. online information review, 36(3), 442-461. http://doi.org/10.1108/14684521211241440 [21] liew, c. l. (2014). participatory cultural heritage: a tale of two institutions' use of social media. d-lib magazine, 20(3/4). http://doi.org/10.1045/march2014-liew [22] liew, c. l. (2013). digital cultural heritage 2.0: a meta-design consideration. information research, 18(3). [23] ling, k., beenen, g., ludford, p., wang, x., chang, k., li, x., ... kraut, r. (2005). using social psychology to motivate contributions to online communities. journal of computer-mediated communication, 10(4), 00-00. http://doi.org/10.1111/j.1083-6101.2005.tb00273.x [24] locke, e. a., & latham, g. p. (2002). 
building a practically useful theory of goal setting and task motivation. american psychologist, 57(9), 705-717. [25] mankowski, t. a. (2011). an interpretive study of meanings citizen scientists make when participating in galaxy zoo (m.s.t.). university of wyoming, united states. [26] mcclean, g. (2011). multicultural sociability, imperfect forums and online participation. international journal of communication (1932-8036), 5, 1649-1668. [27] nabpp. (2009). north american bird phenology e-newsletter 2009. [28] naguib, s.-a. (2013). museums, diasporas and the sustainability of intangible cultural heritage. sustainability, 5(5), 2178-2190. http://doi.org/10.3390/su5052178 [29] nonnecke, b., & preece, j. (2000). lurker demographics: counting the silent. in proceedings of the sigchi conference on human factors in computing systems (pp. 73-80). new york, ny, usa: acm. http://doi.org/10.1145/332040.332409 [30] noordegraaf et al. (2014). modeling crowdsourcing for cultural heritage | mw2014: museums and the web 2014. [31] nov, o., arazy, o., & anderson, d. (2011). technology-mediated citizen science participation: a motivational model. in fifth international aaai conference on weblogs and social media. [32] nov, o., arazy, o., & anderson, d. (2014). scientists@home: what drives the quantity and quality of online citizen science participation? plos one, 9(4), e90375. http://doi.org/10.1371/journal.pone.0090375 [33] o'neil, m. (2010). shirky and sanger, or the costs of crowdsourcing. [34] oomen, j., balthussen, l. b., brinkerink, m., & van exel, t. (2013). sound of the netherlands: crowdsourcing the dutch soundscape. paper presented at mw2013: museums and the web 2013, april 17-20, 2013, portland, or, usa. [35] oomen, j., & aroyo, l. (2011). crowdsourcing in the cultural heritage domain: opportunities and challenges. in proceedings of the 5th international conference on communities and technologies (pp. 138-149). new york, ny, usa: acm. http://doi.org/10.1145/2103354.2103373 [36] organisciak, p. (2010). why bother? examining the motivations of users in large-scale crowd-powered online initiatives. ma thesis in information studies, edmonton, alberta, canada. [37] oum, s., & han, d. (2011). an empirical study of the determinants of the intention to participate in user-created contents (ucc) services. expert systems with applications, 38(12), 15110-15121. http://doi.org/10.1016/j.eswa.2011.05.098 [38] past appeals | oxford english dictionary online. (n.d.). oxford english dictionary. [39] phillips, l. b. (2013). the temple and the bazaar: wikipedia as a platform for open authority in museums. curator: the museum journal, 56(2), 219-235. http://doi.org/10.1111/cura.12021 [40] preece, j. (2001). sociability and usability in online communities: determining and measuring success. behaviour & information technology, 20(5), 347-356. http://doi.org/10.1080/01449290110084683 [41] proctor, n. (2010). digital: museum as platform, curator as champion, in the age of social media. curator: the museum journal, 53(1), 35-43. http://doi.org/10.1111/j.2151-6952.2009.00006.x [42] raddick, m. j., bracey, g., gay, p. l., lintott, c. j., murray, p., schawinski, k., ... vandenberg, j. (2010). galaxy zoo: exploring the motivations of citizen science volunteers. astronomy education review, 9(1), 010103-1. http://doi.org/10.3847/aer2009036 [43] ransom, j. (2008). kete horowhenua: the story of the district as told by its people. paper presented at the 2008 vala conference, mebourne, 5-7 february 2008. [44] reagle, j. (2012). 
"free as in sexist?" free culture and the gender gap. first monday, 18(1). [45] ridge, m. (2012, april 30). designing for participatory projects: emergent best practice, getting discussion started. [46] ridge, m. (2013). from tagging to theorizing: deepening engagement with cultural heritage through crowdsourcing. curator: the museum journal, 56(4), 435-450. http://doi.org/10.1111/cura.12046 [47] romeo, f., and blaser, l. (2011). bringing citizen scientists and historians together. in j. trant and d. bearman (eds). museums and the web 2011: proceedings. toronto: archives & museum informatics. published march 31, 2011. [48] rotman, d., preece, j., hammock, j., procita, k., hansen, d., parr, c., ... jacobs, d. (2012). dynamic changes in motivation in collaborative citizen-science projects. in proceedings of the acm 2012 conference on computer supported cooperative work (pp. 217-226). new york, ny, usa: acm. http://doi.org/10.1145/2145204.2145238 [49] shirky, c. (2010). cognitive surplus: creativity and generosity in a connected age. new york: penguin press. [50] verhagen, t., feldberg, f., van den hooff, b., meents, s., & merikivi, j. (2012). understanding users' motivations to engage in virtual worlds: a multipurpose model and empirical testing. computers in human behavior, 28(2), 484-495. http://doi.org/10.1016/j.chb.2011.10.020 [51] virtanen, t., & malinen, s. (2008). supporting the sense of locality with online communities. in proceedings of the 12th international conference on entertainment and media in the ubiquitous era (pp. 145-149). new york, ny, usa: acm. http://doi.org/10.1145/1457199.1457231 [52] yan, y., & davison, r. m. (2013). exploring behavioral transfer from knowledge seeking to knowledge contributing: the mediating role of intrinsic motivation. journal of the american society for information science and technology, 64(6), 1144-1157. http://doi.org/10.1002/asi.22820 [53] ye, c., & nov, o. (2013). exploring user contributed information in social computing systems: quantity versus quality. online information review, 37(5), 752-770. http://doi.org/10.1108/oir-05-2012-0091 [54] yu, t.-k., lu, l.-c., & liu, t.-f. (2010). exploring factors that influence knowledge sharing behavior via weblogs. computers in human behavior, 26(1), 32-41. http://doi.org/10.1016/j.chb.2009.08.002 [55] zollers, a. (2007). emerging motivations for tagging: expression, performance, and activism.   appendix 1 reference domain(s)/field(s) of study/practice and country of authorship(s) relevant research question(s) notes on methodology bak, 2012 archives and social media; canada how are archives using commercially-provided social media platforms? case studies bitgood, 2010 museum visitor studies; usa what is "visitor attention"? to what do visitors pay attention while viewing exhibitions? why do visitors attend? what is the motivation for attending? how do the processes or mechanisms that explain visitor attention work? what factors interfere with paying attention to important exhibit elements? conceptual model development bonastre & granollers, 2014 heuristics development, usability, e-commerce; spain how can heuristics be used to evaluate user experience in e-commerce websites? analysis of three studies recommending functional requirements for online bookstores; development of a set of heuristics to evaluate user experience in e-commerce websites bray et al., 2011 tagging, social media, dch; australia/usa how are institutions evaluating their participation in flickr commons? 
analysis of five case studies causer et al., 2012 dch, crowdsourced manuscript transcription; england/usa how will the participatory project impact upon long-established editorial practices? is crowdsourcing the transcription of complex manuscripts cost-effective? is it exploitative? are volunteer-produced transcripts of sufficient quality for editorial use and uploading to a digital repository, and what quality controls are required? does crowdsourcing ensure sustainability and widen access to this priceless material? should the success of such projects be measured solely according to cost-effectiveness, or also by considerations of public engagement and access? case study; participant survey; analysis of transcriptions produced chou, 2010 information management, online communities, computing; taiwan what are the relationships between individuals' differences, intrinsic and extrinsic motivation, and knowledge contribution in online communities? proposal of a research framework based on a motivational model and social cognitive theory; case studies of two online communities: the electronic engineering times in taiwan and china were surveyed chung & lampert, 2011 sustainability and preservation in dch projects, planning, libraries; usa what are the challenges inherent in efficiently managing social media and user-generated content? who are the stakeholders? how can libraries develop flexible and supportive organizational frameworks to sustain and deliver on the promise of social computing? literature review, case study, proposal of lists of considerations for planning deci & ryan, 2000 psychology; usa what is the self-determination theory of human motivation? how does it relate to needs, regulatory processes underlying goal pursuits, well-being, social contexts, individual differences and other contemporary motivation theories? literature review regarding the development of theory over decades eveleigh et al., 2014 citizen science, motivation and engagement; uk what are the experiences of the majority of contributors to the old weather project — the small-scale contributors? what are the implications for project design? literature review; case study; survey and interviews gao et al., 2010 human-computer interaction; china what factors affect perceptions related to the potential sociability of communication technologies such as sms or email among undergraduate students? interviews; surveys groenewegen & moser, 2014 organisational sociology, social network theories, online communities how does social network theory apply to the study of online communities, especially with regard to tie formation and network structures? literature review; case study of a cake decorating community in the netherlands incorporating log file data haefliger et al., 2011 corporate management; norway/france/switzerland relevant research issues for corporate social software planning include: what are the organisational conditions for long-term value co-creation? how can power relationships be made transparent and ethical? to what extent should organisations lead projects' communities, how, and when? literature review and proposal of a research framework holley, 2009 dch crowdsourcing; australia what are the commonalities between successful crowdsourcing projects? how can the lessons learnt be applied across the library and archive sector, and what is the future potential? 
comparison of case studies holley, 2009b dch crowdsourcing; australia report on the beginnings of the australian newspapers digitisation program case study; participant surveys holley, 2010 dch crowdsourcing; australia how and why should libraries engage in crowdsourcing? examination of case studies; participant surveys; statistics on data produced by projects huang & benyoucef, 2013 social commerce design; canada which design features on social commerce websites are most important to users? online survey of users' perceptions and preferences of social features on social commerce websites jahnke, 2010 human-computer interaction; germany how do roles change over time among users of a knowledge management system for undergraduate students at a german university? long-term design-based research; iterative design and assessment by survey and interviews kwon & wen, 2010 human-computer interaction; korea how do social identity, telepresence and altruism affect users' intention to use social network services? survey; modification of davis' technology acceptance model lee et al., 2013 human-computer interaction in educational mmorpgs /serious games; taiwan what is the relative importance of factors affecting sociability in educational mmorpgs? hybrid methodology combining fuzzy logic techniques and analytic hierarchy process approach; interviews and survey liao & chou, 2012 human-computer interaction in bulletin board services, psychology; taiwan what are the social capital and technical determinants of knowledge adoption intentions in virtual communities? are these factors viewed differently by posters and lurkers? questionnaire used to test hypotheses about correlations of factors in a model of motivations liew, 2014 participatory dch; new zealand what are some differences between new zealand participatory dch projects in terms of content, culture and participation rates? comparison of two case studies; analysis of user-contributed content such as images, stories and comments liew, 2013 meta-design of dch projects; new zealand in what ways can cultural heritage institutions respond to the opportunities and challenges brought on by the social web which involves variability, mutability and uncertainty? conceptual analysis and development of conceptual framework ling et al., 2005 social psychology, computer-mediated communication; usa can theories of social loafing and goal-setting predict participation rates in response to different explanations of the value of contributions from members of an online community? application of karau and william's collective effort model and locke and latham's goal-setting theory; testing design principles in field experiments with members of an online movie recommender community locke & latham, 2002 social psychology; usa what are the core findings of goal theory, moderators of goal effects, relation between goals and satisfaction, practical significance of goal-setting theory, and relationships of goal setting to other theories? meta-study/review of thirty five years of literature on goal theory mankowski, 2011 physics and astronomy, citizen science; usa what are the motivations of participants? interpretive research; grounded theory analysis of project forum discussions, including lexical analysis mcclean, 2011 communications/media studies, online community engagement in a public service broadcasting company; australia what is multicultural sociability? 
are conditions for multicultural sociability facilitated by the online chat and comments that follow a television current affairs forum on australia's national public broadcaster the special broadcasting service? case studies analysis of forum comments, with iterative coding of type of interaction reflecting the diversity of the data nabpp, 2009 citizen science; usa what are the motivations of volunteers? participant survey; case study naguib, 2013 museum studies, sustainability of intangible cultural heritage; norway what is the work of museums in constructing the intangible cultural heritage of migration and diasporas? how can collaborative museum projects keep intangible heritage alive for peoples who have immigrated to europe? literature review; examination of case studies nonnecke & preece, 2000 human-computer interaction; usa how prevalent are lurking and low contribution rates in each of two email discussion lists? is there a difference between lurking rates in the health related and software support related lists? what are the relationships between lurking and overall traffic rates? statistical analysis of contributions to two email discussion list over three months noordegraaf et al., 2014 design and evaluation of effective dch crowdsourcing projects; netherlands what are the effects of project design properties on the success or failure of dch projects? literature analysis; close study of two cases nov et al., 2011 citizen science, business, psychology; usa/canada what are the effects of task granularity and of different types of motivation on participation rates and intentions? survey of participants and comparison with actual contributions; model building nov et al., 2014 citizen science, business, psychology; usa/canada how are quantity and quality of contributions affected by different types of motivation and incentives, and by reputation? longitudinal study of a set of cases; comparison of survey results with actual contributions; modification of the social movement participation model by drawing on self-determination theory o'neil, 2010 human-computer interaction, volunteering; online collaboration; france/australia/usa how are experts defined in collaborative encyclopaedia projects? what are the relative costs of different types of systems for identifying expertise? comparison of case studies oomen et al., 2013 dch, public contribution of content; netherlands what are some issues in recruiting participants to contribute content to an online sound archive? case study oomen & aroyo, 2011 dch, crowdsourcing; netherlands what types of crowdsourcing activities are undertaken by cultural heritage institutions? what are the relationships between the different types of crowdsourcing and the core activities of heritage organizations? what are the critical challenges in finding sufficient knowledgeable and loyal users and in maintaining quality of contributions? empirical study of a substantial amount of projects initiated by relevant cultural heritage institutions organisciak, 2010 dch, e-commerce, private enterprise; canada what are users' motivations to participate in e-commerce sites? theory-building about motivations, based on studying the websites and reports of private and dch projects, plus interviewing a small number of internet users oum & han, 2011 human-computer interaction; south korea what factors affected intention to participate in an online user-contributed content system among undergraduate students in jeonju university, south korea? 
comparison of correlations between survey factors phillips, 2013 open-source software, dch; usa how can museums' role as content providers be augmented with the role of platform provider? how does wikipedia provide a model of open authority for museums? literature review preece, 2001 information systems, human-computer interaction; usa what are the similarities and differences between usability and sociability? literature review proctor, 2010 participatory dch; usa why should institutions do participatory work? what makes a participatory culture? literature review; case studies raddick et al., 2010 citizen science; usa what are the motivations of volunteers? participant survey ransom, 2008 dch; new zealand what factors contributed to the initiation, design and success of the kete horowhenua project and community? case study reagle, 2012 communication studies; free/libre and open-source software (floss) how do unstated norms and unofficial power structures cause gender imbalance in floss communities such as wikipedia and ubuntu? how can content and collaboration be enhanced by more balanced participation? case studies; discourse analysis ridge, 2012 dch; australia what is best practice in participatory project design? case studies ridge, 2013 dch, citizen science, crowdsourcing, games with a purpose (gwap); australia what is the value of crowdsourcing dch for participants, institutions and users of the resulting resources? how does scaffolding support participation? what motivates volunteers? how does participation enable deep engagement with heritage? what design factors contribute to the success of projects? literature review; case studies romeo & blaser, 2011 dch, citizen science; uk what motivates volunteers? what is best practice in participatory project design? literature review; case studies; construction of a model for best practice in crowdsourcing rotman et al., 2012 human-computer interaction, citizen science; usa how do participant motivations evolve during participation in a project? participant survey and interviews shirky, 2010 social media, psychology, commerce, media studies; usa what design and planning strategies enhance success in participatory projects? literature review; case studies verhagen et al., 2012 human-computer interaction, virtual worlds; netherlands/finland what motivates users to engage in the virtual world second life? literature review; empirical testing of a model of user motivations based on information system value and motivation theory virtanen & malinen, 2008 social media, online communities, user-contributed content, human-computer interaction; finland how can social interaction and sense of place in geography-based communities be supported by online communities, such as facebook location groups or a local-community service provided by a finnish newspaper? comparison of case studies; participant survey and interviews yan & davidson, 2013 management, computer-human interaction, computer science, psychology; china/hong kong how can self-perception theory and internal motivation factors explain the transition from knowledge seeking to knowledge contributing among users of online business discussion communities in china? 
literature review, empirical testing by survey of a model of user motivations based on self-perception theory and internal motivation factors ye & nov, 2013 human-computer interaction, user-contributed content; usa what is the relationship between quantity and quality of user contributed information in the online social computing system flickr? how are these two performance outcomes associated with motivational, cognitive and social factors, such as reputation, social ties or expected rewards including self-development? literature review, empirical testing of a model using a survey yu et al., 2010 computer-human interaction, professional information-sharing networks; taiwan what factors optimise knowledge sharing in online professional information-sharing networks in taiwan? survey of twenty self-selected, self-described participants in each of three online professional information-sharing networks zollers, 2007 social tagging; usa how do users utilize tags for social purposes, rather than for folksonomic information organization? analysis of two free-for-all systems: a commercial site (amazon.com), and a music site (last.fm), using a snowball sampling technique to identify co-occurring tags with sociable purposes

about the author
chern li liew is a senior lecturer in the school of information management, victoria university of wellington (new zealand). she holds a phd in information studies from nanyang technological university (singapore) and an msc from loughborough university (uk). her research interests center on digital information services as socio-technical environments and she has published in the areas of digital libraries and digital cultural heritage: http://www.victoria.ac.nz/sim/about/staff/chernli-liew. she is on the editorial advisory boards of "international journal of digital library systems" and "online information review". she is also a member of the executive committee of the consortium of ischools asia pacific.
copyright © 2015 chern li liew

trustworthiness: self-assessment of an institutional repository against iso 16363-2012
d-lib magazine march/april 2015, volume 21, number 3/4
bernadette houghton, deakin university, geelong, australia, bernadette.houghton@deakin.edu.au
doi: 10.1045/march2015-houghton

abstract
today, almost every document we create, and the output from almost every research-related project, is a digital object. not everything has to be kept forever, but materials with scholarly or historical value should be retained for future generations. preserving digital objects is more challenging than preserving items on paper: hardware becomes obsolete, new software replaces old, and storage media degrade. in recent years, there has been significant progress in developing tools and standards to preserve digital media, particularly in the context of institutional repositories. the most widely accepted standard thus far is the trustworthy repositories audit and certification: criteria and checklist (trac), which evolved into iso 16363-2012. deakin university library undertook a self-assessment against the iso 16363 criteria.
this experience culminated in the current report, which provides an appraisal of iso 16363, the assessment process, and advice for others considering embarking on a similar venture.

introduction
digital preservation is a relatively young field, but significant progress has already been made towards developing tools and standards to better support preservation efforts. in particular, there has been growing interest in the audit and certification of digital repositories. given the growing reliance on institutional repositories in the past decade (opendoar, 2014), the need for researchers to be able to trust that their research output is safe is becoming increasingly important. this need was formally recognised by deakin university library in 2013, with the establishment of a project to determine the compliance of its research repository, deakin research online (dro), with digital preservation best practices. dro was established in 2007 against the background of the australian government's research quality framework (department of education, science and training, 2007). its primary objective was to facilitate the deposit of research publications for reporting and archival purposes. the fez/fedora software underlying dro was chosen, in large part, because of its preservation-related functionality, such as versioning, and jhove and premis support. over the years, new functionality has been added to dro, and workflows changed as deposit and reporting requirements changed. several ad-hoc mini projects have been undertaken to address specific preservation-related aspects of dro, but up to 2013 there had been no assessment of the new functionality or workflows against digital preservation best practices. a review of the digital preservation literature indicated that external accreditation of the trustworthiness of digital repositories by bodies such as the center for research libraries (crl) had started to gain some traction among larger cross-institutional repositories (crl, 2014). due to the cost, external accreditation was not seen as a viable option for deakin (a medium-sized university library) at that stage; however, a self-assessment against the standards used for accreditation was considered feasible. the literature review indicated that the trustworthy repositories audit and certification (trac) checklist (crl, oclc, 2007) was the most widely accepted criteria for assessing the trustworthiness of digital repositories. trac, originally developed by the research libraries group (rlg) and the national archives and records administration, evolved into the audit and certification of trustworthy digital repositories: recommended practice ccsds 652.0-m-1 (consultative committee for space data systems (ccsds), 2011), which was formalised as iso 16363 in 2012 (international organization for standardization, 2012). iso 16363 was chosen as the basis of deakin's self-assessment of dro. preliminary research for the project began in may 2013, with one staff member (the author) allocated to the project on a part-time basis. initially, the expected completion date was august 2013. however, due to other work priorities, the actual assessment did not begin until july 2013. the final report was drafted in december 2013.

laying the groundwork
once the go-ahead was given for the project, research was undertaken to learn from others' experience. of particular interest were the audits undertaken by crl of hathitrust, portico and scholar's portal (crl, 2014).
it was noted that each of the repositories reviewed had committed to some degree of self-assessment in addition to the external accreditation audit by crl. the crl website includes checklists and templates which can be used as the basis for a self-audit (crl, 2014). however, these were considered too out-of-date to use, being based on the 2007 trac criteria rather than the more recent iso 16363. a wiki was chosen as the mechanism for documenting the self-assessment. the crl website includes an excel template as the documentation mechanism; however, the choice of a wiki instead turned out to be a wise decision, as some of the iso 16363 criteria required lengthy responses which do not fit well into spreadsheets. the wiki was set up to include the following fields for each criterion: "suggested evidence", "relevant documents", "assessment", "compliant" and "areas of improvement". the setup stage was a good opportunity to become familiar with the criteria and the documentation required for the next step of the assessment.
figure 1: example wiki documentation of the self-assessment.
once the wiki framework was complete, a preliminary document hunt was undertaken to identify documentation that would provide evidence of the extent to which dro was meeting each criterion. relevant policies and procedures, at both library and university level, were sought out and linked to in the wiki. this required a good knowledge of the university's and library's organisational responsibilities, as well as familiarity with the university's governance web pages. iso 16363 provides 'suggested evidence' examples for each of its criteria. however, it soon became obvious that many of the suggested examples were not relevant for dro. from then on, documents were gathered based on the author's local knowledge. the preliminary collection of documentation did not prevent the need to search for additional documents during the course of the assessment, but it did save much time. once the preliminary documents had been gathered, each criterion was reviewed against the relevant documentation. additional documentation was chased up as necessary, and workflows and procedures clarified with relevant staff. an assessment was then made of dro's compliance with the criteria, and areas of improvement identified. all findings were progressively documented in the wiki, with each criterion allocated a rating of "full compliance", "part compliance" or "not compliant".

iso 16363 assessment
iso 16363 contains 105 criteria covering 3 areas. the criteria are generally very comprehensive, although some have inadequate or missing explanations, leaving them open to interpretation. there is some overlap between individual criteria, and not all criteria will be applicable to all repositories. the areas are:
organizational infrastructure, including governance, organisational structure and viability, staffing, accountability, policies, financial sustainability and legal issues;
digital object management, covering acquisition and ingest of content, preservation planning and procedures, information management and access;
infrastructure and security risk management, covering technical infrastructure and security issues.
iso 16363's example documentation for each criterion should be viewed as a general guideline only. each repository's mileage will vary here; documentation that is relevant for one repository will not be relevant for another.
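as a concrete illustration of the documentation and rating structure described above, the following python sketch records one entry per criterion, using field names that mirror the wiki fields, and tallies the ratings by top-level iso 16363 section. it is a minimal, hypothetical sketch: the class, function names and example data are assumptions and are not drawn from the deakin wiki itself.

```python
from collections import Counter
from dataclasses import dataclass, field

# illustrative structure for one wiki entry per iso 16363 criterion;
# the field names mirror the wiki fields described above, the data is invented.
@dataclass
class CriterionEntry:
    criterion_id: str                 # e.g. "3.3.5"
    suggested_evidence: str = ""
    relevant_documents: list[str] = field(default_factory=list)
    assessment: str = ""
    rating: str = "not compliant"     # "full compliance" | "part compliance" | "not compliant" | "not applicable"
    areas_of_improvement: str = ""

def summarise(entries: list[CriterionEntry]) -> dict[str, Counter]:
    """tally ratings per top-level iso 16363 section (3, 4 or 5)."""
    summary: dict[str, Counter] = {}
    for e in entries:
        section = e.criterion_id.split(".")[0]
        summary.setdefault(section, Counter())[e.rating] += 1
    return summary

# example usage with invented entries
entries = [
    CriterionEntry("3.1.1", rating="full compliance"),
    CriterionEntry("3.3.5", rating="not compliant",
                   areas_of_improvement="introduce routine fixity checking"),
    CriterionEntry("4.2.1", rating="part compliance"),
]
for section, counts in sorted(summarise(entries).items()):
    print(section, dict(counts))
```

a summary produced this way corresponds to the kind of per-section compliance counts reported in the findings below.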
the collection of documentation was probably the most time-consuming part of deakin's self-assessment, as documents were scattered across multiple areas of the university. ambiguous terminology used by iso 16363 caused some confusion. for example, iso 16363 uses "linking/resolution services" to refer to persistent identifiers or links to materials from the repository. however, in deakin's experience, this terminology is more commonly used to refer to link resolution services such as webbridge and sfx. iso 16363 is based on the open archival information system (oais) model (ccsds, 2012). it needs to be kept in mind that oais is a conceptual model that won't necessarily appear to align with actual or desired workflows. for example, oais treats submission, dissemination and archival packages as separate entities. this doesn't mean that oais-compliance requires separate submission, dissemination and archival packages. the theoretical nature of oais is reflected in iso 16363. undertaking a self-assessment against iso 16363 is not a trivial task, and is likely to be beyond the ability of smaller repositories to manage. any organisation that undertakes a self-assessment against iso 16363 should tailor it to best fit their own circumstances and budget. if iso 16363 still appears too mammoth an undertaking, alternative tools exist, such as ndsa's levels of digital preservation (owens, 2012). it should be stressed that a self-assessment is not an audit. in an independent audit, the auditor would be at arm's length from the repository, evidence of compliance with policies and procedures would be required, and testing undertaken to confirm that the repository's preservation-related functionality was indeed operating as expected. deakin's self-assessment was undertaken by the author, who works closely with dro, and was basically restricted to a review of relevant documentation and practices. for a first self-assessment, this was considered to be an appropriate strategy. however, future self-assessments should undertake more rigorous testing of dro's underlying functionality.

findings
the assessment indicated that dro meets most of the criteria for being considered a trusted digital repository, but, as expected, there is room for improvement. more specifically, dro fully meets 67 of the 105 criteria, partially meets 32, and does not meet 5 criteria, with one criterion not applicable. table 1 summarises these results across the three iso 16363 areas.
table 1: iso 16363 compliance summary
iso 16363 section | full compliance | part compliance | not compliant
3. organizational infrastructure | 13 (54%) | 8 (33%) | 3 (13%)
4. digital object management | 36 (62%) | 20 (34%) | 2 (3%)
5. infrastructure and security risk management | 18 (82%) | 4 (18%) | 0
it should be pointed out here that not all criteria should be considered to have equal weight. while iso 16363 does not allocate a rating to each criterion, some criteria are obviously more important in terms of a repository's trustworthiness than others. a statement that a repository "meets 60% of the iso 16363 criteria" can be very misleading if one of the criteria the repository does not meet is fundamental to a repository's trustworthiness; for instance, absence of integrity measurements (iso 16363 criterion 3.3.5). the self-assessment indicated that dro's approach to preservation tends to be ad-hoc, which was already known.
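to make the 'integrity measurements' example concrete: one common form of integrity measurement is routine fixity checking, that is, recomputing checksums for stored files and comparing them with a previously recorded manifest. the following python sketch is generic and hypothetical; the manifest format, paths and function names are assumptions and are not drawn from dro or the fez/fedora software.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """compute the sha-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_fixity(manifest_path: Path) -> list[str]:
    """compare current checksums against a stored manifest of the form
    {"relative/path/to/file": "expected sha-256 hex digest", ...};
    return the files that are missing or whose content has changed."""
    manifest = json.loads(manifest_path.read_text())
    base = manifest_path.parent
    failures = []
    for rel_path, expected in manifest.items():
        target = base / rel_path
        if not target.exists() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures

if __name__ == "__main__":
    # hypothetical manifest location; a repository would typically schedule this
    # check regularly and retain the outcome as audit evidence.
    problems = check_fixity(Path("storage/manifest.json"))
    print("fixity failures:", problems or "none")
```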
the self-assessment indicated that dro's approach to preservation tends to be ad-hoc, which was already known. to a large extent, this is due to the still-evolving nature of the digital preservation field, and most repositories will be in a similar situation at this point in time. this approach is considered to have been an appropriate one thus far; however, it is expected that a more pro-active approach will be taken in future, in line with the availability of new tools and a more mature digital preservation knowledge base. the majority of the areas of improvement identified by the self-assessment have now been incorporated into deakin's preservation strategy plan. the next step for deakin is to implement the practical strategies outlined in its preservation strategy plan to ensure it is best placed to take advantage of evolving practices in digital preservation.
summary and conclusions
overall, the self-assessment has been a time-consuming and resource-heavy exercise, but a beneficial one, with several areas of weakness identified in dro's setup and workflows. a strategic plan has since been developed to address these weaknesses, and thus increase the robustness of dro's trustworthiness as a repository. it is anticipated that regular re-assessments will be undertaken at 3-yearly intervals; these are not expected to be as resource-intensive as the initial self-assessment, as the relevant documentation has already been identified and located. it is highly recommended that other libraries undertake a similar self-assessment of their repository at some stage, with the tools used and the depth of the assessment dependent on the size of the library and the level of available resources. researchers expect their publications to be safe in institutional repositories, and repository managers need to ensure their repository meets this expectation. self-assessment is one mechanism by which they can ascertain whether their repository is indeed trustworthy. based on deakin's experience, the following suggestions are offered to repository managers who plan to undertake an iso 16363 assessment:
- do a self-assessment before considering paying for external certification. certification — and re-certification — is expensive.
- get senior management on board. their support is essential; digital preservation is a long-term issue.
- the individual doing the self-assessment should be reasonably familiar with the organisation's and repository's policies and procedures.
- if you don't have the time or resources to undertake an iso 16363 assessment, consider doing an assessment against the ndsa levels of digital preservation (owens, 2012).
- set up a wiki to document the self-assessment. do this at the start, and document findings as you go along.
- tailor the self-assessment to risk and available time and resources. determine in advance how deep the assessment will go. for example, will the assessor just collect and review documentation, or also check that documented procedures have been followed and everything 'under the hood' is working properly?
- use local knowledge when gathering documentation. iso 16363's 'suggested evidence' examples are possibilities only.
- become familiar with the criteria before you start the assessment. some documentation will be relevant to multiple criteria, so it saves time if you can identify those criteria early on. remember, not all iso 16363 criteria will be applicable to your particular situation.
- keep up the momentum. finishing the self-assessment does not mean the hard work is over; there will be improvements to make. aim to build up your repository's digital resilience over time.
- schedule regular self-assessments.
- if you're thinking about doing an iso 16363 self-assessment at some time in the future, start the process now. set up a wiki page to record relevant documentation you come across in the meantime, keep a watching brief on digital preservation issues, and update the wiki as needed to save time later on.
- don't assume that because your repository software is oais-compliant, your repository itself is also. workflows and repository setup can make or break oais-compliance.
- not all iso 16363 criteria have the same importance or risk level. assess each criterion accordingly.
- iso 16363 is based on a conceptual model (oais). don't expect the criteria to necessarily align with your repository's particular setup and workflows.
references
[1] center for research libraries and oclc, 2007, trustworthy repositories audit & certification (trac) criteria and checklist, version 1, center for research libraries.
[2] center for research libraries, 2014, trac and tdr checklists, center for research libraries.
[3] consultative committee for space data systems, 2011, recommendation for space data system practices: audit and certification of trustworthy digital repositories, recommended practice ccsds 652.0-m-1, magenta book, consultative committee for space data systems.
[4] consultative committee for space data systems, 2012, reference model for an open archival information system (oais): recommended practice ccsds 650.0-m-2, ccsds, washington, dc.
[5] department of education, science and training, 2007, research quality framework: assessing the quality and impact of research in australia. rqf technical specifications, commonwealth of australia.
[6] international organization for standardization, 2012, space data and information transfer systems: audit and certification of trustworthy digital repositories (iso 16363:2012).
[7] opendoar, 2014, growth of the opendoar database — worldwide, opendoar.
[8] owens, trevor, 2012, ndsa levels of digital preservation: release candidate one, library of congress.
about the author
bernadette houghton is currently the digitisation and preservation librarian at deakin university in geelong, australia. she has worked with deakin research online since its inception in 2007, and has a strong background in systems librarianship and cataloguing, as well as four years as an internal auditor.
copyright © 2015 bernadette houghton
quality of research data, an operational approach
d-lib magazine january/february 2011 volume 17, number 1/2
leo waaijers, leowaa@xs4all.nl
maurits van der graaf, pleiade management and consultancy, m.vdgraaf@pleiade.nl
doi:10.1045/january2011-waaijers
abstract
this article reports on a study, commissioned by surffoundation, investigating the operational aspects of the concept of quality for the various phases in the life cycle of research data: production, management, and use/re-use. potential recommendations for quality improvement were derived from interviews and a study of the literature. these recommendations were tested via a national academic survey of three disciplinary domains as designated by the european science foundation: physical sciences and engineering, social sciences and humanities, and life sciences.
the "popularity" of each recommendation was determined by comparing its perceived importance against the objections to it. on this basis, it was possible to draw up generic and discipline-specific recommendations for both the dos and the don'ts.   introduction scientific and scholarly research nowadays results not only in publications but increasingly also in research data, i.e. collections comprising the data on which the research is based but which comes to live a life of its own — subsequently or parallel to the actual publications — as an independent source of information and analysis for further research: the "fourth paradigm of science". for this to be possible, such collections of data need to be traceable and accessible in just the same way as actual publications. quality also plays a major role in both types of research product. but whereas for publications this aspect has been operationalised — not always without controversy — via peer review and citation indices, it is still in its infancy as regards research data. surffoundation — the ict partnership between the dutch higher education institutions — is following these developments closely and in 2010 devoted three studies to the topics of selection, organisation, and quality of research data. the third of these studies was carried out by pleiade management and consultancy (maurits van de graaf) and leo waaijers (open access consultant). the present article is based on the results of that study.   summary the study investigated the operational aspects of the concept of quality for the various phases in the life cycle of research data: production, management, and use/re-use. nine potential recommendations for quality improvement were derived from interviews and a study of the literature. the desirability and feasibility of these recommendations were tested by means of a national survey of university professors and senior lecturers, with a distinction being made in this regard between the three disciplinary domains applied by the european science foundation: physical sciences and engineering, social sciences and humanities, and life sciences. the "popularity" of each recommendation was determined by setting off its perceived importance against the objections to it. on this basis, it was possible to draw up generic and discipline-specific recommendations for both the dos and the don'ts.   survey of the literature literature dealing with the quality of research data is a recent development but is rapidly increasing in volume. the importance of the quality aspect of research data is broadly recognised. this virtually always concerns the quality of the metadata and documentation, but sometimes also the quality of the actual data itself. no generic definition of the term "quality" could be identified. in general, further study is recommended; this matter is still in the pioneering phase. (for examples, see [3] and [9].) one important milestone was the 2007 oecd publication principles and guidelines for access to research data from public funding [15], which adopts the basic principle that such data is a public good which must be made as accessible as possible. our study makes a number of comments on this, particularly where data from the life sciences is concerned. the rin report [12] takes the first step towards operationalising the quality aspect by linking it to the various phases in the life cycle of research data: production, management, and use/re-use. 
this tripartite categorisation is frequently followed in subsequent literature, for example in the present study. a new milestone is the recent report by the high level expert group on scientific data, which develops an impressive vision regarding the issue of "how europe can gain from the rising tide of scientific data" [1]. the report includes a plea for the development of methods for measuring the impact and quality of datasets and for the production and publication of high-quality datasets to be made relevant to the career development of scientists/scholars. our recommendations would operationalise that plea. some specific findings include: the publication by marc van opijnen [5] focuses on the structuring of information (in the legal field) and creates a link between the topic of our study (the quality of research data) and a central theme of the surfshare programme (enhanced publications), concluding that it is not technology that is the main challenge but scientific/scholarly professionalism. the dissertation by jinfang niu [6] gives a lengthy analysis of the importance of good documentation for the re-use of research data in the social sciences. the interim business plan 2008/9 by the australian national data service (ands) [10] announces that an annotation service will be set up for the research data for the australian life atlas as a means of improving the quality of that data. in 2008, data archiving & networked services (dans) introduced a "data seal of approval" for the social sciences and humanities that has since gained international recognition [19]. to qualify for the seal of approval, data producers, data repositories, and data consumers must comply with certain minimum quality requirements. the string of pearls initiative [25] collects clinical data and biomaterials, stores them, and makes them accessible according to a strict code of conduct.
interviews
the findings of the survey of the literature were presented during interviews with sixteen "data professionals", i.e. data managers and researchers. the above-mentioned tripartite categorisation as regards the life cycle of research data was used to define three types of quality control:
- quality control in the production of data;
- data management: ensuring permanence and accessibility (including metadata, documentation, formats, rights);
- actual content quality: "the scholarly merit of the dataset".
all those interviewed considered this to be a useful categorisation.
quality during the production phase
an elementary distinction can be made between data produced by equipment and data resulting from the registration of human behaviour and findings. this distinction does not necessarily correspond with the division between the exact sciences and the humanities. digitised collections of texts in the humanities, for example, belong to the first category of data, while the social and economic sciences work with collections of data from both categories. this is an important distinction in the context of the present study because there are different ways of looking at the issue of quality for the two categories. see also [8]. in the first category, it is the accuracy of the equipment and the refinement of the applied algorithms that are central. to assess the accuracy of measurements, the calibration details of the measuring equipment are required, for example. the calibration of the measuring equipment may be changed or a better algorithm may be developed to perform the necessary calculations.
this may therefore mean that it must be possible to recalculate the dataset using the improved calibration details and/or the improved algorithm. in the second category, methodological issues play a primary role. this involves such questions as: is the chosen method for collecting data the most appropriate for this research objective? has the method been applied correctly (for example random sampling, double-blind testing)? does the dataset adequately describe the phenomenon that is being studied (for example representativeness)? how have the integrity aspects of the data been dealt with when cleaning up the dataset (for example contradictory data, presumed measuring errors, incomplete data, etc.)? have the relevant ethical requirements been complied with (for example privacy or animal welfare)? for both categories of data, the documentation accompanying the dataset must account for these matters. the documentation is an essential component of the dataset and effectively determines its quality. if the documentation is not written during or immediately after the production phase, it usually never gets done. quality of data management data management focuses on ensuring the long-term accessibility of the dataset. doing so involves, in part, the same considerations as ensuring long-term access to published digital articles. there are some differences, however. the first difference has to do with technical permanence. someone must guarantee this; the researcher's own hard disk or the vulnerable repository of a small institute are not good enough. in the case of articles, many publishers guarantee accessibility, or there may be the e-depots of libraries (including national libraries) such as the national library of the netherlands. this is not usually the case with research data, however. data archives are set up all over the place, generally on the basis of a particular discipline or subject, and often on a national or international scale. as yet, there appears to be no acute shortage of memory capacity, but the need for selection in the future is recognised. this topic goes beyond the scope of the present study. there is then the question of retrievability, for which good metadata is a basic requirement. the metadata for data collections is more complex than that for articles. (for example, see [23].) the documentation forms part of the metadata but the data formats also need to be recorded. these can be extremely varied, and they relate to a wide range of technical aspects of the data collection. whether digital objects — in this case collections of data — can be traced is also determined by the application of standards. finally, there is the issue of accessibility. the software applications for research data are very varied and often specific to particular disciplines. these applications need to be maintained so as to provide continued future access to the data. it is therefore an obvious step to bring together the research data for particular disciplines or subjects in specialised (international) data archives. as with articles, access to research data may be affected by copyright considerations. a dataset, for example a body of texts, may be made up of subordinate collections that are subject to different copyright rules. in these cases an open access system with standard provisions such as the creative commons licences would provide an effective solution. 
scientific/scholarly quality: scholarly merit
when discussing the assessment of the scientific/scholarly quality of research data, our interviewees often referred to some form of peer review. this could either be peer review prior to storage and provision of the data — as part of the peer review of an article based on that data — or subsequent peer review in the form of annotations by re-users (as in [4] and [7]). in general, the interviewees had their doubts about the feasibility of peer review in advance because of the demand it would make on the peer reviewer's time (see also [11]). it was also pointed out that such a system would lead to an unnecessary loss of time before the dataset could be made available (see also [2]). some respondents thought that it was theoretically impossible to assess the "scholarly merit" of a dataset in isolation; the dataset exists, after all, in the context of a research question. in an increasing number of cases, datasets are published along with the articles on which they are based, certainly in the case of "enhanced publications". in such cases, the peer review of the article can also take account of the dataset. in many cases, however, it is only the supplementary data that is published along with the article, i.e. a selection of the data necessary to support the article. respondents doubt whether reviewers actually immerse themselves in that data when arriving at their quality assessment; here too, pressure of time plays a role. a new variant in this context involves special journals for "data publications", i.e. separate articles describing the data collection. these articles can make their own contribution to the prestige of the person who produced the data through citations and the impact factor of the journal concerned. examples include the journals earth system science data [17] and acta crystallographica section e [16]. in the case of subsequent peer review, researchers who re-use the dataset are requested to write a brief review of it. these reviews or annotations are then linked to the dataset (see also [14]). the question that arose was whether such re-users would in fact be prepared to produce a review; this could perhaps be made a condition for being permitted to re-use the dataset. finally, it was suggested that, rather than setting up a separate quality assessment system for data, one could create a citation system for datasets, which would then form the basis for citation indices. the thinking behind this was that citation scores are a generally accepted yardstick for quality.
classification of datasets
if research datasets are categorised according to origin, one can draw the following broad conclusions as regards quality control. a. datasets produced for major research facilities with a view to their being used/re-used by third parties. examples include data generated by large-scale instruments (lhc, lofar), meteorological data (royal netherlands meteorological institute), data collected by oceanography institutes (nocd), and longitudinal data on dutch households (liss panel). in this situation, there are a number of mechanisms for monitoring and improving the quality of the datasets, with peers being involved in many cases. the quality control for this data is of a high standard; this category of data was therefore not surveyed as regards quality improvement measures. b. supplementary data and replication datasets that are published along with scientific/scholarly articles. this often involves subsets of larger datasets.
the data concerned should basically be taken into consideration during the peer review process for the publication. in actual practice, this only happens on a limited scale. matters are different, however, with the specialised data publications referred to above; here, the documentation that belongs with the dataset is published as a separate article in a specialised journal. this article, and with it the dataset, is then subjected to a specific peer review. at the moment, there are only a few such journals [16] and [17]. c. datasets included in data archives as required in some cases by the body that funds the research, for example the netherlands organisation for scientific research (nwo) [20]. this may also involve category b datasets if the publisher does not have the facilities to guarantee permanent storage and accessibility of the dataset. in many cases, the staff of the data archive subject the metadata and the documentation accompanying the dataset to a quality check and also, if applicable, a check on whether the content of the dataset actually falls within the scope of the particular data archive [18]. dans issues datasets that meet certain quality requirements with its data seal of approval [19]. tailor-made checks on content quality are found in the case of large data collections such as the genographic project [13], the rcsb protein data bank [22], and the world ocean database 2009 [21]. a few data archives have also set up a kind of "waiting room" where researchers can deposit their datasets without these being subject to prior quality control. a selection of the datasets thus deposited is then subjected to a quality check and included in the data archive. the criteria for selection have not been made explicit. d. datasets that are not made (directly) available. this is by its nature a large and virtually unknown area. in the light of the interviews held in the course of our study, it would seem sensible to distinguish between datasets that have been created primarily for the researcher's own use — for example by a phd student — and datasets created by larger groups or by several groups (including international groups), often with a view to their being used by a large number of people and over a lengthy period. in the latter case, the parties concerned often reach agreement regarding quality and set up mechanisms for monitoring and enforcing it. datasets created for a researcher's own use will not generally be accompanied by any documentation.   questionnaire based on the literature survey and the interviews, a list of nine measures was drawn up for improving the quality of collections of research data. these measures were submitted in the form of a questionnaire to a representative sample of all dutch professors and senior lecturers (uhds), in all a total of 2811 persons. the response rate was 14%, which can be classed as good. another 14% explicitly declined to participate, often giving reasons. the large number of explanations accompanying the answers was striking and it seems justifiable to conclude that the topic is one that is of interest to dutch scientists and scholars. virtually all those who filled in the questionnaire are themselves peer reviewers (95%) and more than half (57%) are on the editorial board of a peer-reviewed journal. the respondents also have extensive experience with research datasets: 71% produce/co produce such datasets; 60% have at some point made a dataset available to third parties; and 50% (also) re-use datasets. 
the responses were subsequently divided according to the disciplinary domains applied by the european research council [24]: physical sciences and engineering, social sciences and humanities, and life sciences (excluding some respondents who had indicated more than one discipline). cross-analyses could also be made according to the subgroups of re-users, providers, and producers of datasets.
a. peer review of datasets as part of the peer review of articles
1. in my field, many journals request that one make a replication dataset available.
2. in my field, the peer reviewer normally has access to the underlying dataset for a publication and that dataset is taken into account when assessing the publication.
3. in my field, it is feasible for a peer reviewer to simultaneously assess the publication and the underlying dataset.
4. i think it is important that the underlying dataset be assessed along with the publication during the peer review process.
in the field of life sciences, 17% of the participating scientists said that journals in their discipline more or less require the author of a publication to also submit the underlying dataset. in this discipline, more than half (51%) of respondents consider it important that the dataset published along with an article also be taken into account during the peer review of the publication. many of them have their doubts, however, as to the feasibility of this (42%: "not feasible"; 37%: "feasible"). physical sciences and engineering and social sciences and humanities follow the same trend but with somewhat lower figures.
b. data publications: peer-reviewed publications about datasets
5. in my field, a peer-reviewed journal for data publications could play a valuable role in drawing attention to important datasets.
more than half of the participants (51%) see a valuable role for such journals; more than a quarter (28%) are not in favour of such journals. the differences between the disciplines are only slight.
c. comments on quality by re-users of datasets
6. these comments on quality could be useful for future re-users of datasets.
7. as a re-user, i would certainly take the trouble to add such comments regarding the quality of a dataset that i have re-used.
8. as the producer of a dataset, i would welcome such comments on quality from others.
the proposal received a great deal of support, with hardly any differences between the various disciplines. more than 80% consider these comments to be valuable for future re-users of the datasets. more than 70% say that as a re-user they would take the trouble to add such comments regarding the quality of the dataset, with more than 80% saying that as data producers they would welcome such comments from re-users.
d. citation of datasets
9. as a re-user, i would certainly cite datasets where this was possible.
10. as a researcher, i would welcome the option for "my" dataset to be cited.
this proposal too received a great deal of support and once again the answers given by the respondents from the three disciplines were virtually unanimous. if it were possible, more than three quarters would definitely cite the dataset as a re-user. more than 70% say that as data producers they would welcome the option of their own dataset being cited.
e. support for quality control of datasets at an early stage
11. training in data management would meet a need of many colleagues in my field.
12. in my field, data audits, examining how datasets are compiled and managed, would help improve data management.
a significant majority (63%) of respondents in the life sciences consider that training in data management and auditing of datasets would be valuable. the figure is much lower in physical sciences and engineering (37%), with social sciences and humanities taking up a mid-range position.
popularity index
in order to get an idea of how the above options rank in terms of priority, we asked both about the extent to which the above measures can improve the quality of datasets (= 'do' in the tables below) and the extent to which those measures meet with objections (= 'don't'). a "popularity index" was then produced by subtracting these percentages from one another. the index was compiled for the above proposals plus three measures that are also referred to in the literature and the interviews: a mandatory section on data management in research proposals; open access provision of datasets, perhaps after an embargo period; a code of conduct for dealing with data. in the tables below, the various options are presented according to the different disciplines, listed according to their popularity.
physical sciences and engineering, popularity index (n=61), do / don't / popularity:
open access provision of datasets, perhaps after an embargo period: 59 / 14.8 / 44.2
setting up of data publications: peer-reviewed descriptions of datasets: 41 / 6.6 / 34.4
citing of datasets: 32.8 / 1.6 / 31.2
comments on quality by re-users, to be published with the dataset: 27.9 / 1.6 / 26.3
peer review of the dataset as part of peer review of the publication: 39.3 / 29.5 / 9.8
provision of training in data management: 6.6 / 4.9 / 1.7
code of conduct for researchers for data management and provision of datasets: 13.1 / 14.8 / –1.7
mandatory section on data management in research proposals submitted to bodies financing research: 4.9 / 39.3 / –34.4
periodic data audits: 6.6 / 44.3 / –37.7
social sciences and humanities, popularity index (n=153), do / don't / popularity:
citing of datasets: 37.9 / 4.6 / 33.3
setting up of data publications: peer-reviewed descriptions of datasets: 34 / 5.2 / 28.8
open access provision of datasets, perhaps after an embargo period: 40.5 / 16.3 / 24.2
comments on quality by re-users, to be published with the dataset: 27.5 / 5.2 / 22.3
code of conduct for researchers for data management and provision of datasets: 22.9 / 7.2 / 15.7
provision of training in data management: 17 / 3.3 / 13.7
peer review of the dataset as part of peer review of the publication: 36.6 / 25.5 / 11.1
periodic data audits: 5.9 / 21.6 / –15.7
mandatory section on data management in research proposals submitted to bodies financing research: 11.1 / 30.7 / –19.6
life sciences, popularity index (n=147), do / don't / popularity:
code of conduct for researchers for data management and provision of datasets: 32.7 / 6.1 / 26.6
comments on quality by re-users, to be published with the dataset: 34.7 / 8.2 / 26.5
citing of datasets: 25.2 / 1.4 / 23.8
provision of training in data management: 21.1 / 1.4 / 19.7
setting up of data publications: peer-reviewed descriptions of datasets: 25.9 / 8.8 / 17.1
peer review of the dataset as part of peer review of the publication: 42.2 / 25.2 / 17
periodic data audits: 24.5 / 19 / 5.5
open access provision of datasets, perhaps after an embargo period: 29.9 / 26.5 / 3.4
mandatory section on data management in research proposals submitted to bodies financing research: 19.7 / 29.3 / –9.6
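the popularity index in the tables above is simply the 'do' percentage minus the 'don't' percentage, and the overview further below compresses each index into one '+' or '-' per rounded block of ten points. the sketch below reproduces both steps for two rows taken from the physical sciences and engineering table; the function names are illustrative and the rounding rule is inferred from the published rows.

def popularity(do_pct, dont_pct):
    # popularity index = "do" percentage minus "don't" percentage
    return round(do_pct - dont_pct, 1)

def symbols(index):
    # one "+" or "-" per rounded block of ten popularity points;
    # indices that round to zero are left blank in the overview
    blocks = round(index / 10)
    return "+" * blocks if blocks > 0 else "-" * (-blocks)

rows = [
    ("open access provision of datasets, perhaps after an embargo period", 59.0, 14.8),
    ("periodic data audits", 6.6, 44.3),
]
for name, do, dont in rows:
    index = popularity(do, dont)
    print(f"{index:+6.1f}  {symbols(index):<5} {name}")
# prints +44.2 with ++++ for the first row and -37.7 with ---- for the second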
analysis
scientists and scholars in all disciplines would welcome greater clarity regarding the re-use of their data, both through citations and through comments by re-users. setting up special journals for data publications is also popular in all disciplines. the view regarding a mandatory section on data management in research proposals is also unanimous, but negative; the decisive factor here is a fear of bureaucracy. it is striking that the high score in all disciplines for extending the peer review of an article to the replication data published along with it is largely negated by the objections. the reason given in the explanations is the excessive burden on peer reviewers; it would seem that it is here that the peer review system comes up against the limits of what is possible. the popularity (or lack of popularity) of the various other measures is clearly specific to the disciplines concerned. although open access to data is popular in physical sciences and engineering and in social sciences and humanities, those in the field of life sciences have major objections. at the moment, their primary need would seem to be a code of conduct for the provision of data. a need is felt for this in social sciences and humanities too, although to a lesser extent. one reason for this opinion may be related to the two types of data referred to above: physical measurements as opposed to human-related data. there is a certain correlation between data management training and data audits. for both, the score is low in physical sciences and engineering, somewhat higher in social sciences and humanities, and moderate in life sciences. in all the disciplines, training in data management scores better than data audits. this is possibly because training must precede audits. if we round off the popularity indices at 10 points and show each 10 by a + or a –, the result is the following overview (physical sciences and engineering / social sciences and humanities / life sciences):
setting up of data publications: peer-reviewed descriptions of datasets: +++ / +++ / ++
citing of datasets: +++ / +++ / ++
comments on quality by re-users, to be published with the dataset: +++ / ++ / +++
open access provision of datasets, perhaps after an embargo period: ++++ / ++ / (blank)
code of conduct for researchers for data management and provision of datasets: (blank) / ++ / +++
peer review of the dataset as part of peer review of the publication: + / + / ++
provision of training in data management: (blank) / + / ++
periodic data audits: –––– / –– / +
mandatory section on data management in research proposals submitted to bodies financing research: ––– / –– / –
recommendations
the following three quality improvement measures meet a need in all the different disciplines and should therefore be implemented as soon as possible. together, they form a synergistic package and serve to operationalise a current recommendation to the european commission by the high level expert group on scientific data [1]:
- establish journals for data publications;
- make the citing of datasets possible;
- promote the provision of quality-related user comments on datasets.
these measures will improve the quality of research data. they are endorsed by the research community and they encourage producers to circulate their data. it is recommended that an analysis be made of the structure, relevant parties, and financing for such measures. surf could act as the initiator for such research, which should also be positioned within an international context. measures that are currently advised against are the introduction of a mandatory section on data management in research proposals and — except in the field of life sciences — the institution of periodic data audits.
opposition to these measures is significant, presumably based on a fear of bureaucracy. that fear should be removed by making clear that it is an effective, "light" approach that is being advocated. research councils such as the netherlands organisation for scientific research (nwo) and university associations (the association of universities in the netherlands, vsnu) should take the lead in this. where opinions of the other measures are concerned, there are significant differences between disciplines. open access to data scores well in physical sciences and engineering whereas this discipline expresses little need for a code of conduct, training in data management, or peer review of datasets as a component of peer review of the publication concerned. the response of life sciences is complementary. social sciences and humanities take up an intermediate position. where a code of conduct for life sciences is concerned, the initiative by the nfumc [25] might provide a relevant context. the attitude in physical sciences and engineering would seem to be that quality control of data can best be effectuated through citation of datasets and quality-related comments on those datasets which are made available through open access data publications. no need is expressed for codes of conduct, training in data management, or peer review of data that is published together with articles. in life sciences, there is first and foremost a need for a code of conduct for dealing with data. training in data management fits in with this. a direct judgment on quality can be given through peer review of the data that is published together with articles and through quality-related comments, a derived judgment through data publications and citations. open access to data does not score highly. interestingly enough, life sciences are ahead of the other disciplines as regards open access to articles.   acknowledgement we thank marnix van berchum of surffoundation for his support in the realization of this article.   references publications (in reverse chronological order) [1] riding the wave — how europe can gain from the rising tide of scientific data — final report of the high level expert group on scientific data october 2010. http://ec.europa.eu/information_society/newsroom/cf/document.cfm?action=display&doc_id=707 [2] open to all? case studies of openness in research. a joint rin/nesta report. september 2010. http://www.rin.ac.uk/our-work/data-management-and-curation/open-science-case-studies [3] data sharing policy: version 1.1 (june 2010 update). biotechnology and biological sciences research council uk. http://www.bbsrc.ac.uk/web/files/policies/data-sharing-policy.pdf [4] quality assurance and assessment of scholarly research. rin report. may 2010. http://www.rin.ac.uk/quality-assurance [5] rechtspraak en digitale rechtsbronnen: nieuwe kansen, nieuwe plichten. marc van opijnen. rechtstreeks 1/2010. http://www.rechtspraak.nl/nr/rdonlyres/6f244371-265f-4348-b7bd-22eb0c892811/0/rechtstreeks20101.pdf [6] perceived documentation quality of social science data. jinfang niu. 2009. http://deepblue.lib.umich.edu/bitstream/2027.42/63871/1/niujf_1.pdf [7] the publication of research data: researcher attitudes and behaviour. aaron griffiths, research information network the international journal of digital curation issue 1, volume 4, 2009. http://www.ijdc.net/index.php/ijdc/article/viewfile/101/76 [8] managing and sharing data, a best practice guide for researchers, 2nd edition. uk data archive. 18 september 2009. 
http://www.esds.ac.uk/news/publications/managingsharing.pdf [9] e-irg and esfri, report on data management. data management task force. december 2009. http://ec.europa.eu/research/infrastructures/pdf/esfri/publications/esfri_e_irg_report_data_management_december_2009_en.pdf [10] australian national data service (ands) interim business plan, 2008/9. http://ands.org.au/andsinterimbusinessplan-final.pdf [11] peer review: benefits, perceptions and alternatives. prc summary papers 4. 2008. http://www.publishingresearch.net/documents/prcpeerreviewsummaryreport-final-e-version.pdf [12] to share or not to share: publication and quality assurance of research data outputs. rin report; main report. june 2008. http://eprints.ecs.soton.ac.uk/16742/1/published_report_-_main_-_final.pdf [13] the genographic project public participation mitochondrial dna database. behar, d.m, rosset, s., blue-smith, j., balanovsky, o., tzur, s., comas, d., quintana-murci, l., tyler-smith, c., spencer wells, r. plos genet 3 (6). 29 june 2007. http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.0030104 [14] dealing with data: roles, rights, responsibilities and relationships, consultancy report. dr liz lyon, ukoln, university of bath. 19 june 2007. http://www.jisc.ac.uk/media/documents/programmes/digitalrepositories/dealing_with_data_report-final.pdf [15] oecd principles and guidelines for access to research data from public funding. 2007. http://www.oecd.org/dataoecd/9/61/38500813.pdf websites (accessed in august and september 2010) [16] acta crystallographica section e: structure reports online. http://journals.iucr.org/e/journalhomepage.html [17] earth system science data. the data publishing journal. http://www.earth-system-science-data.net/review/ms_evaluation_criteria.html [18] appraisal criteria in detail. inter-university consortium for political and social research. http://www.icpsr.umich.edu/icpsrweb/icpsr/curation/appraisal.jsp [19] data seal of approval. data archiving & networked services (dans). http://www.datasealofapproval.org/ [20] nwo-dans data contract. http://www.dans.knaw.nl/sites/default/files/file/nwo-dans_datacontract.pdf [21] world ocean database 2009. noaa atlas nesdis 66. ftp://ftp.nodc.noaa.gov/pub/wod09/doc/wod09_intro.pdf [22] rcsb protein data bank deposition portal. http://deposit.rcsb.org/ [23] data documentation initiative. http://www.ddialliance.org/ [24] european research council. http://en.wikipedia.org/wiki/european_research_council [25] the string of pearls initiative. netherlands federation of university medical centres. http://www.string-of-pearls.org/   about the authors leo waaijers has a long-term commitment to (inter-)national open access developments, first as the university librarian of delft university of technology (1988) and later in a corresponding post at wageningen university & research centre (2001). he concluded his career as the manager of the surf platform ict and research where he managed the national dare programme (2004-2008). in 2008 he won the sparc europe award for outstanding achievements in scholarly communications. after his retirement he advised about the open access infrastructure of the irish universities (together with maurice vanderfeesten) and in 2009 he evaluated the swedish national open access programme "open access.se" (together with hanne marie kvaerndrup).   maurits van der graaf started pleiade management and consultancy in 2000, focusing on archives, libraries and publishers. 
its activities include market research and evaluation, and strategic studies. recent assignments include a study on repositories in europe, on digital-born cultural heritage in the netherlands and on the usage of e-books (see http://www.pleiade.nl). before pleiade he held various jobs at publishers and libraries, including product manager of the excerpta medica database (elsevier), director of the dutch current research agency (nboi) and deputy-director of the netherlands institute for scientific information services (niwi). he studied biology at the university of utrecht.
copyright © 2011 leo waaijers and maurits van der graaf
collection level description: the riding and agora experience
d-lib magazine september 2000 volume 6 number 9 issn 1082-9873
dr. e.v. brack, university of sheffield, western bank, sheffield s10 2tn, tel: +44 (0)114 222 1143
david palmer, agora/assistant librarian, university of east anglia, norwich nr4 7tj, tel: +44 (0)1603 593523
bridget robinson, agora communications coordinator, ukoln (uk office for library & information networking), university of bath, bath ba2 7ay, tel: +44 (0)1225 323343
abstract
this article will examine the background and results of the elib working group on collection level descriptions (clds) and look at the implementation of clds in two of the elib phase 3 library projects, riding and agora.
introduction
the clumps projects and agora are based on the need to aid discovery of, and increase access to, the vast scholarly resources available to the higher education community, as proposed in the anderson report [1]. these projects are investigating the use of z39.50 technology in opening up access to these resources, and providing "the means to locate and to gain access to material with reasonable ease, reasonable speed and at reasonable cost to individuals and individual institutions" as recommended by anderson. in order to accomplish this task, many of the projects within elib phase 3, especially the clump projects and the hybrid library projects, require metadata about the resources they are handling. in general, knowledge of library collections, or any other types of collections, is not easy to obtain. the riding gateway [2] was intended to address this lack of information by including a searchable database of collection descriptions that would provide information on what was available. agora’s entire approach to the access to, and organization of, the vast array of resources envisaged by a hybrid library is based upon the use of rich metadata as exemplified by collection descriptions.
elib working group on clds
in september 1998, a group of interested people met in london to discuss the development of a scheme for describing collections; this group was convened under the auspices of ukoln [3] and the first meeting attracted representatives from elib projects, the jisc data centres, software vendors, and the british library. a small national working group resulted from this meeting and was tasked with putting together a set of collection description elements. the emphasis was on producing a simple, practical framework for describing collections in general. the time available for discussions was limited, as the projects, riding in particular, needed to have a basic framework ready for trial within a few months.
papers and discussions of this working party are available on the ukoln web site. this includes an elib supporting study, "collection level description: a review of existing practice". subsequently, the riding clump project created a riding collection description working group to discuss the particular requirements for riding collections; this group developed a prototype scheme applicable to riding, and created a number of collection descriptions for the gateway. the music libraries online and agora projects, which use the same gateway software as riding, also required collection descriptions for their user interfaces, and from summer 1999 met regularly with riding to further develop the scheme for general use. the scheme was refined and added to in light of experience from all three projects; this elib collection description group kept in close contact with the national working group to ensure progress was made along common lines. later in 1999, the other clumps projects, cairns and m25, also contributed to the development of the scheme. the purpose of the riding collection description scheme was to describe, in a standard manner, any type of collection, physical or virtual (electronic), networked or otherwise. the collections could be of anything: books and other library materials, art works, sculpture, living material, digital or physical items; so the scheme needed to be appropriately generic in order to encompass such diverse collections. the metadata elements of the scheme should:
- allow users to discover, locate and access collections of interest;
- allow users to perform searches across multiple collections in a controlled way;
- allow software to perform such tasks on behalf of users based on known user preferences.
the first task was to agree on the definition of a 'collection', and the simplest working definition was "a grouping of individual items or other collections". obviously there is often a blurring of distinctions here: a collection may be made up of other collections as well as items, or items and collections together, and an item itself may be made up of other items; e.g., a catalogue is actually a collection of catalogue records, and a web page actually comprises text, images, etc., although people will think of a web page as a single item. it was left up to the person who probably knows the collection best (i.e., the person in charge of the material) to make the decision whether it should be described as a collection. as the purpose underlying the development of the scheme was to aid discovery and location of resources, it was recognised that important additional information about access to a collection should be included, and that access may be via one or more services, particularly in the case of electronic collections; for instance, the medline database is available on cd-rom and on the web. different terms and conditions of use may apply to using the collection, depending on the service used to access it. various existing schemes for describing collections were studied at the beginning of the work, including dublin core [4], gils (global information locator service) [5], and isad(g) (general international standard archival description) [6]. other elib phase 3 projects also worked on producing their own collection descriptions during this time, and the malibu project's scheme was consulted in detail. discussions with archivists and museum personnel also took place.
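the working definition above, "a grouping of individual items or other collections", is naturally recursive and can be pictured as a nested data structure. the sketch below is purely illustrative; the class names and example collections are invented and do not belong to any of the schemes discussed here.

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Item:
    title: str

@dataclass
class Collection:
    title: str
    members: List[Union["Collection", Item]] = field(default_factory=list)

    def item_count(self):
        # count individual items, descending into nested collections
        return sum(m.item_count() if isinstance(m, Collection) else 1
                   for m in self.members)

# e.g. an opac treated as a collection of catalogue records, itself a member
# of the library's overall collection
opac = Collection("opac", [Item("catalogue record 1"), Item("catalogue record 2")])
library = Collection("university library", [opac, Item("printed guide to special collections")])
print(library.item_count())  # 3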
simple collection description
the original simple collection description scheme produced by the national working group had 23 elements, 12 of which were taken from the dublin core metadata format; they were grouped into elements describing a collection, and those describing a service used to access a collection. a number of collections were catalogued using this first version of the scheme, and issues arising from this exercise were discussed via e-mail. the scheme was refined and a report was produced in october 1998 [7]. discussions are ongoing and the scheme is constantly under revision; to date there are still several unresolved issues, for example the use of controlled lists for the subject/keywords element.
implementation of collection descriptions by riding
following on from the discussions by the national working group on collection description, riding personnel took the prototype collection descriptions created and reduced the scheme to a subset of fields, based on those most frequently containing data in the collections described for the national working party report. this produced a set of 15 descriptive elements:
1. title
2. subject/keywords
3. content description
4. collection administrator
5. collection owner
6. publisher
7. language
8. coverage
9. creation date
10. collection type
11. relation
12. location and access
13. access times and availability
14. access policy
15. charging policy
the use of standard controlled lists of terms is encouraged, for example in the language field (7), where the iso 639 language codes are used. controlled lists are also used for the collection type (10) and relation (11) fields, but in these cases the working group have both created their own and have made additions to an existing authority list using the dublin core relation metadata elements (see appendix b). the riding collection description working group, after discussions and amendments of the proposed description elements, asked all ten riding project libraries to use the scheme to produce a minimum of five collection descriptions each, including one of their library as a whole and one of their opac (which is a collection of catalogue records). the results of this exercise were approximately 60 descriptions of collections held by the riding libraries; these were originally made available as browsable web pages at the riding gateway and later as part of a searchable database. there were obvious difficulties of interpretation in the way that people described their collections, and it also proved difficult to bear in mind (at least for librarians!) that the scheme is intended to be applicable to any type of collection, not just printed material. further discussions looked at the use of fixed fields and authority lists versus free text, and added, merged and deleted elements. the scheme implemented at the riding gateway shows some differences to the elib scheme as the riding version was implemented while the elib scheme was still under development.
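the riding subset listed above can be pictured as a simple record with one slot per element. the sketch below is illustrative only: the fifteen element names come from the list above, while the controlled list for collection type and the sample values are invented placeholders rather than the working group's actual authority lists.

RIDING_ELEMENTS = [
    "title", "subject/keywords", "content description", "collection administrator",
    "collection owner", "publisher", "language", "coverage", "creation date",
    "collection type", "relation", "location and access",
    "access times and availability", "access policy", "charging policy",
]

# hypothetical controlled list for element 10, collection type
COLLECTION_TYPES = {"library", "opac", "gateway", "database"}

def new_description(values):
    # return a description containing every riding element,
    # rejecting anything that is not part of the scheme
    unknown = set(values) - set(RIDING_ELEMENTS)
    if unknown:
        raise ValueError(f"not in the riding scheme: {unknown}")
    return {element: values.get(element, "") for element in RIDING_ELEMENTS}

cld = new_description({
    "title": "university library opac",
    "language": "eng",          # an iso 639 code, as the scheme encourages
    "collection type": "opac",  # drawn from the (hypothetical) controlled list above
})
assert cld["collection type"] in COLLECTION_TYPES
print(cld["title"])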
the elib scheme
the discussions by the elib cld group so far have resulted in a collection description scheme that comprises 29 descriptive elements, only a few of which are mandatory. where possible, authority lists are used rather than free text, and urls are included if required. riding had specific needs for some fields to be mandatory, although other projects do not require this. the elements are in two groups, those describing the collection itself and those describing access to the collection. the elements of the scheme are listed in appendix a. each element has a number, a name (meaningful to humans), a label (for software), a semantic description, and specifications for field type and length, status, and searchability. as far as possible, elements from dublin core have been used, though one, coverage, has been split into two elements, spatial coverage and temporal coverage.
implementation of collection descriptions by agora
the riding project has been an important catalyst in discussion and implementation of clds. as riding draws to a close in terms of ongoing development, agora now takes up the challenge and is playing a key role in the creation of clds in relation to the landscape function. agora [8] is a consortium-based project led by the university of east anglia; partners are ukoln, fretwell-downing informatics and cerlim (the centre for research in library and information management at manchester metropolitan university). the project also works with several associate groups: libraries, service providers and systems developers. the project is developing a hybrid library management system (hlms) to provide integrated access to distributed information services. in parallel with this it is also developing library skills and experience in the management of hybrid resources.
the agora development framework
agora is based on concepts that emerged from the models project, moving to distributed environments for library services [9]. models has been developing frameworks for managing distributed resources to enable truly integrated access. the central part of the agora framework is a layer of 'broker' services or 'middleware' which shields the user from the complex and repetitive processes involved in interacting with individual services. the web provides the primary end-user access point to the agora gateway. agora is based on fretwell-downing's vdx software, which is also the basis of the riding project. the concept of information landscapes is integral to the agora organization and presentation of resources. the term "landscape" is used to describe a way of presenting different views of information resources to users according to their interests and needs. agora is exploring the construction of information landscapes as part of its user-centred focus. in order to provide information landscaping, it is necessary to match information about users against information about resources: collection level descriptions. this work is leading to the integration of collection level descriptions as a metadata standard for describing resources and to facilitate discovery and organization of resources. the collection level descriptions in the agora project have been created using the same schema as riding. all the participating library associates were sent scheme templates, guidelines for completion, and supporting documentation. they then used this information to complete clds describing their own collections and library catalogues. this was largely successful, although the descriptions would have benefited from more detail. the resulting clds have been loaded into agora release 1. service providers were also asked to complete clds relevant to their collections. the response from the suppliers was limited, but those that did respond provided full data. in the absence of supplier-sourced data, agora project staff input limited data; the intent being to provide only enough data to allow the system functionality to be tested.
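the landscaping idea described above, matching information about users against collection level descriptions, can be pictured as a simple filter over the subject/keywords element. the sketch below is hypothetical: the matching rule, field names and sample records are invented for illustration and do not describe the vdx implementation.

def landscape(user_subjects, descriptions):
    # return the descriptions whose subject/keywords overlap the user's interests
    wanted = {subject.lower() for subject in user_subjects}
    return [d for d in descriptions
            if wanted & {s.lower() for s in d.get("subject/keywords", [])}]

descriptions = [
    {"title": "music scores collection", "subject/keywords": ["music"]},
    {"title": "social sciences data archive", "subject/keywords": ["social sciences"]},
]

# a user whose profile lists "music" is shown only the first collection
print([d["title"] for d in landscape(["music"], descriptions)])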
there are currently 58 clds in the first release of the agora hlms; these include catalogues (library and internet), gateways, commercial databases and other mixed media. as yet, no non-electronically accessible sources have been included in release 1. the clds are critical to the agora concept of the hlms. technically, the clds are held within vdx and are inextricably linked to the target/database information in vdx. no resource is available for searching without a cld and a link to a target/database. the clds can be used to provide a guide to the aggregation of resources into "landscapes" and as a guide to the resources themselves. however, there are some limits to their use as "aggregator-searcher", as the search can only be initiated across one attribute, e.g., subject or title or collection type. a total of nine separate attributes may be used as a search term. another limitation is the experimental nature of the schema itself; for example, no existing "collection type" has been found to cover oclc databases and therefore they cannot, as yet, be searched by that attribute. the clds can be created and modified by each associate site. it is hoped that this local editing facility will lead to the generation of fuller, richer clds as the project progresses. it also allows for the inclusion of additional targets and collections in response to user requirements.
user case studies
release 1 of the agora hlms is currently providing a platform for user case studies. these are being carried out at five library associate sites. the studies are diverse in nature, examining a range of issues pertaining to the implementation and use of the hlms. the studies address three areas: functionality of the system, how different user groups use the system, and the training/management tools required. under the area of functionality, special emphasis will be placed on the location and organization of resources within the system. this will in turn provide invaluable feedback as to the usefulness of clds. the results of the case studies will be published at the end of the year. in the meantime, there will be a second release of agora incorporating a more advanced cld search facility, allowing combination of several cld attributes in one search. the improved functionality and more intuitive interface should provide the project with additional information on the use and relevance of clds and their contribution to the hybrid library.
issues and future developments
the issues surrounding the development of clds can be divided into two discrete areas, albeit interconnected: one, the theoretical development of clds and their components, and two, the implementation of clds in "real" hlms systems. in considering the "theoretical" issues, several come to the forefront: the utility of the schema across collection types, the definition of collection, and the use of controlled vocabulary. the scheme is very useful for describing special collections but is not yet completely suitable for large, general collections, e.g., for describing the whole of a university library’s holdings. also requiring further discussion are issues such as how to define what is a collection and what is a subject strength; how to describe a collection that is not physically together; and how to describe the level of a collection. the use of controlled lists for the subject element of the scheme has raised several important issues, which have been discussed within elib, nationally, and internationally.
the use of controlled lists for the subject element of the scheme has raised several important issues, which have been discussed within elib, nationally, and internationally. it was firmly agreed that a controlled list or subject classification should be used, but it was not clear which one. subject classifications such as dewey and the library of congress subject headings were too complex at lower levels and too simple at the highest levels. large, general library collections covering broad subjects such as social sciences, and narrow definitions such as fairground history, need to be included, and a hierarchical scheme is the only way in which these needs can be met. however, there still remains the problem of a general list of subject fields that covers all subjects, both academic and popular, with enough detail to be useful. a number of other listings and approaches were also attempted, but none were felt to be particularly useful, especially as quite often there was an obvious academic bias. the problem was discussed with the other participants in the elib collection description scheme and also with the other clump projects, who suggested different solutions. a consensus has emerged that the solution offering the most promise is the subject headings that the bubl link database [10] has created. this list was originally based on dewey and library of congress classifications, but has been adjusted over several years to uk requirements. the bubl scheme has roughly 170 main subject headings, all mapped to dewey classes, and in addition has around 1400 narrower terms. further development is being proposed through funding for a project called hilt (high level thesaurus) under the research support for libraries programme [11], led by dennis nicholson of strathclyde university. in the meantime, agora and riding decided to adopt a pragmatic solution: a simple scheme with a limited number of broad subjects. it is based on the m25 and research assessment exercise list of subject terms. in turning to the implementation of clds, it should be noted that any comments from agora will be very preliminary, as the results of the release 1 user case studies have not yet been seen; these will be published in autumn 2000. for example, it is not clear as yet whether clds are used more as search tools to identify collections or as metadata that simply describes collections. however, there are some issues that have already arisen which are either new or reflect the issues identified in the "theoretical" development of clds. agora has found that the lack of controlled vocabulary has been both less, and more, of a problem than first anticipated. less, in that most library associates within the project had chosen to use the values that were already in the system, and more, in that there has been little or no development within the project of any theoretical underpinning for a controlled vocabulary. as noted above, the initial values entered into the agora system were to facilitate compliance testing and were not in any way an attempt at a comprehensive or cohesive set of controlled language terms. further work with release 2 of the agora hlms hopes to address this issue.
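the kind of scheme just described, with broad headings mapped to dewey classes and carrying narrower terms, could be represented along the following lines. this is an illustrative sketch only; the headings, classes and narrower terms are made up and are not taken from the bubl list:

    # illustrative sketch of a bubl-style subject list: broad headings mapped to
    # dewey classes, each with narrower terms (all values here are invented)
    subject_scheme = {
        "social sciences": {"dewey": "300", "narrower": ["sociology", "economics"]},
        "history":         {"dewey": "900", "narrower": ["fairground history", "local history"]},
    }

    def broad_heading_for(term):
        # resolve a term entered by a cataloguer to its broad heading and dewey class
        for heading, info in subject_scheme.items():
            if term == heading or term in info["narrower"]:
                return heading, info["dewey"]
        return None

a controlled layer of this sort is the kind of structure that the hilt proposal and the release 2 work might eventually supply.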
another issue is the possibility of different interpretations of any schema, not simply that used in agora. this becomes an issue where suppliers are asked to provide clds and may well have a different interpretation of the components of the schema than the system administrator, or than each other. clear content guidelines need to be absorbed by the suppliers, and agreements have to be made between the system host and supplier as to the responsibility and authority for any changes to cld data. because of the low response rate from suppliers, the extent of this problem and the consequent issue of editing of cld data have yet to be fully explored. another, albeit lesser, issue has been the use of the elib working group schema itself. the project is aware that the development of clds may well be moving in another direction theoretically, but, given the necessity of "freezing" a version of the cld schema for development and evaluation, agora (release 1) and riding do not necessarily reflect the most recent thinking in this area. the incorporation of clds within any hlms in an operational sense will require some stability in the state of the schema and the values within the schema. conclusion the experience of both riding and agora has shown that the use of clds is useful and desirable but that, as with any new scheme of metadata, there are still many issues to resolve, both in terms of the development of the clds themselves and in their implementation in "real-life" situations. the use of controlled language, gaining commitment to a common schema, and the definition of "collection" and its component parts all need to be addressed within the context of the development of clds. the release 2 phase of the agora project will look at issues of implementation and use of clds, and hopefully spur further research by other projects or institutions in this area. references [1] report of the group on a national/regional strategy for library provision for researchers, hefce, 1995. [2] the riding gateway: [3] ukoln, the uk office for library & information networking, is funded by resource: the council for museums, archives & libraries (the organization succeeding the library and information commission), the joint information systems committee (jisc) and the european union. ukoln also receives support from the university of bath, where it is based. [4] dublin core website: [5] gils (global information locator service): [6] isad(g) (general international standard archival description): [7] ukoln collection description working group: work in progress. [8] the agora project website: [9] models website: [10] bubl link database: [11] research support for libraries programme: appendices appendix a: elib collection description scheme, july 1999. appendix b: controlled lists for use with the elib collection description. copyright © 2000 dr. e.v. brack, david palmer, and bridget robinson. doi: 10.1045/september2000-brack   model-oriented scientific research reports d-lib magazine, may/june 2011, volume 17, number 5/6. robert b. allen, college of information science and technology, drexel university, rba@boballen.info. doi:10.1045/may2011-allen   abstract while the familiar text-based scientific research report format has served well, it has shortcomings that would be ameliorated by a model-oriented approach. although many aspects of scientific research reports are already structured, we propose extending structure as far as possible. models which have been developed in computer and information science for structuring information would provide the framework for scientific research reports while text would play a smaller role.
we introduce an approach to describing research design and provide a framework for conceptual process models. we also outline some possibilities for user interfaces and a library of research reports that leverage the model-oriented approach. model-oriented research reports would improve the traditional imrd format by providing much greater structure and specification of constraints.   1. from text-based research reports to model-oriented research reports scientific communication takes many forms, but the research report holds a special place. research reports are the primary record of research activities and are extensively used as teaching tools. however, while textual research reports allow a flexible and rich expressiveness, they have several limitations: (a) indexing for text depends on the terms which happen to be used; (b) textual research reports are not readily navigable by browsing; (c) information extraction from text is tedious and error-prone; (d) textual research reports do not readily support completeness and consistency checking; and (e) they must be translated across languages. early research reports were simply narrative descriptions. structure has increasingly been used to address some of the limitations of that simple narrative text. by the 1950s, most research reports had adopted the imrd (introduction, methods, results, and discussion) format [11]. recently, structured abstracts (e.g., [1]), which impose specific topics to be addressed in the abstract, have been adopted in some fields. in addition to increasing structure in written reports, we see greater use of structure for data description and preservation (e.g., [7]). similarly, workflows are increasingly common for describing and replicating aspects of scientific processing [12]; for instance, the kepler project allowed workflows to be specified for data analysis procedures. adding even more structure to research reports can further overcome some of the limitations of text and improve communication. we propose applying to scientific communication modeling techniques such as those developed in computer and information science (cis). a scientific research report has interlocking constraints from end to end; the research question affects the research design, and the research design in turn determines the data that is collected and the analysis of those data. a highly structured research report (cf., [3]) would incorporate specification of the interlocking constraints, and the constraints could be highlighted for human users. because the research report would be highly structured, some consistency checks would be possible, but we do not emphasize formal compilation or validation of the relationships. the constraints would be flexible, in the way that thesaurus terms constitute only a simple knowledge organizing system. the limitations of text identified earlier could be ameliorated by this approach, even though it has often proved difficult to develop exact semantics for the real world. we explore the possibility of model-oriented research reports by outlining a framework for them. in section 2, we outline some of the major features of our approach. in section 3, we consider variations of model-oriented research reports for different research paradigms. section 4 discusses how the model-oriented research report approach might be deployed, from research using such reports, to libraries of model-oriented research reports and beyond. section 5 is the conclusion. in the technical appendix, we describe a specific approach to conceptual process models.
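the end-to-end constraints described above could themselves be recorded as part of the report rather than left implicit in the prose. the following is a minimal sketch of that idea, written in python for concreteness; the component names and fields are hypothetical and not part of any proposed standard:

    # minimal sketch of interlocking constraints in a structured report
    # (component names and fields are hypothetical)
    from dataclasses import dataclass, field

    @dataclass
    class ReportComponent:
        name: str
        content: dict
        constrained_by: list = field(default_factory=list)   # upstream components

    question = ReportComponent("research_question", {"asks": "what causes spoilage of broth?"})
    design   = ReportComponent("research_design", {"type": "controlled experiment"}, [question])
    data     = ReportComponent("results", {"observations": []}, [design])
    analysis = ReportComponent("analysis", {"method": "qualitative comparison"}, [data])

    def check_chain(components):
        # a light consistency check in the spirit of the text: every component
        # except the question should point back to at least one upstream constraint
        return all(c.name == "research_question" or c.constrained_by for c in components)

    # check_chain([question, design, data, analysis]) -> True

links of this kind could be highlighted for human readers without being formally compiled or validated, as emphasized above.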
2. major features and implications of model-oriented research reports there are several ways in which models could be used to enhance research reports. in this section, we consider two major areas of innovation that form the basis of the approach. in section 3, we present a comprehensive scenario for model-oriented research reports. as illustration, we use pasteur's classic experiment on the causes of spoilage of nutrient-rich solutions such as broth, beer, and wine [8]. 2.1. toward a notation for research designs research designs are highly structured. as an extreme case, a complex analysis of variance is often described with a model for nesting, repeated measures, etc. we seek to capture such structures and other aspects of research methods and to incorporate them as a part of model-oriented research reports. campbell and stanley [5] use a simple notation to describe the sequences of manipulation and observation in experimental and quasi-experimental designs. their purpose was to illustrate differences in a set of prototypical research designs; we use this notation as the basis for a simple language for describing the research design and extend it with additional operators such as initialization, randomization, timing, grouping, conditionals, repetition (i.e., looping), and data recording. we apply this notation to pasteur's experiment. pasteur showed that broth would not spoil when air could reach it but microbe-laden dust could not. spoilage occurred when dust was allowed to come into contact with the broth. for pasteur's experiment, the basic research design was straightforward: (1) initialize the conditions (boil broth, place it in a flask and create the swan neck for the flask), (2) wait, observe and record, (3) manipulate (tilt the flask so that some of the broth flows into the swan neck and comes into contact with the dust), and (4) wait, observe and record. 2.2. systematic description of conceptual process models in general, conceptual models represent aspects of the world. they include any non-physical model and range greatly in detail. some conceptual models simply describe entities and systems, while other conceptual models incorporate interactions and processes. we are especially interested in the latter, which we term conceptual process models. the technical appendix elaborates on using entities/systems and interactions/processes as the bases for conceptual process models. articulating conceptual process models is an important, if sometimes implicit, activity for science [12], and scientific research attempts to establish, explore or validate them. in research reports, the conceptual models being tested are typically presented in the introduction and described in a relatively predictable format. swales [11] has described such a process of developing a hypothesis for testing as "creating a research space" and proposed a set of functional action units for accomplishing that. swales's approach has been used to characterize textual genres. we propose that it be applied in conjunction with conceptual process models.
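the extended design notation of section 2.1, applied to pasteur's experiment, might be written down roughly as follows. the operator names are invented for illustration; they simply follow the operators listed above (initialization, timing, manipulation, observation, and data recording):

    # a toy rendering of the extended design notation for pasteur's experiment
    # (operator names are invented; only a subset of the operators is used)
    pasteur_design = [
        ("init",       "boil broth; place in flask; draw out swan neck"),
        ("wait",       "days to weeks"),
        ("observe",    "is the broth spoiled?"),
        ("record",     "observation 1"),
        ("manipulate", "tilt flask so broth contacts dust in the swan neck"),
        ("wait",       "days"),
        ("observe",    "is the broth spoiled?"),
        ("record",     "observation 2"),
    ]

    def render(design):
        # compact one-line trace in the spirit of campbell and stanley's o/x sequences
        symbols = {"init": "I", "wait": "-", "observe": "O", "record": "R", "manipulate": "X"}
        return " ".join(symbols[step] for step, _ in design)

    # render(pasteur_design) -> "I - O R X - O R"

a fuller language would add the randomization, grouping, conditional, and looping operators mentioned above.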
3. an example of a more complete model-oriented research report for scientific experiments the previous section described distinctive features and implications of a model-oriented research report. a complete research report would comprise a unified wrapper with four components. each component may have extensive textual commentary. these four components parallel the traditional imrd framework but enhance it by providing much greater structure and specification of constraints. the elements of each of these components could be specified with a notation like xpath and xpointer, which identify specific locations in xml documents. 3.1. metadata, research question, and background the first component of a model-oriented research report provides the context of the report: metadata, research questions, and background. the metadata describes attributes of the report itself, such as the author and date. the research question would be framed as a model, but unlike typical cis models, it would be incomplete, indicating the question to be researched. the research question is elaborated by linking to previous research reports. it suggests areas which require further investigation, and proposes an investigation strategy. this process of critiquing previous results would be accomplished with citation links and, ultimately, the conceptual process models to be tested would be generated. for pasteur's microbe experiment (as introduced in section 2), the research question was to identify the entities and processes which lead to spoilage of food products. earlier experiments had shown that there was no spontaneous generation for large organisms such as flies, but the source of the microbes that spoiled food products was still debated. the research question was incomplete as to processes. 3.2. conceptual process models to be tested, and research strategy and design the conceptual process models that are to be tested need to be elaborated, based on previous findings and on the specific research question. further, the research strategy must be specified, again based on previous findings. for pasteur, the critical issue was allowing air to enter the flask without also allowing microbes. determining a suitable set of research manipulations can be a creative act, a form of abductive reasoning that is not automated in this approach but is documented along with the description of the components of the model. 3.3. research implementation/execution, results, and analyses research strategies are high-level. the research implementation/execution provides the details of the research strategy and describes what was actually done in the research. for pasteur, this includes descriptions of heating flasks and creating swan necks for the flasks. observations about specific details of the implementation can be recorded. the results record data resulting from the research implementation/execution related to the research question. at this point, techniques from a wide variety of data preservation strategies could be introduced. for pasteur's study, the data is primarily whether spoilage was observed or not. research involving statistics might include data integration and/or analyses. like text-based research reports, model-oriented research reports would also include observations beyond the immediate goals of the research but still related to the conceptual model being tested. 3.4. research report conclusions finally, the implications of the study would be examined and possibly extended. if no models had been proposed originally, then possible models consistent with the data could be explored at this point. for pasteur's experiment, this section of a model-oriented research report might also include consideration of parameters for some of the treatments for contamination.
for instance, the researcher might ask how much heating is needed to safely kill the microbes as pasteur did when he developed pasteurization.   4. implications and deployment if it were implemented, this proposal would have many benefits and could substantially change the way that researchers access the scientific literature. ultimately, we might imagine a set of services which allowed these reports to be part of a scientist's workbench. 4.1. research interaction with model-oriented reports one way the structure of the model-oriented reports could be used is to provide structure for browsing the reports. for instance, the models could be used for schematic visualization of the processes being researched. in the strong form of our approach, the models would be preeminent with minimal text. we could also imagine a hybrid approach in which both text and schematics of the conceptual process models were presented. the use of such schematics could make the reports more accessible for non-experts. indeed, because they would have a systematic structure, the models might be used for tutorials in which levels of complexity were adapted to the users' backgrounds. user tools could also be developed for authoring model-oriented research reports and for browsing the library. 4.2. research report library collected model-oriented research reports could form a highly interwoven library of research reports. the notion of a unified library of research literature is related to current digital libraries of articles linked by citations. shum et al. [10] have proposed a library in which discourse claims of research articles would be linked. the approach proposed here, particularly the inclusion of process descriptions, should allow a much richer linking of research reports. in our approach, the library would hold several types of content. first, it would archive completed, reviewed model-oriented research reports. second, it would include master records for entities/systems and interactions/processes. as such, it might include a range of empirical values in addition to that accepted by consensus. the library would include a record of research reports which relate to a particular entity as well as its properties and dimensions, states, and their observed values. third, the library would hold annotations, abstractions, classifications, and conceptual frameworks. in addition, it could also contain theoretical studies that combine information from other sources in the library. fourth, it would include standard descriptions of instruments and procedures. the primary attribute would be the functionality — what the instrument measures. there would also be descriptions of standard research processes such as laboratory and analysis workflows. citations, which are links between or among objects in the library, are first-class objects. that is, they are stored separately from the reports they describe. they are links between reports and previous reports in the library. this is comparable to a citation network among collections of text documents. ambiguity should be reduced because of the relative systematicity of the formal notation. there could be several explicit citation types (e.g. [14]). furthermore, the links could be multi-headed to allow components to be combined as needed. indeed, a citation to a new methodology might link to the workflow model for implementing that method. such citations should be more useful than traditional citations because the anchors and the roles will be explicitly defined. 
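stored apart from the reports, such citations might look like the following sketch. the citation types, identifiers and path-like anchors are illustrative only, though the anchors are in the spirit of the xpath/xpointer-style addressing mentioned in section 3:

    # sketch of citations as first-class objects held outside the reports
    # (type names, identifiers and fields are invented)
    citation = {
        "type": "uses-method",                          # explicit citation type
        "source": "report:2011-0042/implementation",    # anchor inside the citing report
        "targets": [                                    # multi-headed: several anchors at once
            "report:2007-0013/methods/workflow",
            "library:procedures/sterile-culture",
        ],
        "note": "adopts the sterile culture workflow with a modified heating step",
    }

    def reports_citing(target, citations):
        # because citations live outside the reports, the citation network can be
        # queried without parsing the reports themselves
        return [c["source"] for c in citations if target in c["targets"]]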
the library would be updated as new observations and results are added and components are refined. as described below, special users may provide additional consistency checks and updates for the contents of the library. in addition, it should be possible for readers to annotate entries in the library. 4.3. deployment to the extent that model-oriented research reports ameliorate the limitations of text-based research reports, the approach would provide better service to researchers and students. the advantages of the model-oriented approach will encourage journals to adopt it much the way that structured abstracts are being adopted. we expect deployment and adoption of model-oriented research reports to follow the familiar s-curve for new technologies. the editorial processing of model-oriented research reports may be organized much the way editorial processing of text-based research reports is now organized, though the cleaner structure of model-oriented research reports and the ease of linking to other research reports may facilitate the review process. the library would be maintained through several channels. in many cases, there would be routine additions of new data when the library is updated upon the library editor's approval. in other cases, new findings might require substantially revising conceptual structures in the library. we also envision that theoreticians may recommend the addition or revision of entities/systems or interactions/processes within the library. there may be human "gardeners" who would participate in revisions, but, eventually sophisticated version management techniques would be required. 4.4. applying the model-oriented research report concept beyond scientific research reports research is extremely varied. nonetheless, model-oriented research reports might be used for other paradigms for scientific research and other types of research, beyond that exemplified by pasteur's experiment. exploring properties and states: science often involves empirical observations of parameters for existing models, such as exploring melting points or mapping phase diagrams of substances. this is readily handled with the model-oriented approach by simply providing links to the entities whose properties are being studied and presumably to standard procedures for that type of investigation. describing specific natural phenomena: natural phenomena are often the result of combinations of processes (e.g., [4]). analyses of such phenomena are often a form of forensics. in this case, the research often involves compiling and analyzing evidence in support of instantiations of conceptual processes. this process can be readily described by a model-oriented research report. induction of, fitting, or selecting models from observations: in some cases, so little is known about a phenomenon that there is no plausible conceptual process model to describe it that may be tested. in such cases, the best strategy may be simply to gather and then organize observations through induction. the model-oriented research report can provide a framework for describing such research. closely related to induction of process models is fitting models to data. research employing structural equation models typically attempts to select among the possible models given the data. usually, these structural equation models are applied to modeling systems. structural equation models (e.g., [2]) are similar to conceptual process models and may be incorporated into model-oriented research reports. 
this fitting of models to data might also be done with a regression analysis, which results in a regression equation, although this does not explicitly address the processes involved. a third related approach is simulation. while in some cases simulation is used simply to develop a visual animation of a system or phenomenon without full consideration for the underlying processes, in other cases those underlying processes are carefully modeled. simulation can employ a broad range of specific models such as neural networks and autonomous agents. data-driven science: there has been considerable discussion of data-driven science, in which data collection is separated from the research question. typically, a large data set is made available as a resource for whatever questions may be presented. descriptions from this research paradigm are easily handled within the model-oriented research report; there would simply be no conceptual process models to be tested and no manipulations in the research design. non-scientific research procedures: many activities beyond science that combine model-driven expectations, complex procedures, and rich data sets could be described in model-oriented research reports. for instance, medical tests are often conducted for the doctor to investigate specific hypotheses in a diagnosis about the patient's health. presumably, the tests reflect instantiations of process models that the physician believes may be relevant.   conclusion recently, there has been a great deal of emphasis on systematizing the description, organization, and preservation of data in scientific research. we extend that effort to systematizing the reporting of the entire research process. this outline of a model-oriented research report and research report library suggests that this approach deserves further exploration. developing such an approach would ameliorate many of the limitations of current text-based research reports.   a. technical appendix developing the parameters of conceptual process models is the most complex part of the proposal for model-oriented research reports. there are several ways that the conceptual process models could be implemented. here, we explore one possible approach. at the top level, the model is similar to a natural language statement with subject, verb, and object, the basis of a process. the model is primarily a discrete (qualitative) rather than a continuous (quantitative) model, which is consistent with the extensive literature on qualitative models of cognition [6]. a.1. entities and systems in this approach, entities and systems are conceptual models (rather than conceptual process models). we define entity descriptions as frames having properties and dimensions. properties have a single value. some properties are defining properties. for instance, gold atoms have a specific number of protons; that number is a defining property. dimensions allow sets of states. for instance, gold atom electrons have quantum levels, with quantum levels generally being the dimension and the specific level an electron is in being a state. an instance of a gold atom will have specific properties (e.g., location) and dimensions (e.g., being in a specific quantum level). entities can be related to other entities in a variety of ways. for instance, both gold atoms and a collection of gold atoms would be entities.
collections of gold atoms will have a dimension of phase with possible states of solid, liquid, or vapor. the full entity description for collections of atoms would include a phase transition table. isotopes and isomers may be seen as distinct entities and as sub-divisions of higher-level entities. entities can be organized into classes with associated properties. for example, the periodic table is a two-way classification that predicts atomic number. other classification systems such as biological classifications are hierarchical and in those, properties may be inherited and/or more abstract. we define systems as combinations of entities whose internal structure is not easily dissected. for instance, in pasteur's experiment, when microbes reach the broth, it becomes a contaminated broth. the distinction between entities and systems can be subtle and is a matter of convention, community consensus, context, and editorial judgment. a system is sometimes a weak form of entity that also includes constituent entities. in some cases, for example, we can treat an atom as a distinct entity but in other cases, it is important to consider its constituents, in which case we focus on it as a system. in other cases, the system is a collection of related entities but is not itself considered a distinct entity. a black hole and its accretion disk form a natural system but not an entity. in still other cases, systems can be ad hoc, such as random mixtures of chemical solutions. a system can also be defined just by framing two or more apparently unconnected entities. for example, the gravitational interactions in the earth-jupiter system could be studied but we do not normally focus on that pairing. in such studies the components are often treated as entities. in some notable cases, such as relativity, the frame of reference itself can be re-conceptualized. a.2. interactions and processes interactions are transitions involving entities and systems. they may include state transitions for a single entity or the creation, association, disassociation, or destruction of entity instances. for some interactions, there may be a functional relationship that determines the outcome. for instance, hydrogen and oxygen molecules interact to form water as well as release energy in an interaction that follows certain ratios. there may also be constraints on the entities involved for the interaction to occur. for instance, the entities may need to be in a certain state (gaseous when making water) and a spark (in the case of making water) or some other trigger or catalyst is needed for the interaction to occur. or, as with the reaction of hydrogen and oxygen, the interaction may release energy. the effect of that released energy would generally be considered in the broader system in which the interaction takes place. there would be many cases where we explore systems within systems. just as types of entities can be related to each other, there are also families of interactions. some of these are interactions which are common across a set of related entities. for instance, in chemistry there is a typology of interactions such as oxidation-reduction reactions or substitution reactions. processes are chains of interactions. the process of spoilage of a broth might be said to be composed of two interactions — first, the contamination of the broth by microbes, and, second, the growth of the microbe population spoiling the broth. 
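the appendix vocabulary of entities, systems, interactions, and processes can be made concrete with a small sketch. the field names and simplified values below are invented, and the spoilage example simply restates, in structured form, the two interactions just described:

    # minimal sketch of entity, system, interaction, and process descriptions
    # (field names and simplified values are invented for illustration)
    gold_atom = {
        "kind": "entity",
        "properties": {"protons": 79},                              # defining property, single value
        "dimensions": {"quantum_level": ["ground", "excited"]},     # dimension -> possible states
    }

    gold_collection = {
        "kind": "entity",
        "constituents": "gold_atom",
        "dimensions": {"phase": ["solid", "liquid", "vapor"]},
    }

    contaminated_broth = {"kind": "system", "constituents": ["broth", "microbes"]}

    # interactions are transitions over entities/systems; a process is a chain of them
    contamination = {
        "kind": "interaction",
        "inputs": ["broth", "microbes"],
        "outputs": ["contaminated_broth"],
        "constraints": ["dust must come into contact with the broth"],
    }
    growth = {"kind": "interaction", "inputs": ["contaminated_broth"], "outputs": ["spoiled_broth"]}

    spoilage_process = {"kind": "process", "chain": [contamination, growth]}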
as with the distinction between entities and systems, there is considerable flexibility in the distinction between interactions and processes. many interactions could be viewed as a process and decomposed into more granular levels of interactions. for example, an infection causes a disease, but the mechanics of infection may be decomposed into lower-level processes and interactions. generally, as complex phenomena are explored more fully, their component processes are increasingly refined. some behavior of systems could be described with methods such as those familiar from object-oriented data models. for example, animals of a species may have typical instinctive behaviors that could be included as a part of the dynamics of a system. in addition, interactions and processes may be modeled with the unified modeling language (uml), which is widely used for modeling discrete processes in information systems. in particular, the activity diagrams within uml provide workflow models that would be particularly useful. moreover, components of systems may have complex behavior, and formalisms from uml can be used to describe that behavior. in fact, uml has been proposed as a conceptual foundation for systems biology [9]. while uml can be useful, there are many complex systems for which it is inadequate. these systems may be modeled with techniques such as simulations. there are a variety of simulation techniques, ranging from autonomous agents to numerical analysis of coupled differential equations. for some systems that show learning, such as natural selection in the evolution of species, the underlying representation may be modified.   acknowledgment jordon steele and michael zarro provided useful comments on earlier drafts of this paper.   references [1] ad hoc working group for critical appraisal of the medical literature, 1987, a proposal for more informative abstracts of clinical articles. annals of internal medicine, 106, 598-604. [2] aickin, m., 2002, causal analysis in biomedicine and epidemiology: based on minimal sufficient causation. marcel dekker inc., new york. [3] allen, r. b., 2007, highly structured scientific publications. acm/ieee joint conference on digital libraries, 472. doi:10.1145/1255175.1255271 [4] allen, r. b., wu, y. j., & jun, l., 2005, interactive causal schematics for qualitative scientific explanations. icadl (lncs 3815/2005), 411-415. doi:10.1007/11599517_50 [5] campbell, d. t. & stanley, j. c., 1966, experimental and quasi-experimental designs. rand mcnally, chicago. [6] forbus, k. d., 1996, qualitative reasoning. crc handbook of computer science and engineering. crc press. [7] hunter, j., 2006, scientific models — a user-oriented approach to the integration of scientific data and digital libraries. victorian association for library automation. [8] pasteur, l., 1879, studies on fermentation: the diseases of beer, their causes, and the means of preventing them. translated by f. faulkner and d. c. robb. macmillan, london. [9] roux-rouquie, m., & schuch da rosa, d., 2006, ten top reasons for systems biology to get into model-driven engineering. icse. [10] shum, s. b., motta, e., & domingue, j., 2000, scholonto: an ontology-based digital library server for research documents and discourse. international journal on digital libraries, 3(3), 237-248. doi:10.1007/s007990000034 [11] swales, j. m., 1990, genre analysis: english in academic and research settings. cambridge university press, cambridge, uk. [12] taylor, i. j., deelman, e., gannon, d.
b., & shields, m., 2007, workflows for e-science: scientific workflows for grids. springer, london. [13] thagard, p., 1992, conceptual revolutions. princeton university press, princeton, nj. [14] trigg, r., 1983, a network-based approach to text handling for the online scientific community. phd dissertation, department of computer science, university of maryland.   about the author robert b. allen was a pioneer in the development of recommender systems. recently, he has explored novel access techniques for digital history, such as text extraction from collections of digitized historical newspapers and interactive timeline interfaces. dr. allen is at the ischool at drexel university. he has prepared a comprehensive online informatics textbook, "information: a fundamental construct". before joining drexel, he was at the university of maryland, a senior scientist at bellcore, and a member of technical staff at bell laboratories. his ph.d. was in social and cognitive experimental psychology from ucsd. dr. allen was editor in chief of the acm transactions on information systems and chair of the acm publications board.   copyright © 2011 robert b. allen   the open video digital library d-lib magazine, december 2002, volume 8, number 12, issn 1082-9873. gary marchionini and gary geisler, university of north carolina at chapel hill, (march, geisg)@ils.unc.edu abstract in this article we describe the primary goals of the open video digital library (ovdl), its evolution and current status. we provide overviews of the ovdl user interface research and user studies we have conducted with it, and we outline our plans for future open video-related activities. introduction digital video presents important challenges to digital librarians. the challenges are due to file sizes, the temporal nature of the medium, and the lack of bibliographic methods that leverage non-textual features. there are increasing volumes of digital video available from traditional producers (e.g., news and entertainment media, educational and government institutions) and huge amounts created by individuals with access to inexpensive digital cameras and editing tools who collect and exchange video with family and friends. librarians have traditionally indexed image and video collections with textual bibliographic data that mainly parallels text works, e.g., producer, date, run time. (see rasmussen, and also fidel, for discussions of image indexing [rasmussen, 1997; fidel, 1997].) enser and sandom have assessed video indexing in film archives; their work reinforces the use of human-assigned bibliographic data while pointing out the limitations of automated content-based indexing [enser & sandom, 2002]. nonetheless, there is considerable effort to find ways to automate indexing in video dls. the increasing volumes of content and increasing numbers of users with growing expectations have stimulated the digital library community to take up the challenges of acquiring, storing, indexing, retrieving, preserving, and transferring video content. to date, the most prominent and comprehensive effort to build a digital library (dl) of digital video is the informedia project [christel et al., 1997, 1998; smith & kanade, 1998; wactlar et al., 1999; witbrock & hauptmann, 1998].
informedia uses a variety of visual features (e.g., color, faces, text superimpositions) as well as textual features (e.g., speech-to-text transcripts) to make a large volume of digital video retrievable. the project has demonstrated the efficacy of many technical processes for organizing, searching, and scaling video dls. while there has been substantial research on particular aspects of digital video retrieval, e.g., segmentation and feature detection (see chang et al. for an overview [chang et al., 1997]), informedia addressed many of the integration challenges in incorporating different research products into a demonstration dl. other important projects include ibm's cuevideo, which has been integrating a variety of segmentation, indexing, and user interface techniques developed in the almaden and watson labs [ponceleon et al., 1999], and the digital video multimedia group at columbia [1], which has been engaged in several streams of work including efforts to automate video summaries [chang et al., 1997]. the multimedia information retrieval group at dublin city university has been developing the físchlár project, which provides broadcast video for the university community. this group has developed innovative user interfaces for the video repository [lee & smeaton, 2002]. the european union's echo project [2] is developing archives of historical footage from different european countries and has focused on creating metadata schemes and cross-language access techniques. each of these large-scale projects draws upon substantial efforts by the engineering communities devoted to finding effective signal-processing techniques for digital video. the open video digital library [3] aims to capitalize on advances in engineering as well as in library and information science to create usable services for the research and educational communities. in this article we describe the primary goals of the open video digital library, its evolution and current status. we provide overviews of the user interface research and user studies we have conducted with it and outline our plans for future open video-related activities. open video goals the open video digital library (ovdl) is motivated by several theoretical and practical goals. theoretical goals the first theoretical goal is to instantiate and evaluate the sharium concept for digital libraries [marchionini, 1999]. the idea is to directly leverage human time, effort, and resources, something libraries have mostly done indirectly through government and non-government support, by inviting contributions and direct patron participation in digital libraries (dls). the internet makes it possible to get people directly involved, which is especially important in dls, where physical space, signage, and reference support are not available. the sharium concept adds disintermediation to the many existing layers of intermediation, thus adding a depth that is missing in many dls and impractical to achieve in physical libraries. one example of this is what jones calls "contributor-run" dls [jones, 2001], using the linux archive as a case study. open video will provide an environment for studying ways to involve patrons more actively in dls; the open video special collections page, which highlights and facilitates user access to significant contributions, is a first step in this direction [geisler et al., 2002]. the second theoretical goal is to understand browsing and searching in electronic environments.
unlike physical libraries, where the catalog and indexing aids are clearly distinct from the texts, tapes, and other primary media, dls provide both pointer information and primary materials in the same interface. this is convenient for users but also challenging, because the lack of clear demarcation between pointer and primary information can be confusing or overwhelming. in electronic environments, analytical searching, browsing, and reading/viewing are tightly coupled and lead to new information-seeking strategies and behaviors [marchionini, 1995]. video provides particularly interesting opportunities to study the tightly coupled, interactive information-seeking process in multiple channels. the third theoretical goal is to instantiate and evaluate a framework for dl interfaces. user interfaces are the analogs to library space and librarian services such as reference in physical libraries. thus, user interfaces are crucial to the success of all dls. we have been developing a framework for interactive user interfaces that gives people multiple views of information spaces and allows them to quickly and easily change these views to search, browse, view and reflect on the process. the ovdl provides a test bed for developing and evaluating an interaction framework we call "agileviews" [marchionini et al., 2000; geisler, in press]. agileviews builds upon dynamic query [shneiderman, 1994] and other interactive, graphical interfaces [card et al., 1991]. at present, we are creating and evaluating previews, overviews, reviews (or history views), peripheral views, and shared views. practical goals an important practical goal of the ovdl is to build an open source digital video test bed for dl research and development. currently, each video research team must acquire and manage its own video content. because content characteristics are important to test results (e.g., visual quality can influence feature detection algorithms), an open source test bed will be useful for video retrieval researchers in two ways: first, to provide easy access to content and allow them to focus on their research goals, and second, to make it possible to compare results with other approaches using the same content. in our case, the corpus provides us with the content for our user interface studies related to the theoretical goals above. the educational community will benefit from an open source dl available to instructors and students for in-class, at-a-distance, group, and individual uses. the corpus also provides opportunities for educational researchers to study teaching and learning with video content. another practical goal is to provide a training ground for information and library science students to gain skill in using and operating dls. just as students may intern in physical libraries or corporate information centers to gain practical experience, students can intern in the ovdl or incorporate its resources in their coursework and independent studies. for example, stachowicz used the repository to compare results of manual indexing of video based on either viewing the full video or viewing surrogates only [stachowicz, 2002]. other students have worked to build, assemble, and evaluate user interfaces; develop project tools for managing digital video and extracting keyframes; create metadata, markup, and database schemes; write middleware software; and develop programs for surrogate generation. finally, the ovdl can serve the practical needs of the public for an open source repository of digital video.
evolution and current status with these theoretical and practical goals in mind, the project has been evolving since 1996 when we worked with discovery channel video to provide multimedia materials to middle school science and social studies teachers in the baltimore learning community project [marchionini et al., 1997]. in that work, we indexed short segments of documentaries and integrated them with images, texts, and educational www sites in a dynamic query user interface that provided slide show and storyboard previews for the video as well as provided a lesson-plan construction tool. a number of user studies were conducted to assess the effectiveness of different implementations of the video surrogates [ding et al., 1997; komlodi & marchionini, 1998; tse et al., 1998]. in 1999, the task of creating a publicly accessible digital video repository and test bed began in earnest. a framework for creating a digital video repository was developed [slaughter et al., 2000], and the usefulness of the repository was discussed at both the sigir workshop on video retrieval in berkeley in august and at a video retrieval symposium hosted in chapel hill in october. with support from the unc-ch provost and the school of information and library science, the open video project was formally launched in the late fall of 1999. the initial public version of the ovdl consisted of 120 files in mpeg-1 format. these files were segments from 8 different video programs obtained from u.s. government agencies such as the national archives and nasa, representing about 12 hours of content. by spring 2000, contributions from carnegie mellon's informedia project, the howard hughes medical institute, and the prelinger archives helped grow the collection to 225 files and more than 40 hours of content. additional contributions in 2001 from informedia, the internet archive, and other sources increased the collection to about 1500 files and broadened the range of available file formats to include mpeg-2 and mpeg-4 in addition to mpeg-1. at the time of writing this article (fall 2002), the ovdl provides 1800 video files (more than 0.5 terabytes of content), representing 460 hours of video footage. table 1 describes the basic characteristics of the video in the current collection.
table 1. characteristics of current ovdl content (number of segments per category)
genre: documentary 446; educational 36; ephemeral 1132; historical 184; lecture 16
duration: less than 1 minute 185; 1 to 2 minutes 238; 2 to 5 minutes 275; 5 to 10 minutes 239; more than 10 minutes 879
color: in color 811; in black & white 1003
sound: with sound 1464; silent 350
at present, the ovdl includes mpeg-1, mpeg-2, mpeg-4, and quicktime files. we believe providing digital files rather than streaming video is important to the goals of the project and better serves the research and education communities. our partnerships with contributors have been substantial and are expanding. the informedia project has contributed about 30 hours of its government documentary video, including extensive metadata and transcripts. the university of maryland's human-computer interaction laboratory has contributed its files of videos from 20 years of annual symposia. rick prelinger contributed several of his ephemeral films early on in the project, and today we point to most of his entire archive, which is available in the internet archive [4].
we have a cooperative agreement with nasa for 16 additional programs and are working with professors around the world to add selected videos they can use in their classes. for the past year, the ovdl web site has averaged at least 2000 unique visitors each month. visitors come in nearly equal numbers from the .edu, .com, and .net domains, with a substantial percentage of visitors accessing the collection from outside the u.s. system architecture as shown in figure 1, the heart of the ovdl is the mysql database of metadata and an agileviews interface module currently implemented as php middleware between the database and browse, search, and contribution services. the database schema has been revised over time, from the initial schema of one table with a dozen attributes to the current schema that includes about 15 tables and 100 attributes (including all primary and foreign keys) [5]. the current schema being used is dublin core compliant, and ovdl is an open archives initiative data provider [6]. figure 1. open video architecture the browse interface presents access clusters by genres (documentaries, educational, lectures, ephemerals, historical), duration (less than a minute, 1-2 minutes, 2-5 minutes, 5-10 minutes, and more than 10 minutes), color (color or black and white), sound (with sound or silent), and contributing organization (e.g., cmu, internet archive, etc.). for each category, posting data is given for the number of segments in that category. this layout provides an overview of the entire collection as well as browse access. browse facilities are available at all levels of the interface. the search interface supports three kinds of search. attribute search provides pull-down menus or radio buttons for key attributes such as genre or producer. this offers a quick way to partition the database into videos with specific characteristics of interest. two types of text-based search options are also available. an input field is provided for user-entered queries matched on the full text search of bibliographic records as well as transcripts for those videos with transcripts available. a pull-down menu of keywords that can be used as search criteria is also available. once the user has partitioned the database through top-level search or browse tools, increasingly detailed overviews for partitions and previews for specific segments become available. techniques for "looking ahead" before moving to a more detailed level or beginning to download a video file are an important part of the agileviews interface framework. these "look aheads" are particularly crucial to practical work with video libraries containing many very large files. figure 2, for example, shows the results page displayed when a user selects the "historical" genre from the browse page. when the user "hovers" (places the mouse) over the details icon for a video, a brief description of that video appears in a pop-up box. clicking on the details icon yields the full bibliographic record including a tab option for the bibliographic record for the full video to which the segment belongs and a tab to a visual preview for the segment (shown in figure 3). at any of these points in the interaction, the user can begin to download the complete segment. figure 2. video description displayed in pop-up figure 3. video preview page backend "technical services" operations are depicted in figure 1. 
backend "technical services" operations are depicted in figure 1. we aim to use as many open source tools developed by others as we can in order to focus on video curation and user interface development and evaluation. digitization is done in our interaction design lab, working from ntsc or betasp tapes. some of the newer content is arriving on digital tapes. at present, we manually segment the tapes, either before digitizing or in real time. although excellent segmentation algorithms exist, manual segmentation gives our students first-hand experience with the content. in some cases, instructors provide segmentation parameters specific to class plans for tapes they want to digitize. once digitized, segments are saved with systematic names on a disk array. segmentation and digitization have been done on wintel systems, but we are shifting some of this to a macintosh platform. we have used a variety of techniques for keyframe extraction. in most cases, we used the university of maryland's merit software suite to extract keyframes [kobla et al., 1998]. this software extracts keyframes without decompressing the mpeg-1 files. we have also ported merit to both the linux os and mac os x [7] to increase our flexibility in using it to extract keyframes. in other cases, we use our own scripts and java programs [8] to extract keyframes from video files using nth-frame algorithms. we have also done a small amount of manual keyframe extraction. after keyframes have been extracted, we manually prune the keyframes and identify representative poster frames using our own web-based applications. at present, keyword identification and implementation as text or audio is mainly a manual process. for videos that have keywords as part of the metadata record, we use those terms. in other cases, we have been manually assigning keywords with a two-person team, one to identify and one to validate. once assigned, keywords are automatically added to the pull-down menu in the keyword search facility on the web site. to produce audio implementations of keywords, a speech synthesizer is used to ensure audio consistency (although some terms must be rerecorded or adjusted to improve accuracy). the ovdl video files themselves are distributed on various servers in the internet2 distributed storage initiative network [9], the sils interaction design laboratory [10], and other sites such as the internet archive and the library of congress. files can either be contributed directly for storage and management in the ovdl, or providers can maintain their own files while ovdl simply maintains the metadata and hyperlinks to the files. in addition to the production system available at the ovdl web site [3], we have developed several prototype interfaces that incorporate more extensive agileviews, and we are continuing to develop new interfaces based upon our research and user studies.
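the nth-frame keyframe extraction mentioned above can be pictured with a short sketch. this is only an illustration: it uses opencv as a stand-in and is not the merit suite or the project's own scripts and java programs:

    # sketch of nth-frame keyframe extraction (opencv as a stand-in, not merit)
    import cv2

    def extract_keyframes(video_path, n, out_prefix="keyframe"):
        cap = cv2.VideoCapture(video_path)
        saved = []
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % n == 0:                      # keep every nth frame
                name = f"{out_prefix}_{index:06d}.jpg"
                cv2.imwrite(name, frame)
                saved.append(name)
            index += 1
        cap.release()
        return saved    # candidates to be pruned manually into poster frames

unlike merit, this sketch decodes every frame rather than working on the compressed mpeg-1 stream.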
user interface research the nsf interactive systems program supported a grant to conduct user interface studies for the 2001-2004 period. our primary effort is devoted to the creation and evaluation of highly interactive user interfaces that allow people to select representations and control them quickly and easily to achieve their information-seeking needs. the design framework guiding this effort is the concept of agileviews. we aim to give people several classes of views: overviews of collections of video segments; previews of specific video objects; reviews or history views of past searches or community activities; peripheral views of related objects; and shared views instantiated by others active in the sharium environment [marchionini et al., 2000; geisler, in press]. see figure 4 and figure 5 for examples of this research direction.
[figure 4: an agileviews storyboard preview]
[figure 5: an agileviews shared view showing user recommendations]
for the ovdl, we have focused on developing surrogates to help people quickly understand video gist (which includes perceptual and media-specific senses of the content) and thus gain quick overviews or previews allowing rapid decision making about whether to obtain more detailed surrogates or the full video segment. we have been working with three types of surrogates and variations within them: slide shows, storyboards, and fast forwards. slide shows display keyframes at rapid intervals (e.g., 250 ms). they minimize screen real estate and thus avoid window management loads for users. previous studies demonstrate that people are able to take advantage of slide shows at very high rates (less than 100 ms) but prefer the control provided by other types of surrogates [ding et al., 1997; komlodi & marchionini, 1998]. storyboards display an array of keyframes. they consume screen real estate and, even though they tend to require more user time to perform tasks due to repeated visual scanning, most users have shown a preference for the storyboards. for both slide shows and storyboards, we have experimented with textual and audio keywords added to the keyframes. fast forwards (implemented by choosing every nth frame) provide some sense of motion and may make it easier for people to detect narrative lines. our studies suggest that people tend to like fast forwards [wildemuth et al., 2002; wildemuth et al., in review].
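a fast-forward surrogate of the kind just described can be produced by keeping every nth frame of a segment and playing the result at the normal frame rate. the sketch below is illustrative only; it uses opencv rather than the project's vast tool, and the codec and output name are arbitrary:

    # sketch of a fast-forward surrogate built from every nth frame
    # (opencv as a stand-in; not the project's vast tool)
    import cv2

    def fast_forward(video_path, n, out_path="fastforward.avi"):
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"MJPG"), fps, (width, height))
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % n == 0:          # keep every nth frame; played back at the
                writer.write(frame)     # original rate, the clip runs n times faster
            index += 1
        cap.release()
        writer.release()
        return out_path

the choice of n trades compactness of the surrogate against how much of the narrative line remains detectable, which is the kind of boundary condition the user studies below are meant to establish.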
a study currently underway uses eye-tracking to determine how people use displays of search results that offer visual cues (a poster frame plus several keyframes) and textual cues in the results lists. another study is planned to compare inline and pop-up displays of poster frames in results lists. thus, the study goals are both to inform designs for the ovdl and other digital video interfaces and to develop metrics for assessing video retrieval behavior.

figure 6. user study framework

plans and future directions

there are several threads of work planned for the coming months. in addition to the ongoing user studies that inform the agileviews framework and interfaces for dls, we will be looking for ways to make patron contributions more automatic. at present, contributors must provide some minimal set of metadata, and we work with them manually to ensure their contributions are appropriate and properly distributed. automating the process will require more than simple forms (early on we had an upload form with a few required metadata fields but removed it as too simplistic an approach). we are debating whether to use the internet archive and ibiblio collection development policy approach, which allows anything to be harvested or contributed and removes items only if there are complaints, or a more traditional collection development policy that uses a review board or librarian filter to accept contributions.

we are developing a number of specialized tools as we gain more experience with the ovdl. eventually, these tools should be integrated into a digital librarian toolkit. we endeavor to use open source tools whenever possible, and the programs we develop are licensed under the gpl. one tool, the interactive shared educational environment (isee), provides users with facilities to collaboratively and synchronously study video over the internet [mu & marchionini, 2002]. the environment provides a private or shared video window, a text chat window, a shared web browser window, and a multicast video window, as well as tools for managing the communication process. it is being tested in three classes at unc-ch in the fall 2002 semester. xiangming mu has also developed the vast tool for selecting nth frames of video and converting the resulting "fast forwards" into quicktime surrogates (see [wildemuth et al., in review] for details and evaluation). richard gruss has developed scripts to crawl digital video websites and extract metadata. another tool is a peer-to-peer application for librarians to view and exchange video segments as part of the contribution and collection development processes; it has been prototyped by richard gruss and will be developed for use by our regular contributors and partners. meng yang has begun work on a metadata viewer tool that aims to aid librarians with manual indexing or metadata editing. the isee tool is well along and has been demonstrated at meetings such as the internet2 conference in the summer of 2002. these and other tools continue to evolve, and, over time, we hope to create an integrated toolkit that can be shared with other dls.

finally, we are concerned with longitudinal evaluation of dls. our perspective is to use a multi-faceted evaluation approach [marchionini, 2001] that integrates different sources of evidence from human (e.g., user studies, transaction log analysis), technical (e.g., performance, costs, and design tradeoffs), and informational (e.g., indexing and metadata acceptance) facets.
conclusion

the ovdl is an ongoing project. we are learning by doing, and in this article we aim to share some of the practical experiences of building and maintaining a digital video library. we are driven by theoretical and practical goals and strive to leverage the synergy of working toward both ends. our user interaction research goals have been well served by the efforts to build the production system, confirming the possibility of useful interactions between theory and practice. finally, we hope others will use the library and contribute to it, and we hope the library will serve as a useful public resource and test bed for a variety of research questions.

acknowledgements

this work received partial support from nsf grant iis 0099638. thanks also are due to the other members of the open video project team: richard gruss, anthony hughes, xiangming mu, curtis webster, barbara wildemuth, todd wilkens, and meng yang.

notes

[1] the digital video multimedia group at columbia university.
[2] the european union's echo project.
[3] the open video digital library (ovdl).
[4] the internet archive.
[5] susan dennis and christina pattuelli developed the first dublin core compliant database scheme in the spring 2000 semester, and adam smith extended the scheme to its current form in the fall of 2000.
[6] michael nelson and gary geisler are responsible for making ovdl an oai data provider.
[7] richard gruss did these ports.
[8] xiangming mu and richard gruss developed these programs.
[9] the internet2 distributed storage initiative network.
[10] the sils interaction design laboratory.

references

[card] card, s., robertson, g., & mackinlay, j. (1991). the information visualizer: an information workspace. proceedings of the sigchi conference on human factors in computing systems (new orleans, april 27-may 2, 1991). 181-188.
[chang et al., 1997] chang, s., smith, j., meng, h., wang, h., & zhong, d. (1997). finding images/video in large archives. d-lib magazine, february 1997.
[christel et al., 1998] christel, m., smith, m., taylor, c.r., & winkler, d. (1998). evolving video skims into useful multimedia abstractions. proceedings of chi '98: human factors in computing systems (los angeles, april 18-23, 1998). 171-178.
[christel et al., 1997] christel, m., winkler, d., & taylor, c.r. (1997). improving access to a digital video library. paper presented at human-computer interaction: interact '97, the 6th ifip conference on human-computer interaction (sydney, australia, july 14-18, 1997).
[ding et al., 1997] ding, w., marchionini, g., & tse, t. (1997). previewing video data: browsing key frames at high rates using a video slide show interface. proceedings of the international symposium on research, development, and practice in digital libraries (tsukuba, japan). 151-158.
[ding et al., 1999] ding, w., marchionini, g., & soergel, d. (1999). multimodal surrogates for video browsing. proceedings of digital libraries '99, the fourth annual acm conference on digital libraries (berkeley, ca, august 11-14, 1999). 85-93.
[enser & sandom] enser, p. & sandom, c. (2002). retrieval of archival moving imagery: cbir outside the frame? proceedings of the international conference on image and video retrieval (london, july 18-19, 2002). berlin: springer (lecture notes in computer science 2383). 206-214.
[fidel] fidel, r. (1997). the image retrieval task: implications for design and evaluation of image databases. new review of hypermedia and multimedia, vol. 3. 181-199.
[geisler] geisler, g. (in press). agileviews: a framework for creating more effective information seeking interfaces. unpublished doctoral dissertation, university of north carolina at chapel hill.
[geisler et al., 2002] geisler, g., giersch, s., mcarthur, d., & mcclelland, m. (2002). creating virtual collections in digital libraries: benefits and implementation issues. proceedings of the joint conference on digital libraries (jcdl 2002). 210-218.
[jain & vailaya] jain, a. & vailaya, a. (1996). image retrieval using color and shape. pattern recognition, 29(8), 1233-1244.
[jones] jones, p. (2001). open(source)ing the doors for contributor-run digital libraries. cacm, 44(5), 45-46.
[kobla et al., 1998] kobla, v., doermann, d., & faloutsos, c. (1998). developing high-level representations of video clips using videotrails. proceedings of the spie conference on storage and retrieval for image and video databases vi. 81-92.
[lee & smeaton] lee, h. & smeaton, a. (2002). designing the user-interface for the físchlár digital video library. journal of digital information, 2(4), special issue on interactivity in digital libraries, may 2002.
[komlodi & marchionini] komlodi, a. & marchionini, g. (1998). key frame preview techniques for video browsing. proceedings of acm dl '98 (pittsburgh, pa, june 24-26, 1998).
[marchionini, 1995] marchionini, g. (1995). information seeking in electronic environments. ny: cambridge u. press.
[marchionini, 1999] marchionini, g. (1999). augmenting library services: toward the sharium. proceedings of the international symposium on digital libraries 1999 (tsukuba, japan, september 28-29, 1999). 40-47.
[marchionini, 2001] marchionini, g. (2001). evaluating digital libraries: a longitudinal and multifaceted view. library trends, 49(2). 304-333.
[marchionini et al., 2000] marchionini, g., geisler, g., & brunk, b. (2000). agileviews: a human-centered framework for interfaces to information spaces. proceedings of the annual meeting of the american society for information science (chicago, nov. 12-16, 2000). 271-280.
[marchionini et al., 1998] marchionini, g., plaisant, c., & komlodi, a. (1998). interfaces and tools for the library of congress national digital library program. information processing & management, 34(5). 535-555.
[marchionini et al., 1997] marchionini, g., nolet, v., williams, h., ding, w., beale, j., rose, a., gordon, a., enomoto, e., & harbinson, l. (1997). content + connectivity => community: digital resources for a learning community. proceedings of acm digital libraries '97 (philadelphia, pa, july 23-26, 1997). 212-220.
[mu & marchionini] mu, x. & marchionini, g. (2002). interactive shared educational environment (isee): design, architecture, and user interface. sils technical report tr-2002-09, unc-chapel hill, april 2002. <http://ils.unc.edu/idl/isee/isee_tech_report.pdf>.
[ponceleon] ponceleon, d., amir, a., srinivasan, s., syeda-mahmood, t., & petkovic, d. (1999). cuevideo: automated multimedia indexing and retrieval. acm multimedia '99 (orlando, fl, oct. 1999). 199.
[rasmussen] rasmussen, e. (1997). indexing images. annual review of information science and technology, vol. 32. medford, nj: information today. 169-196.
[shneiderman] shneiderman, b. (1994). dynamic queries for visual information seeking. ieee software, 11(6). 70-77.
[slaughter] slaughter, l., marchionini, g., & geisler, g. (2000). open video: a framework for a test collection. journal of network and computer applications, special issue on network-based storage services. 219-245.
[smith] smith, m. & kanade, t. (1998). video skimming and characterization through the combination of image and language understanding. proceedings of the 1998 ieee international workshop on content-based access of image and video databases (bombay, india, january 3, 1998). los alamitos, ca: ieee. 61-70.
[stachowicz] stachowicz, c. (2002). the effectiveness of storyboard surrogates in the subject indexing of digital video. a master's paper for the m.s. in l.s. degree, april 2002.
[tse et al.] tse, t., marchionini, g., ding, w., slaughter, l., & komlodi, a. (1998). dynamic key frame presentation techniques for augmenting video browsing. proceedings of avi '98: advanced visual interfaces (l'aquila, italy, may 25-27, 1998). 185-194.
[wactlar et al.] wactlar, h., christel, m., gong, y., & hauptmann, a. (1999). lessons learned from building a terabyte digital video library. computer, 32(2). 66-73.
[wildemuth et al., 2002] wildemuth, b., marchionini, g., wilkens, t., yang, m., geisler, g., fowler, b., hughes, a., & mu, x. (2002). alternative surrogates for video objects in a digital library: users' perspectives on their relative usability. proceedings of the 6th european conference on research and advanced technology for digital libraries (ecdl 2002) (rome, september 16-18, 2002). berlin: springer. 493-507.
[wildemuth et al., in review] wildemuth, b., marchionini, g., wilkens, t., yang, m., geisler, g., hughes, a., & gruss, r. (in review). how fast is too fast? evaluating fast forward surrogates for digital video.
[witbrock & hauptmann] witbrock, m. & hauptmann, a. (1998). artificial intelligence techniques in a digital video library. journal of the american society for information science, 49(7). 619-632.

copyright © gary marchionini and gary geisler
doi: 10.1045/december2002-marchionini

d-lib magazine
september/october 2015
volume 21, number 9/10

taking control: identifying motivations for migrating library digital asset management systems

ayla stein
university of illinois at urbana-champaign
astein@illinois.edu

santi thompson
university of houston libraries
sathomp3@central.uh.edu

doi: 10.1045/september2015-stein

abstract

this paper analyzes and discusses results from "identifying motivations for dams migration: a survey," which traces institutions' motivations for migrating from one dams to another. using data from the survey, the researchers ask two questions: "what motivations prompted institutions to migrate from one dams to another?" and "in what directions are institutions moving?" the researchers find that respondents desire more local control over the library dams and, when faced with the decision to migrate, institutions are more often than not choosing open source software systems. the researchers conclude the paper by reviewing lessons learned from the research methodology and discussing future areas of exploration related to this study.
the findings of this study can inform future dams selection and development.

1 introduction

in the last two decades, digital asset management systems (dams) have become important tools for collecting, preserving, and disseminating digitized and born-digital content to library patrons. over time, libraries have started to re-assess their dams based on the changing needs of users, the increased expertise of library professionals, and the ever-growing creation of web-based technologies. as a result of this re-evaluation process, some libraries decide to migrate to a new dams solution. while anecdotal evidence for the purpose and outcomes of these migrations is available, no extensive study on the process and rationale for library dams migration exists in the professional literature. in the fall of 2014, the researchers of this study conducted a survey titled "identifying motivations for dams migration: a survey" in order to better understand why institutions are migrating from one system to another. the results from the survey were used to answer the following two research questions: what motivations prompt institutions to migrate from one dams to another? in what directions are institutions moving? for example, are institutions moving from proprietary systems to open source systems?

1.1 definitions

in the context of this paper, the researchers drew on the following terms:

proprietary: "...any software that is copyrighted and bears limits against use, distribution and modification that are imposed by its publisher, vendor or developer. proprietary software remains the property of its owner/creator and is used by end-users/organizations under predefined conditions"1
open source: "open-source software (oss) is computer software for which the source code and certain other rights normally reserved for copyright holders are provided under a software license that meets the open-source definition in the public domain...it is very often developed in a public, collaborative manner"2
home grown: "in house software development"3
digital asset management system (dams)4: "software that supports the ingest, description, tracking, discovery, retrieval, searching, and distribution of collections of digital objects"5

2 literature review

2.1 selecting initial dams

evaluating and choosing a digital asset management system can be a long, complex, and resource-intensive process. as librarians identified dams for their institutions, they began to document their methodology for others to reference and adapt. some of the earliest works addressing the selection of a dams focused on policy and planning considerations. h. frank cervone identified high-level planning strategies to make selecting a dams manageable for information professionals.6 other literature described how information professionals developed specific evaluation criteria for selecting a system. hoe-lian goh et al. created an instrument that selects a dams based on a numeric score derived from comparing prospective systems against twelve categories generated by the authors, including content management, metadata, preservation, and the user interface.7 deridder refined the dams evaluation process further by encouraging others to conduct a formal needs assessment.8
jennifer l. marill and edward c. luczak established criteria for both an initial assessment of systems and a more exhaustive examination of a limited number of final candidates.9 collectively, these studies highlighted different aspects of dams that resonated with librarians who were charged with creating, implementing, and managing digital libraries.

2.2 case studies of digital library migration

over a decade has passed since many institutions initially selected dams for their digital assets. in that time, libraries have refined their needs for these systems and, consequently, have migrated, or wanted to migrate, to another dams. although the literature on the migration process and the implications it has for the library profession is limited, in the following section we present several case studies of repository migration.

in one of the earliest case studies involving dams migration, indiana university (iu) moved their content from variations, their original, homegrown digital music library system, to a newly developed system called variations2 [10]. the rationale for migrating systems included several reasons: demand for additional media and document formats; expansion of the metadata; and the need to support new tools for "access, synchronization, and navigation."11 iu developed the first variations system in order to distribute music recordings over a network within the music library, and it was developed in-house because none of the commercial software available at the time was capable of meeting all of their needs.12 additionally, the projects were supported by grants from the national science foundation (nsf) and the national endowment for the humanities (neh). the migration from variations to variations2 is significant because it resulted in a system that not only provided access to specialized content but also provided tools that allowed users to actively interact with the content in more dynamic ways.13

in her presentation "migrating from oclc's digital archive to duracloud," lisa gregory shared the state library of north carolina's experience with dams migration.14 the state library of north carolina found the extensive reports, fixity checks, and virus scans offered by oclc's digital archive to be critical pieces of their preservation solution. however, they also identified several issues with the dams, including difficulties with finding and retrieving items and with upload requirements. as a result of these issues, they migrated to duracloud in 2012. while duracloud still presented barriers to dams administration (particularly when it came to searching content and overwriting data), librarians praised it for its user-friendly interface, robust reporting, and collaborative support community.
her presentation demonstrated the important role digital preservation plays in deciding to migrate from one dams to another and reiterated the need for preservation issues and standards to be incorporated into the tools and best practices librarians use when implementing a dams migration.15

the college of charleston libraries (cocl) manage the lowcountry digital library (lcdl), which is made up of partner institutions that contribute digitized materials about the lowcountry region.16 originally, the lcdl was built in contentdm.17 however, due to dissatisfaction with contentdm technical support, inaccurate search results, and license and maintenance fees, the cocl made the decision to find another solution.18 cocl assessed omeka and dspace as possible alternatives, but due to the unique nature of the lcdl, they determined that omeka would not be able to scale to the level lcdl required. dspace was deemed an acceptable plan b, but its limited visual-content display functionality, among other concerns, kept cocl from adopting it. the version of islandora available at the time was also deemed not robust enough for the lcdl's unique needs. determining that no out-of-the-box solution (open source or proprietary) would meet their needs, cocl decided to build a new system from disparate open source software programs: fedora commons (for storage), openwms (for ingest), blacklight (for a discovery layer), and drupal (for the web interface).19 cocl found that the new system was flexible enough to meet the needs of the consortial lcdl and improved upon the issues they experienced under contentdm.20 this case is significant because the vast majority of the work was undertaken by two people who happened to be involved with the lcdl and had some coding skills, showing that building a new digital asset management system for an entire consortium did not require full-time web developers.21

because the digital resources library department at texas tech university library identifies the main user community for their digital library content as external researchers rather than local ones, they rely heavily on traffic from search engines and external websites. they copied a collection that was already in contentdm into dspace and compared the two using google analytics. they found that the collection in contentdm did not appear on the first page of search engine results, while the collection in dspace consistently did. due to the dramatic difference in search engine optimization (seo) rankings, ttu libraries opted to migrate all of their digital library content to dspace.
these results are significant because they show the importance of search engine traffic to the discoverability of library digital collections.22

in 2010, archives new zealand (anz) was charged with developing and implementing a government digital archive.23 since anz's sister institution, the national library of new zealand (nlnz), had already put a substantial amount of work and resources into its own digital asset management system for digital preservation, called rosetta [24], anz decided to use that system as their final dams for preservation.25 in 2008, anz had implemented fedora commons as an interim solution, since planning for a "complete digital preservation system" was already in progress.26 in addition to building on previous work, anz wanted to migrate away from fedora commons because they found that "it had limited functionality to support the business processes involved in accepting and managing a digital archive."27 this case study is significant because it highlights an instance of an institution migrating from an open source dams to a commercial one, albeit one that was developed in partnership with ex libris.28

examples from other states show the important role collaboration plays in the dams migration process. the florida council of state university libraries (csul) digital initiatives subcommittee (disc) assessed institutional needs surrounding dams for digital collections and identified a single dams to share among the 10 academic university libraries that comprise its membership. the subcommittee concluded that libraries needed improved resources and skills, including more robust and scalable systems, to meet their digital library needs.29 to aid in the selection process, representatives from the different institutions created evaluation criteria based on the existing needs and projected uses of the membership. like other evaluation methods, csul explored issues involving architecture, content, metadata, ingest, search and retrieval, display and use, export, management, and budgets.30 in 2012, the committee recommended that the institutions select islandora because it had "the most robust architecture, supported by the largest number of developers and the largest user community, running on the most widely available open source platform."31 there are several reasons why this case is significant. first, it is a demonstration of a large consortial effort to evaluate and select a single system for system-wide use. the member libraries went from multiple dams (proprietary, open source, and homegrown) to a single open source solution. this is one of the first examples of content in multiple and disparate systems being combined under a consortial infrastructure.

another example of a multi-stakeholder digital library migration is the j. willard marriott digital library at the university of utah. the university of utah library dams review task force was charged by the technology services council to review and evaluate their current dams as well as others.32 the task force undertook a comprehensive evaluation of their current system and the other identified dams by soliciting input from users of both the j. willard marriott digital library and other partner libraries; comparing features and capabilities of their system to others and scoring each dams based on defined criteria; reviewing vendors and governing organizations by giving them a list of specific questions and evaluating them not only on the technical capabilities of the dams but also on the perceived enthusiasm of the vendors when responding to the questions; and finally performing a complete strengths, weaknesses, opportunities, threats (swot) analysis of contentdm, for both its current and future hosted iterations, and of the hydra project repository software.33
after their extensive review process, the university of utah library dams review task force decided to form a dedicated group to begin development of a repository built on the hydra project repository software architecture. this case is also significant for several reasons. it is a showcase of one of the most comprehensive and detailed documented evaluations, especially considering the use of prepared questions for dams vendors. the task force ultimately chose a solution that will require them to migrate from multiple proprietary dams to a single instance of an open source dams.

analysis of these seven case studies reveals interesting trends among institutions and their shift from one dams to another, particularly around their motivations for migrating, the results of the dams selection, and the methodologies used to select a new dams. the rationales for migrating systems are as varied as the institutions conducting the case studies. some organizations used their dissatisfaction around key functions and services as a way to determine criteria for evaluating new systems. others were driven by future needs, particularly a system's scalability and extensibility. the results of the case studies also begin to suggest a larger trend in the kind of platforms to which institutions are migrating. four out of the seven libraries transitioned from proprietary to open source platforms; a consortium also selected an open source solution for all of its members, resulting in some institutions also transitioning from proprietary to open source. anecdotally, these results suggest that institutions are increasingly looking towards open source systems as solutions for their dams needs.34

table 1: chart of migration rationale and movement
institution | initial dams | new dams | rationale for migration | movement
college of charleston/lowcountry digital library | contentdm | fedora, drupal, openwms, blacklight | dissatisfaction with: vendor technical support; inaccurate search results in dams; license and maintenance fees | proprietary to open source
state library of north carolina | oclc's digital archive | duracloud | dissatisfaction with: upload limits and requirements; item discoverability in dams. want to keep: report features; fixity checks; virus scans | proprietary to open source
florida council of state university libraries | contentdm, digitool, sobekcm | islandora | want increased: robustness; scalability | multiple proprietary & open source to single open source
texas tech university | contentdm | dspace | dissatisfaction with: search engine optimization (seo) rankings | proprietary to open source
indiana university | variations | variations2 | want increased: content type support; file format support; metadata expansion; new features and capabilities | home grown to home grown
archives new zealand | fedora | rosetta | dissatisfaction with: limited dams functionality | open source to proprietary
j. willard marriott digital library at the university of utah | contentdm | hydra | want increased: scalability; robustness; community support; dams customization | proprietary to open source

3 methodology

3.1 survey design and distribution

for the purposes of this study, the researchers analyzed data from their survey, titled "identifying motivations for dams migration: a survey". the survey was created and delivered with the qualtrics survey software, which utilized survey flow35 and skip logic36 functionality. researchers solicited participation from eligible institutions from july through september 2014. institutions were eligible if they met one of the following criteria:

had completed migration from the "old dams" to the "new dams"
were currently migrating from the "old dams" to the "new dams"
had selected a "new dams" but had not started the migration process

if a respondent indicated that their institution did not meet one of these qualifications, the survey ended without allowing the respondent to answer additional survey questions. the survey asked respondents to choose their top five motivations from the thirteen topics. respondents were then asked to prioritize their five selections in order from most important to least important. based on these rankings, the qualtrics survey software presented questions only in the topical areas respondents indicated, in the order they were ranked.37 since the complete survey has over 100 questions, the researchers used this method to reduce the overall time needed to complete it.

3.2 development of survey sections, topics, and questions

the survey introduction explained the scope and purpose of the survey, defined key terms, and outlined each section of the survey for respondents. the researchers stated that the purpose of the survey was to focus on "identifying libraries' motivations for transitioning from one digital asset management system (dams) to another, in order to provide access to primary source research materials." the scope emphasized that the survey did not focus on systems used exclusively as institutional repositories, which the researchers define as repositories that provide access to university scholarship. because existing data and case studies regarding dams migration were often limited to posters and powerpoint presentations as opposed to formal articles or research studies, the researchers elected to conduct a survey to answer their research questions. to generate the content for the survey, the researchers studied existing methods for initially selecting dams to identify key themes. drawing upon deridder,38 hoe-lian goh et al.,39 marill and luczak,40 and andro et al.,41 the researchers distinguished thirteen topic areas used to evaluate and select dams.

table 2: survey topics and descriptions
name of topic | description of topic
implementation & day-to-day costs | the software, hardware, and personnel costs of implementing the "new dams" and the software, hardware, maintenance, and personnel costs of operating the system on a day-to-day basis.
user administration | the management of user accounts, including adding, restricting, and removing accounts; levels of user accounts with varying permissions; user authentication methods such as ldap, shibboleth, oauth, etc.
organizational viability | the governing organization's business model, defined mandate, and budget.
technical support | the availability and quality of the "new dams's" technical documentation, how-to manuals, active developer and/or user communities, formal help desk support, customer service, and bug reporting.
system administration | the "new dams's" automation of tasks; system security; usage tracking and analytics; system performance and reliability; and use of common technologies, such as windows or linux server software.
extensibility | the ability to incorporate additional functionality and capabilities into the "new dams" via viewing and manipulating the system code base, apis, social media integration, or other measures.
information retrieval & access | the quality and relevancy of the "new dams's" search results, search engine optimization rankings, and browsing capabilities.
content management | collection content and administration in the "new dams", including file formats, ingest issues, scalability, and rights management information.
preservation | the integration of preservation strategies into the "new dams", including fixity verification and the creation of checksum values, backups, synchronization, and/or the generation of archival information packages (aips).
user interface customization | the user interface, including the ability to customize and brand the interface as well as to adapt, edit, and revise the design and features based on user and repository needs.
interoperability | the "new dams's" ability to export metadata into other dams and digital program environments. the "new dams" should support international and/or industry standards for interoperability, including oai-pmh, z39.50, and sru/srw protocols.
reputation | the number of institutions that have implemented the "new dams" and their satisfaction with it.
metadata standards | the "new dams's" support of established metadata standards, user-generated metadata, and linked data technologies.

the researchers crafted specific questions for each of these thirteen topics. the questions were designed to understand how important specific issues were to institutions as they were selecting a new system or migrating from the "old dams" to the "new dams". the researchers generated two types of questions for the body of the survey. most questions used a likert scale of 1 [not important], 2 [somewhat important], 3 [important], and 4 [very important], in order to gauge the degree to which specific issues impacted the decision to migrate and the new dams selection process. other questions asked respondents to select all the options that applied to a specific issue. these questions were intended to identify which specific tools, software, or standards were desired in the "new dams". the researchers also asked respondents demographic questions about their institutions. specifically, respondents were asked to identify:

where their institution was in the migration process [42]
what software they were using for their "old dams"
what software they were using or going to implement for the "new dams"
how long the decision making process took
what type of library the respondents worked for

the demographic section also had an optional question for respondents to disclose the name of the library and the parent institution. that information will not be disclosed in this paper.
table 3: respondent's type of library
type of library | n | %
academic | 30 | 61
research | 8 | 16
public | 4 | 8
special | 2 | 4
special collections/archives | 2 | 4
government | 2 | 4
academic library consortia | 1 | 2
museum | 0 | 0
total responses | 49 | 99 [43]

while the researchers would prefer to analyze and discuss the results of every question from the survey in this paper, due to time and space constraints this is not possible. the researchers used the top five categories identified by respondents to answer the first research question. they discuss these results by analyzing the mean, standard deviation, and variance of the likert scale questions. additionally, the researchers decided that the needs discussed in depth should not include "obvious" results. for example, all respondents rated "the ability to support descriptive metadata standards" as important or very important in the new dams. it is well known that descriptive metadata is mandatory for the discovery of resources in dams, so this question was not included for further analysis.

4 results

forty-nine respondents completed the survey. since the researchers solicited anonymous responses from listserv subscribers, they did not have the information needed to calculate a response rate. once initiated, the survey had a completion rate of 47%. the survey responses are organized into the following sections: background on the migration process, system movement, and factors for migration. the final section of the results has two subsections, 'top five reasons for migrating' and 'priority rankings'.

4.1 background on the migration process

as part of the demographics section, participants were asked at what stage of the migration process their institutions were at the time of the survey. the results are as follows:

table 4: respondent's status in the migration process
timeline response | n | %
respondents are currently migrating from the "old dams" to the "new dams". | 18 | 37
respondents have completed the migration process from the "old dams" to the "new dams". | 17 | 35
respondents have selected a "new dams" but have not started the migration process. | 12 | 24
respondents are in the process of selecting a new dams. | 1 | 2
respondents are moving data into a preservation system from the old system, but keeping both for a time. | 1 | 2
total responses | 49 | 100

almost 40% of all respondents were in the process of migrating from the "old dams" to the "new dams" at the time of the survey. thirty-five percent had completed the migration process, and 24% had selected a new system but were still preparing to migrate. the decision-making process for selecting the "new dams" took most respondents six months to one year to complete; thirteen respondents took one to two years, and twelve respondents took six months or less.

table 5: how long did the decision making process take to select the "new dams"?
response | n | %
6 months - 1 year | 24 | 49
1-2 years | 13 | 27
0-6 months | 12 | 24
other | 0 | 0
total responses | 49 | 100

4.2 system movement

migrating dams provides institutions the opportunity to transition from one type of repository (proprietary, open source, or home grown) to another. prior to migration, a slight majority of respondents (52%) used proprietary systems to administer their digital library environments, including digitool and contentdm. just over one-quarter (27%) of respondents used open source repositories before migrating. a handful of other respondents were using home grown software or "other" approaches. of the institutions that elected to migrate from their "old dams" to their "new dams," a majority indicated that they would be migrating to an open source platform (64%), primarily islandora, hydra/fedora, and dspace. nearly one in five respondents (19%) indicated that they were migrating to a proprietary dams. a smaller percentage of respondents were creating home grown dams solutions or selecting "other" options.

table 6: the "old dams" that the respondents were or are using
response | n | % | type of repository
digitool | 13 | 23 | proprietary
contentdm | 11 | 20 | proprietary
dspace | 9 | 16 | open source
other | 5 | 9 | other/na
fedora | 4 | 7 | open source
home grown | 3 | 5 | home grown
luna | 2 | 4 | proprietary
islandora | 1 | 2 | open source
digital commons | 1 | 2 | proprietary
eprints | 1 | 2 | open source
application extender | 1 | 2 | other/na
adobe bridge | 1 | 2 | other/na
filemaker pro custom | 1 | 2 | other/na
mimsy xg | 1 | 2 | proprietary
sitesearch (oclc) | 1 | 2 | proprietary
file system | 1 | 2 | other/na
hydra/fedora | 0 | 0 | open source
greenstone | 0 | 0 | open source
total responses | 56 [44] | 102 [45] |

table 7: "old dams" type of repository totals and percentages
type of repository | n | %
proprietary | 29 | 52
open source | 15 | 27
other/na | 9 | 16
home grown | 3 | 5
total responses | 56 | 100

table 8: the "new dams" that the respondents currently are or will be using
response | n | % | type of repository
islandora | 13 | 25 | open source
hydra/fedora | 6 | 12 | open source
dspace | 4 | 8 | open source
other | 4 | 8 | other/na
fedora | 3 | 6 | open source
sobekcm | 3 | 6 | open source
home grown | 3 | 6 | home grown
digital commons | 2 | 4 | proprietary
greenstone | 2 | 4 | open source
rosetta | 2 | 4 | proprietary
contentdm | 1 | 2 | proprietary
digitool | 1 | 2 | proprietary
ibase | 1 | 2 | proprietary
vital | 1 | 2 | proprietary
undecided | 1 | 2 | other/na
preservica | 1 | 2 | proprietary
luna | 1 | 2 | proprietary
blacklight | 1 | 2 | open source
nuxeo | 1 | 2 | open source
eprints | 0 | 0 | open source
total responses | 51 [46] | 101 [47] |

table 9: "new dams" type of repository totals and percentages
type of repository | n | %
open source | 33 | 65
proprietary | 10 | 20
other/na | 5 | 10
home grown | 3 | 6
total responses | 51 | 101 [48]
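the type-of-repository totals in tables 7 and 9 are simple tallies over the classified responses. the sketch below is a rough illustration of that bookkeeping, not the authors' analysis code; the category mapping is abbreviated and the example lists are invented.

```python
from collections import Counter

# illustrative classification of a few reported systems; the survey's full mapping is longer
CATEGORY = {
    "contentdm": "proprietary", "digitool": "proprietary", "rosetta": "proprietary",
    "dspace": "open source", "islandora": "open source", "hydra/fedora": "open source",
    "fedora": "open source", "home grown": "home grown",
}

def type_breakdown(systems):
    """tally responses per repository type and express each count as a share of the total."""
    counts = Counter(CATEGORY.get(name, "other/na") for name in systems)
    total = sum(counts.values())
    return {category: (n, round(100 * n / total)) for category, n in counts.items()}

# invented responses, not the survey data
old_dams = ["digitool", "contentdm", "dspace", "home grown", "adobe bridge"]
new_dams = ["islandora", "hydra/fedora", "dspace", "rosetta", "islandora"]
print(type_breakdown(old_dams))
print(type_breakdown(new_dams))
```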
4.3 factors for migration

the methodology section described the process participants went through to pick their top areas of concern. the following table depicts the results from that exercise. the topic areas are displayed in order of most chosen to least chosen. the number of respondents who chose the topic area at each priority rank is also displayed.

table 10: top areas of concern when selecting a "new dams" and priority ranking
response | n | % | rank #1 | rank #2 | rank #3 | rank #4 | rank #5
content management | 30 | 61 | 5 | 6 | 9 | 6 | 4
metadata standards | 25 | 51 | 2 | 5 | 5 | 4 | 9
extensibility | 24 | 49 | 6 | 5 | 8 | 4 | 1
preservation | 23 | 47 | 9 | 2 | 4 | 6 | 2
user interface customization | 23 | 47 | 3 | 7 | 3 | 4 | 6
technical support | 21 | 43 | 2 | 6 | 4 | 2 | 7
implementation & day-to-day costs | 20 | 41 | 10 | 4 | 2 | 1 | 3
interoperability | 19 | 39 | 3 | 1 | 5 | 4 | 6
information retrieval & access | 18 | 37 | 3 | 5 | 1 | 6 | 3
user administration | 13 | 27 | 2 | 2 | 5 | 3 | 1
system administration | 12 | 24 | 0 | 3 | 0 | 6 | 3
organizational viability | 9 | 18 | 4 | 3 | 1 | 0 | 1
reputation | 8 | 16 | 0 | 0 | 2 | 3 | 3

4.4 top five reasons for migrating

survey results indicated that the following topics were the highest priority for those migrating from an old to a new dams:

content management
metadata
extensibility
preservation / user interface customization
technical support

note that the preservation and user interface customization categories received the same number of responses; therefore, both will be included in the list of top five reasons for migration. this tie creates a total of six categories in the top five reasons for migration.
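the result tables that follow report the mean, standard deviation, and variance of responses on the 4-point likert scale. as a rough illustration of how such summaries are computed (the paper does not state whether sample or population formulas were used, so the sample versions below are an assumption), the calculation looks something like this:

```python
from statistics import mean, stdev, variance

def summarize_likert(responses):
    """summarize 4-point likert responses (1 = not important ... 4 = very important)."""
    return {
        "n": len(responses),
        "mean": round(mean(responses), 2),
        "sd": round(stdev(responses), 2),           # sample standard deviation
        "variance": round(variance(responses), 2),  # sample variance
    }

# invented responses for a single question, not actual survey data
print(summarize_likert([4, 4, 3, 4, 2, 4, 3, 4, 4, 3]))
```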
the questions in table 11 address content management issues. as a reminder, likert scale questions had responses ranging from 1 [not important] to 4 [very important].

table 11: survey questions related to content management
question | n | mean | sd | variance
the capacity of the "new dams" to contain increasing amounts of digital objects over time without negatively impacting performance. | 28 | 3.75 | 0.52 | 0.27
the ability to batch upload content into the "new dams". | 28 | 3.71 | 0.60 | 0.36
the ability to support various file formats. | 28 | 3.61 | 0.79 | 0.62
the ability to allow embargoes or restricted access in certain circumstances. | 28 | 3.57 | 0.69 | 0.48
the capacity to display both simple and complex digital objects. | 29 | 3.48 | 0.78 | 0.62
the ability to display rights and intellectual property information. | 28 | 3.46 | 0.69 | 0.48
the storage capacity to hold both simple and complex digital objects. | 28 | 3.46 | 0.79 | 0.63
the number of items the "new dams" can batch upload at one time. | 27 | 3.37 | 0.69 | 0.47
the size limit for ingesting content. | 28 | 2.86 | 1.01 | 1.02

tables 12 through 23 present responses to additional content related questions, including types of objects and file formats, metadata, preservation, identifiers, and support.

table 12: what types of objects did you desire the system to display?
response | n | %
manuscripts | 24 | 83
images | 24 | 83
maps | 23 | 79
books | 22 | 76
video content | 21 | 72
audio content | 21 | 72
spreadsheets | 11 | 40
graphs | 9 | 31
3d objects | 8 | 28
other | 5 | 17
research data | 2 | 7
gis | 1 | 3

table 13: what file formats did you desire the "new dams" to support?
response | n | %
pdf | 28 | 98
jpeg | 26 | 90
mp3 | 22 | 76
jpeg2000 | 21 | 72
tiff | 21 | 72
mp4 | 19 | 66
mov | 17 | 59
csv | 16 | 55
doc | 13 | 45
docx | 12 | 41
kml | 2 | 7
wav | 2 | 7
gis | 2 | 7
kmz | 1 | 3

table 14: survey results related to metadata
question | n | mean | sd | variance
the ability to support multiple metadata schema. | 22 | 3.68 | 0.57 | 0.32
the ability to support administrative, preservation, structural, and/or technical metadata standards. | 22 | 3.59 | 0.80 | 0.63
the ability to support local metadata standards and practices. | 22 | 3.32 | 0.95 | 0.89
the "new dams" supports linked data technologies. | 22 | 2.82 | 1.10 | 1.20
the ability to support user created metadata, such as tags or "folksonomies". | 22 | 2.59 | 1.05 | 1.11

table 15: what descriptive metadata standards/schema did you desire the "new dams" to support?
response | n | %
dublin core | 19 | 90
mods | 16 | 76
ead | 12 | 57
marc | 10 | 48
vra core | 7 | 33
pb core | 3 | 14
ddi | 3 | 14
all schema/schema-less | 3 | 14
gns | 1 | 5

table 15 shows that while dublin core was the most popular response, several other standards/schema had high responses, which suggests that future systems should support multiple descriptive schema. additionally, the researchers received several free text responses saying that dams should support all metadata schema or should be schema-less.

table 16: what administrative, preservation, structural, and/or technical metadata standards did you desire the "new dams" to support?
response | n | %
mets | 18 | 90
premis | 15 | 75
tei | 8 | 40
vra core | 5 | 25
mix | 2 | 10
pb core | 2 | 10

table 17: what linked data technologies did you desire the "new dams" to support?
response | n | %
rdf/xml | 16 | 89
json | 10 | 56
rich snippets/rich data | 2 | 11
other | 1 | 6

table 18: survey results related to extensibility
question | n | mean | sd | variance
institutions can create their own modules/plugins/widgets/etc. for the "new dams". | 21 | 3.67 | 0.58 | 0.33
the "new dams" has an available api. | 22 | 3.64 | 0.58 | 0.34
the code base is available for everyone to see and use (open source). | 22 | 3.55 | 0.60 | 0.35
the "new dams" supports digital object identifiers. | 22 | 3.23 | 0.97 | 0.95
the governing organization creates modules/plugins/widgets/etc. to fit the needs of your institution. | 22 | 3.00 | 0.76 | 0.57
the "new dams" supports personal digital identifiers. | 21 | 2.24 | 0.94 | 0.89
the "new dams" natively supports sharing to social media. | 21 | 2.14 | 1.01 | 1.03
the "new dams" authenticates with social media and other sso (single sign on) services (facebook, twitter, openid, gravatar, etc.). | 22 | 1.95 | 0.90 | 0.81

table 19: what digital object identifiers did you want the "new dams" to support?
response | n | %
doi | 17 | 61
ezid | 4 | 14
ark | 3 | 11
handle | 2 | 7
urn:nbn | 1 | 4
local identifiers | 1 | 4

table 20: what personal digital identifiers did you want the "new dams" to support?
response | n | %
orcid | 12 | 46
ark | 5 | 19
researcherid | 4 | 15
other | 3 | 12
mads authorities | 1 | 4
isni | 1 | 4

table 21: survey questions related to dams preservation features and functionality
question | n | mean | sd | variance
the ability to generate checksum values for ingested digital assets. | 20 | 3.55 | 0.76 | 0.58
the ability to perform fixity verification for ingested digital assets. | 19 | 3.53 | 0.77 | 0.60
the ability to assign unique identifiers for each aip [49]. | 19 | 3.32 | 1.00 | 1.01
the ability to support premis or local preservation metadata schema. | 20 | 3.30 | 0.98 | 0.96
the ability to produce aips. | 20 | 3.15 | 0.88 | 0.77
the ability to integrate with other digital preservation tools. | 20 | 3.15 | 0.88 | 0.77
the ability to synchronize content with other storage systems (including off site locations). | 20 | 3.10 | 0.91 | 0.83
the ability to support multiple copies of the repository, including dark and light (open and closed) instances. | 20 | 2.90 | 0.97 | 0.94

table 22: survey questions related to dams user interface customization
question | n | mean | sd | variance
the ability to change interface features of the "new dams" to fit local needs. | 20 | 3.80 | 0.41 | 0.17
the ability to custom brand the interface of the "new dams". | 20 | 3.75 | 0.55 | 0.30
the interface of the "new dams" supports responsive web design. | 20 | 3.35 | 0.93 | 0.87
the "new dams" can be accessed by mobile or tablet users. | 20 | 3.30 | 0.80 | 0.64
the ability of the "new dams" interface to meet national and/or international accessibility standards. | 20 | 3.20 | 0.89 | 0.80
the governing organization will do custom branding or feature selection for subscribing institutions. | 20 | 2.40 | 1.35 | 1.83

table 23: survey questions related to dams technical support
question | n | mean | sd | variance
technical documentation for the "new dams" is readily available. | 19 | 3.47 | 0.77 | 0.60
there is dedicated technical support available from the developer or from a third party. | 20 | 3.45 | 0.76 | 0.58
effectiveness of customer service support. | 20 | 3.35 | 0.75 | 0.56
promptness of customer service support. | 20 | 3.30 | 0.86 | 0.75
there is an active developer community around the "new dams". | 20 | 3.05 | 1.23 | 1.52
there is a ticket submission feature for reporting issues. | 20 | 2.90 | 0.85 | 0.73

4.5 priority rankings

results surrounding the ranking of the categories do not align neatly with the overall top five topics. some categories, while not receiving enough votes to enter the top five, had high priorities for the respondents who did select them (for example, implementation & day-to-day costs). the survey software used this information to determine the display order of topic areas and their questions. the researchers consider topics that fell outside of the top five to be out of scope for this article.
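several of the preservation questions in table 21 concern generating checksum values and performing fixity verification. the sketch below is a minimal illustration of what those operations involve, not a feature of any particular dams; sha-256 is an assumption (the survey does not prescribe an algorithm), and real systems typically record the checksum in preservation metadata such as premis.

```python
import hashlib

def checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """compute a checksum for a file, reading it in chunks so large av masters fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_fixity(path, recorded_checksum, algorithm="sha256"):
    """fixity verification: recompute the checksum and compare it with the value recorded at ingest."""
    return checksum(path, algorithm) == recorded_checksum

# example: record a checksum at ingest time, then verify it during a later audit
# stored = checksum("master_0001.tif")
# assert verify_fixity("master_0001.tif", stored)
```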
5 discussion

the researchers used a survey focused on discovering the motivations for migrating from one dams to another in order to answer two questions: what needs and/or factors prompt institutions to migrate? in what "direction" are institutions migrating? after analyzing the results, the researchers believe that an overarching need for self-autonomy and control drives organizations to migrate from one dams to another. because institutions prefer to define the dams they use on their own terms, it should come as no surprise that respondents to the survey are trending towards dams derived from open source software. in this section, the researchers use survey results to answer these two questions in depth; at the same time, they also discuss the implications of their research and identify limitations to this study.50

5.1 what needs and/or factors prompt institutions to migrate?

content management

the top content management need, as identified by survey respondents, is "the ability to support various file formats". the high average of 3.61, combined with the relatively low standard deviation and variance, indicates a consensus among survey respondents that support for various file formats is important in the new dams. these results suggest that, while preferred file format standards exist for preservation purposes,51 non-preservation dams that only support a limited set of file formats are not as useful for libraries. however, when these results are compared to the question "what file formats did you desire the 'new dams' to support?" (select all that apply), the most popular text and image file formats largely align with the sustainable recommendations, e.g. pdf, jpeg/jpeg2000, and tiff. for audio-visual formats, this is not the case. because preservation file formats have not yet been standardized for all digital av materials, the researchers could not include an array of preservation-quality or sustainable av file format options. mp3, mp4, and mov formats were all considered important for the new dams to support, which surprised the researchers considering their proprietary nature. these results may suggest that libraries are primarily concerned with providing access to av content in their digital library dams, and not necessarily placing a priority on the preservation of digital av files.

another top need identified in the content management topic is "the capacity to display both simple and complex digital objects." the high mean, combined with the relatively low standard deviation and variance scores, indicates that this is an important function for the new dams. these results suggest that libraries are increasingly hosting diverse types of content in their digital environments. when these results are compared to those of the question "what types of objects did you desire the system to display?" (select all that apply), the most popular objects were simple digital objects, with manuscripts, images, and maps all ranking above complex digital objects,52 such as books, video, and audio content. still other types of complex digital objects, including research data, gis data, spreadsheets, and 3d objects, failed to garner more than 40% interest from respondents.
these results also pertain to the scope of the survey, which focused on systems that provide access to primary source content, such as digitized special collections materials, and specifically excluded systems used exclusively as institutional repositories.53

the last point of interest in the content management section was not considered a top need by respondents. the question "the size limit for ingesting content" received an average of 2.86, a standard deviation of 1.01, and a variance of 1.02, meaning that most respondents did not consider this an important criterion for choosing a new dams. however, the high variance indicates that there is not a consensus among respondents. the researchers found the lack of interest in the size limit for ingesting content puzzling, considering that respondents indicated that the new dams should support, store, and display complex digital objects, including audiovisual materials, which are typically much larger than pdfs or still images. these results could suggest that institutions are not uploading large amounts of audiovisual content to their dams at a time, relegating the ingest size limit to a lower priority; that they are comfortable uploading larger files programmatically; or that respondents may not be considering the technical needs of larger and more complex content.

metadata

the researchers found the results pertaining to questions in the metadata section to be predominantly predictable.54 in spite of this, there are some illuminating trends. for example, the results for the question "the ability to support multiple metadata schema" show a distinct positive consensus for this need. this result is notable because the current generation of dams are built around one or two specific metadata schema: contentdm uses a dublin core variation; dspace has qualified dublin core as the default schema; islandora generates dublin core datastreams by default, with optional mods metadata for richer description; and so on. the argument can be made that the use of particular metadata schemas is heavily driven by the implemented systems. until recently, most systems locked libraries into dublin core, with its attendant advantages and disadvantages. additionally, the results from the question "what descriptive metadata standards/schema did you desire the 'new dams' to support?" suggest that libraries want richer and more comprehensive metadata capability. these results, combined with the general trend in movement from proprietary to open source, suggest that respondents seek greater metadata schema flexibility than is currently offered by most turnkey systems.

the responses to the survey question "the new dams supports linked data technologies" indicated a lack of consensus on whether linked data technologies were considered necessary for new dams. the lack of consensus reflects the present status of applied linked data technologies in the library world. until relatively recently, linked data was, and still often is, an abstract or intangible concept. while research, investigation, and infrastructure development on library linked data has been underway for several years, it was not until the release of fedora 4 [55] and kuali ole [56] that native linked data library systems became readily available. even between these two systems, only fedora 4 can function as a dams.57 the results showing the demand for multiple/all-schema support seem almost contradictory to the results regarding linked data, because systems that support linked data could (in theory) support multiple metadata schemas. the researchers believe that the lack of consensus regarding linked data is the product of confusion around general linked data knowledge and a dearth of affordable and reliable dams with working linked data support. there is still a significant amount of work to be accomplished before linked data technology is within reach of most libraries.
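as a concrete (if simplified) illustration of what "supports linked data technologies" means in practice, the sketch below describes a single digital object with dublin core terms as rdf and serializes it to turtle and rdf/xml. it uses the rdflib python library and an invented identifier; it is not drawn from fedora 4, kuali ole, or any respondent's system.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC, DCTERMS

# an invented identifier for one digital object
obj = URIRef("http://example.org/repository/objects/42")

g = Graph()
g.add((obj, DC.title, Literal("oral history interview, tape 3")))
g.add((obj, DC.creator, Literal("example digital library")))
g.add((obj, DCTERMS.isPartOf, URIRef("http://example.org/repository/collections/oral-histories")))

# the same triples can be serialized as turtle or as rdf/xml (the format respondents cited most)
print(g.serialize(format="turtle"))
print(g.serialize(format="xml"))
```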
responses to the question "the ability to support user-created metadata such as tags or folksonomies" also indicated a lack of consensus among survey respondents. the researchers suspect that the type of user-created metadata needed in dams has changed over time [58], and that research-oriented user-metadata features, like highlighting and annotating, would be rated more highly. this topic is an area of future investigation that the researchers hope to explore further with research data and scholarship repositories.

extensibility

the top needs identified by survey respondents in the extensibility section indicate that institutions want a larger degree of local control over their dams. for example, the top need identified by participants in this section is "institutions can create their own modules/plugins/widgets/etc. for the 'new dams'". the high mean of 3.67, combined with a low standard deviation of 0.58 and a variance of 0.33, indicates a high level of consensus among survey participants that it is very important for institutions to be able to create their own modules, plugins, and related functionality. similarly, survey responses demonstrate that it is also important for the "new dams" to have an available api. this need is closely followed by the third highest ranked criterion in extensibility, "the code base is available for everyone to see and use (open source)", which had a mean of 3.55, a standard deviation of 0.60, and a variance of 0.35, indicating unified support for open source software. this result, possibly more than any other in this section, directly reinforces the overall trend that organizations are demanding increased local control of their dams.

the last extensibility result to be discussed is "the governing organization creates modules/plugins/widgets/etc. to fit the needs of your institution". this need qualifies as important because it has a mean of 3.00, a standard deviation of 0.76, and a variance of 0.57. the result somewhat surprised the researchers because it seems to contradict the previous results: respondents still want the governing organization (or vendor) to develop and create modules, plugins, and related functionality. some libraries may not have the resources to hire or retain local developers. so while respondents may want additional control of the "new dams," this does not necessarily signal an end to governing organizations providing services or systems.

digital preservation

the high mean values for most of the digital preservation actions suggest that respondents desire the dams to execute preservation-related tasks.
results from the questions "the ability to generate checksum values for ingested digital assets," with a mean of 3.55, a standard deviation of 0.76, and a variance of 0.58, and "the ability to perform fixity verification for ingested digital assets," with a mean of 3.53, a standard deviation of 0.77, and a variance of 0.60, show that respondents thought the process of generating and verifying checksums through the dams was very important. these findings suggest that many information professionals are focused on creating a mechanism to ensure the integrity of digital objects.59 respondents viewed additional curatorial actions as important, including "the ability to produce aips," with a mean of 3.15, a standard deviation of 0.88, and a variance of 0.77, and "the ability to integrate with other digital preservation tools," with a mean of 3.15, a standard deviation of 0.88, and a variance of 0.77. compared to the results for digital preservation related questions on file formats and detailed technical metadata, these results indicate a disconnect between what respondents know they should be doing in theory (e.g. creating and preserving aips) and what they do on a daily basis (collecting audio content as mp3 files instead of wav files). this is surprising because most respondents ranked digital preservation as one of their top five priorities for migrating to a "new dams." other results related to digital preservation functions in a "new dams" were inconclusive. while some functionality had favorable response scores, the questions "the ability to synchronize content with other storage systems (including off site locations)," with a mean of 3.10, a standard deviation of 0.91, and a variance of 0.83, "the ability to assign unique identifiers for each aip," with a mean of 3.32, a standard deviation of 1.00, and a variance of 1.01, and "the ability to support premis or local preservation metadata schema," with a mean of 3.30, a standard deviation of 0.98, and a variance of 0.96, all have high standard deviations and variance scores. these responses were somewhat perplexing to the researchers because each of these questions covers core elements of digital preservation. these inconclusive results lend further support to claims of a disconnect between digital preservation theory and daily practices. finally, respondents were also divided over the ability of the new system to "support multiple copies of the repository — including dark and light (open and closed) instances." unlike topics mentioned in the previous paragraphs, this question's mean scored slightly lower than 3.0, indicating that it was not favored by most respondents. while it is possible that low scores were a result of genuine disinterest in the ability to store multiple copies, the researchers believe that the wording of the question may also be playing a role in the final score. the question fails to provide a storage location or examples of specific mechanisms that support multiple repository copies. not stipulating a location or tool may have misled respondents into thinking that these copies would be stored locally as opposed to being distributed geographically.
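as an aside, the checksum and fixity functionality that respondents rated so highly above amounts to a small piece of code in practice. the sketch below is a minimal illustration, assuming digital assets are ordinary files on local disk; the function names and the choice of sha-256 are illustrative assumptions rather than features of any particular dams.

```python
# minimal sketch: checksum generation at ingest and later fixity verification.
# assumes assets are plain files on disk; sha-256 and the function names are
# illustrative, not taken from any specific dams.
import hashlib
from pathlib import Path


def generate_checksum(path: Path, algorithm: str = "sha256") -> str:
    """compute a checksum for an asset at ingest time."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # read in 1 mib chunks so large audiovisual files do not exhaust memory
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, recorded: str, algorithm: str = "sha256") -> bool:
    """recompute the checksum and compare it with the value stored at ingest."""
    return generate_checksum(path, algorithm) == recorded


if __name__ == "__main__":
    asset = Path("example.tif")            # hypothetical ingested asset
    stored = generate_checksum(asset)      # value recorded in preservation metadata
    print("fixity intact:", verify_fixity(asset, stored))
```

a dams would store the recorded value in its preservation metadata and rerun the comparison on a schedule, but the underlying check is no more complicated than this.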
responses to "the ability to change interface features of the "new dams" to fit local needs," with a mean of 3.80, a standard deviation of 0.41, and a variance of 0.17, and "the ability to custom brand the interface of the "new dams,"" with a mean of 3.75, a standard deviation of 0.55, and a variance of 0.30, scored very important. however, respondents scored "the governing organization will do custom branding or feature selection for subscribing institutions," with a mean of 2.40, standard deviation of 1.35, and a variance of 1.83, as not important. survey responses illustrate that respondents desire the freedom to make interface decisions and changes, even with the option of having the governing organization maintain and customize the interface on behalf of the subscribing institution. additionally, respondents ranked "the 'new dams' can be accessed by mobile or tablet users," with a mean of 3.30, a standard deviation of 0.80, and a variance of 0.64, as important. this last consideration is especially significant because it shows an awareness of current and future technology developments. the majority of web traffic now comes from mobile devices,60 a trend that will only increase and potentially lead to more diverse access points with the advent of wearable technology and the so-called "internet of things". technical support responses to "there is dedicated technical support available from the developer or from a third party," with a mean of 3.45, a standard deviation of 0.76, and a variance of 0.58, ranked as important among respondents. when it comes to this support, respondents felt that the "promptness of customer support service," with a mean of 3.30, a standard deviation of 0.86, and a variance of 0.75, was also important. with a mean of 3.05, a high standard deviation of 1.23, and a variance of 1.52, results for "there is an active developer community around the "new dams"" showed a lack of consensus among respondents. this result suggests that respondents desire the ability to solicit support for technical problems from a designated group. initially these results seem contradictory to the researchers' argument that institutions are moving towards local control. however, the ideas of local control and third party support are not mutually exclusive; while institutions' autonomy and control over dams should increase, vendors and governing organizations are still necessary to the long term maintenance and reliability of dams software.   5.2 in what "direction" are institutions migrating? while just over half of the respondents originated with a proprietary dams, nearly two-thirds of the same institutions selected open source dams as the systems to which they were migrating. these results indicate that there is a trend to move towards open source software when migrating from one dams to another. this movement aligns with the other trend that institutions desire more local control and autonomy over their dams, and they are moving to the software solutions that will better meet these needs. graph: migration from "old" to "new" dams by type   6 conclusion   6.1 limitations during the analysis of the data, the researchers identified several limitations with this study. because researchers defined the scope of the project to include only those repositories administering digitized special collections materials, they did not ask questions nor collect data focused on other types of repositories (i.e. discipline, institutional, or data repositories). 
additionally, the way that the researchers constructed survey questions may have impacted the study's results. there are several examples that illuminate this limitation. first, the researchers did not supply enough appropriate answer options or failed to "follow up" on broad questions with more specific questions. in the metadata section, for example, the researchers focused entirely on user-created vocabularies, and did not include examples of added-value metadata, e.g. annotations. second, broad questions regarding aip creation and system integration with additional preservation tools left the researchers with little understanding of how respondents desired this process to work. third, the vocabulary used in particular questions may have created ambiguity for survey participants. this could have allowed one respondent to interpret a question differently from another respondent. for example, the survey question "there is an active developer community around the 'new dams'" may have implied a demand for an in-house developer as opposed to a broad user community of adopters. finally, the researchers did not have a thorough enough understanding of some key concepts related to this survey. for example, they did not fully realize the nuances of oss, especially regarding the complexity and overlap between hybrid oss and proprietary systems.

6.2 next steps and future research
in the process of conducting this study, the researchers identified areas of future inquiry. research can build on the existing data set created through this study. for example, analyzing motivations by library type (academic, public, special, government) might yield different results depending on the demographics of an institution. similarly, analyzing survey results from the topics that received high importance scores but were eliminated from this paper because they fell outside of the top five most important could reveal additional motivations. subsequent surveys could complement the results of this one, particularly those investigating migration patterns among other types of repositories excluded from this study. this research may also inform longitudinal studies focused on how the idea and adoption of oss in libraries has changed over time. finally, this type of study has exposed other related and unexplored research topics, including the overlap of homegrown and oss dams systems and the creation of a rubric for selecting a library dams for migration. in conclusion, the researchers believe that both the motivations for migrating from one dams to another and the trend of institutions moving from proprietary software to oss derive from their desire for increased self-control over library systems. dams governing bodies and vendors should note these trends. organizations should incorporate more strategic input and active participation from their customers if they are to continue to be relevant for the library community.

acknowledgements
the authors would like to extend their sincerest thanks to the following individuals for their valuable feedback: members of the university of houston libraries' writers support group, as well as keith komos, sean watkins, drew krewer, and micah kenfield.

notes
1 "what is proprietary software?" techopedia, (2015).
2 k.t. anuradha, r. sivakaminathan, and p. arun kumar, "open-source tools for enhancing full-text searching of opacs," program 45, no. 2 (april 26, 2011): 231-39. http://doi.org/10.1108/00330331111129750
venkatesh, "development of digital assets management software for research centre imarat," desidoc journal of library & information technology 32, no. 5 (sep. 2012). 4 the researchers used the term 'digital asset management system" instead of 'digital library' or 'digital repository' because it seemed to carry less preconceived connotation. it is not within the scope of this paper to explain the nuanced differences between these terms. for a more in-depth discussion on the concepts of dams, digital repositories, and digital libraries, please see the following articles: christine l. borgman, "what are digital libraries? competing visions," information processing and management 35, no. 3 (may 1999): 227—43, http://doi.org/10.1016/s0306-4573(98)00059-4; digital library federation, "a working definition of digital library," (1998); clifford lynch, "digital collections, digital libraries and the digitization of cultural heritage information," first monday 7, no. 5 (may 6, 2002). 5 richard pearce-moses, "digital assets management system," a glossary of archival and records terminology, society of american archivists (2005). 6 h. frank cervone, "some considerations when selecting digital library software," oclc systems & services: international digital library 22, no. 2 (2006): 107-110, http://doi.org/10.1108/10650750610663987 7 dion hoe-lian goh, alton chua, davina anqi khoo, emily boon-hui khoo, eric bok-tong mak, maple wen-min ng, "a checklist for evaluating open source digital library software," online information review 30, no. 4 (july 13, 2006): 360—379, http://doi.org/10.1108/14684520610686283 8 jody l deridder, "choosing software for a digital library," library hi tech news 24, no. 9 (2007): 19—21. deridder emphasized that the needs assessment should address: computer hardware and personnel, users of the digital library, software programming needs, the cost of maintaining the digital library, and the interoperability of metadata (pp. 19—21). 9 jennifer l. marill and edward c. luczak, "evaluation of digital repository software at the national library of medicine," d-lib magazine 15, no. 5/6 (may 2009). http://doi.org/10.1045/may2009-marill 10 jon w. dunn, donald byrd, mark notess, jenn riley, and ryan scherle, "variations2: retrieving and using music in an academic setting," communications of the acm 49, no. 8 (august 1, 2006): 53. http://doi.org/10.1145/1145287.1145314 11 dunn et al., "variations2," p. 53. 12 ibid. 13 ibid. for example, users were able to annotate content and variations2 would save these interactions to local metadata files for later reuse. the variations projects are also significant because the research conducted with the experimental variations2 system led to variations3 and then ultimately to the avalon media system that is being developed in conjunction with northwestern university; see: avalon media system project, "the project," avalon media system, 2015. 14 lisa gregory, "migrating from oclc's digital archive to duracloud," (presentation, best practices exchange, december 2012). 15 gregory, "migrating from oclc," slides 12-30. 16 see search the lowcountry digital library. 17 heather gilbert and tyler mobley, "breaking up with contentdm: why and how one institution took the leap to open source," the code4lib journal, no. 20 (april 17, 2013). 18 gilbert and mobley, "breaking up with contentdm." 19 ibid. 20 ibid. 21 ibid. 22 joy marie perrin, "contentdm to dspace — why?," (poster presentation, texas conference on digital libraries, 2013). 
23 jan hutař, "archives new zealand migration from fedora commons to the rosetta digital preservation system," ipres2013 proceedings, (2013).
24 exlibris rosetta.
25 hutař, "archives new zealand migration." anz and nlnz were also both brought under the new zealand department of internal affairs, providing further support for using a single system and building on the work that had already been done.
26 ibid.
27 ibid.
28 ibid.
29 the florida council of state university libraries digital initiatives subcommittee, "digital initiatives subcommittee survey: digital library inventory," csul meeting at university of west florida, (2009).
30 gail clement, laurie n. taylor, mark v. sullivan, lee dotson, "features desired in a digital library system," (university of florida libraries: gainesville, fl, 2010).
31 emily gore, lee dotson, lois widmer, "digital library platform working group final report," (2012).
32 kinza masood and anna neatrour, "digital asset management system options: report of the university of utah libraries dams review task force" (webinar, online, february 6, 2014).
33 masood and neatrour, "digital asset management system options".
34 a third trend that emerged from these case studies was the absence of any formal methodology used to evaluate and select a dams to which to migrate. the researchers will not be addressing this issue in their current study.
35 "about survey flow," qualtrics.
36 "skip logic," qualtrics.
37 most important to least important.
38 deridder, "choosing software for a digital library," pp. 19—21.
39 goh et al., "a checklist for evaluating open source digital library software," pp. 360—379.
40 marill and luczak, "evaluation of digital repository software."
41 mathieu andro, emmanuelle asselin, and marc maisonneuve, "digital libraries: comparison of 10 software," library collections, acquisition and technical services 36, no. 3—4 (2012): 79—83. http://doi.org/10.1016/j.lcats.2012.05.002
42 see eligibility questions.
43 total does not sum to 100% due to rounding.
44 researchers believe that this total resulted from some institutions having multiple "old dams."
45 total does not sum to 100% due to rounding.
46 researchers believe that this total resulted from some institutions condensing multiple "old dams" into one "new dams."
47 total does not sum to 100% due to rounding.
48 total does not sum to 100% due to rounding.
49 aip is an acronym for archival information package. for more information on the role of aips in digital preservation, please see the open archival information system reference model.
50 as mentioned in the methodology, while the researchers would prefer to analyze and discuss the results of every question from the survey in this paper, due to time and space constraints this is not possible. the researchers discussed needs for each of the top five topics.
51 "recommended format specifications — resources (preservation, library of congress)," web page, recommended format specifications.
52 defined here as: "includes two or more content files (and their format variants or derivatives) and corresponding metadata. the content files are related as parts of a whole and are sequenced logically, such as pages." california digital library, "glossary," cdl glossary.
53 see survey: "identifying motivations for dams migration: a survey".
54 for example, every respondent answered that "the ability to support descriptive metadata standards" was 'important' or 'very important'.
55 "features | fedora repository," duraspace.
56 "describe and manage module," kuali. 57 in the context of this paper. 58 especially considering the proliferation of tablets, touchscreens, and phablets, which are defined as "a smartphone having a screen which is intermediate in size between that of a typical smartphone and a tablet computer." "phablet," oxford dictionary. 59 michael seadle, "archiving in the networked world: authenticity and integrity," library hi tech 30, no. 3 (2012): 545-552, http://doi.org/10.1108/07378831211266654. as michael seadle (2012) states, checksums "give a reasonable approximation of whether two files are identical" by "adding up the number of bytes or bits in a file or part of a file. the checksum from a file ought to be identical with its copy. any change indicates an integrity loss" (p. 551). seadle also states "not all checksum algorithms will necessarily detect a simple situation where two bits have flipped, but most bit-rot problems and almost any deliberate alteration of the digital object tend to create changes on a larger scale, making checksums an effective means of integrity assurance" (p. 551). for more on the role of integrity in digital objects, see clifford lynch, "authenticity and integrity in the digital environment: an exploratory analysis of the central role of trust", council on library and information resources, washington, dc, (2002); reagan w. moore and mackenzie smith, "automated validation of trusted digital repository assessment criteria", journal of digital information 8, no. 2 (2007); seamus ross, "position paper on integrity and authenticity of digital cultural heritage objects", digitcult: integrity and authenticity of digital cultural heritage objects 1, (2002). 60 sarah perez, "majority of digital media consumption now takes place in mobile apps," techcrunch.   bibliography [1] "about survey flow." qualtrics. [2] andro, mathieu, emmanuelle asselin, and marc maisonneuve. "digital libraries: comparison of 10 software." library collections, acquisition and technical services 36, no. 3—4 (2012): 79—83. http://doi.org/10.1016/j.lcats.2012.05.002 [3] anuradha, k.t., sivakaminathan, r., and kumar, p. arun. "open-source tools for enhancing full-text searching of opacs." program 45, no. 2 (april 26, 2011): 231—39. http://doi.org/10.1108/00330331111129750 [4] borgman, christine l. "what are digital libraries? competing visions." inf. process. manage. 35, no. 3 (may 1999): 227—43. http://doi.org/10.1016/s0306-4573(98)00059-4 [5] california digital library. "glossary." cdl glossary [6] cervone, h. frank. "some considerations when selecting digital library software." oclc systems & services 22, no. 2 (april 2006): 107—110. http://doi.org/10.1108/10650750610663987 [7] clement, gail, laurie n. taylor, mark v. sullivan, and lee dotson. "features desired in a digital library system (2010)." university of florida george a. smathers library. september 8, 2010. [8] deridder, jody l. "choosing software for a digital library." library hi tech news 24, no. 9 (2007): 19—21. http://doi.org/10.1108/07419050710874223 [9] "describe and manage module." [10] "digital initiatives subcommittee (disc) survey: digital library inventory." csul meeting at university of west florida, 2009-09-10." september 10, 2009. [11] digital library federation. "a working definition of digital library [1998]." text. none, december 7, 2004. http://old.diglib.org/about/dldefinition.htm [12] dion hoe-lian goh, alton chua, davina anqi khoo, emily boon-hui khoo, eric bok-tong mak, and maple wen-min ng. 
"a checklist for evaluating open source digital library software." online information review 30, no. 4 (july 1, 2006): 360—79. http://doi.org/10.1108/14684520610686283 [13] dunn, jon w., donald byrd, mark notess, jenn riley, and ryan scherle. "variations2: retrieving and using music in an academic setting." communications of the acm 49, no. 8 (august 1, 2006): 53. http://doi.org/10.1145/1145287.1145314 [14] gilbert, heather, and tyler mobley. "breaking up with contentdm: why and how one institution took the leap to open source." the code4lib journal, no. 20 (april 17, 2013). [15] gore, emily, lee dotson, and lois widmer. "digital library platform working group final report." islandora flvc. february 20, 2012. [16] gregory, lisa. "migrating from oclc's digital archive to duracloud." 19:32:14 utc. [17] hutař, jan. "archives new zealand migration from fedora commons to the rosetta digital preservation system." ipres2013 proceedings, 2013. [18] lowcountry digital library. "search the lowcountry digital library." (2012). [19] lynch, clifford. "digital collections, digital libraries and the digitization of cultural heritage information." first monday 7, no. 5 (may 6, 2002). [20] lynch, clifford. "authenticity and integrity in the digital environment: an exploratory analysis of the central role of trust." in authenticity in a digital environment. washington, d.c.: council on library and information resources, 2000. [21] marill, jennifer l., and edward c. luczak. "evaluation of digital repository software at the national library of medicine." d-lib magazine 15, no. 5/6 (may 2009). http://doi.org/10.1045/may2009-marill [22] masood, kinza, and anna neatrour. "digital asset management system options: report of the univeristy of utah libraries dams review task force." webinar, online, february 6, 2014. [23] moore, reagan w., and mackenzie smith. "automated validation of trusted digital repository assessment criteria," september 2007. [24] pearce-moses, richard. "digital assets management system." society of american archivists, 2005. [25] perez, sarah. "majority of digital media consumption now takes place in mobile apps." techcrunch. [26] perrin, joy marie. "contentdm to dspace why?" in tcdl 2013, 2013. [27] phablet — definition of phablet in english from the oxford dictionary. [28] "recommended format specifications — resources (preservation, library of congress)." web page. recommended format specifications. [29] "reference model for an open archival information system (oais)." space data and information transfer systems, no. 2 (2012). [30] "rosetta digital management and preservation: overview." ex libris: the bridge to knowledge. [31] ross, seamus. "position paper on integrity and authenticity of digital cultural heritage objects." digicult, no. 1 (august 2002): 6—8. [32] seadle, michael. "archiving in the networked world: authenticity and integrity." edited by kathrin ambrozic. library hi tech 30, no. 3 (august 31, 2012): 545—552. http://doi.org/10.1108/07378831211266654 [33] "skip logic." qualtrics. [34] venkatesh, n, k. nageswara rao, and s. kalpavalli. "development of digital assets management software for research centre imarat." desidoc journal of library & information technology 32, no. 5 (september 2012): 447—51. [35] "what is proprietary software? — definition from techopedia," techopedia.com.   about the authors ayla stein is a metadata librarian at the university of illinois at urbana-champaign. 
she holds a ba in east asian studies from the university of arizona and an ms in information from the university of michigan.

santi thompson is currently the head of digital repository services at the university of houston libraries. he holds an mlis and an ma in public history from the university of south carolina.

copyright © 2015 ayla stein and santi thompson

d-lib magazine november/december 2013 volume 19, number 11/12

2012 census of open access repositories in germany: turning perceived knowledge into sound understanding
paul vierkant humboldt-universität zu berlin paul.vierkant@hu-berlin.de doi:10.1045/november2013-vierkant

abstract
germany's open access repository landscape is one of the largest in the world. it is shaped by institutional, subject and cross-institutional repositories serving different needs, which range, for example, from a mere theses server to a repository integrated into an institutional information infrastructure. to date this landscape has never been fully surveyed. this article presents and interprets the results of a 2012 census of open access repositories in germany. this census covered crucial issues ranging from repository size and software, through various value-added services, to general aspects of open access. the key findings of this survey shall help stakeholders in their decision making by identifying trends in the development of open access repositories in germany.

introduction
in early 2012 i came across an interesting study that investigated deposit rates in dutch open access repositories. the so-called "census of open access repositories in the netherlands" covered, among other aspects, issues such as location and document type of deposit. [1] the data used in this study was collected from narcis, the national academic research and collaborations information system [2], a national information infrastructure which also facilitates easy information gathering for this sort of study. this was a perfect example of a study influencing other ideas: it made me wonder about the situation of publication deposits in german open access repositories. several extensive studies on institutional repositories have been conducted in north america [3,4]. however, although i have worked as an open access professional and advocate for a number of years, i had never heard of such a comparative study in germany. this was especially irritating since germany and its repositories had seemed to be at the forefront of the open access movement for quite some time. the more i thought about the development of open access repositories in germany, and my alleged knowledge of it, the more i was convinced that my thoughts did not have any kind of empirical basis. looking back into the mist of the past it seemed unclear whether it was just a few steps that we had progressed, or rather a long and winding road that we had taken. this uncertainty as to how to evaluate earlier achievements is best described by ulrich herb's critical article on how germany's open access community has become too harmonic, lacking friction and self-reflection [5].
in this context, the absence of a study covering germany's green road — its open access repositories — seems odd given the fact that germany has a strong open access community. hence, the present 2012 census of open access repositories (hereinafter "census") tries to fill this knowledge gap by assessing different characteristics of open access repositories in germany. it is fully recognized that other surveys have previously been conducted in germany. the data collected, however, concentrated on just a small number of repositories and fewer issues — too small to draw general conclusions for all of germany [6], [7]. having taken a holistic approach from the start, we soon realized that the census would need future iterations to improve its structure and broaden its horizon, providing a solid basis for future evaluations of open access repositories in germany. the goal of the census was therefore to analyze as many aspects of an open access repository from different perspectives as possible. it should be noted that this survey is not flawless, but rather a first effort to be improved in future iterations. the results of this effort shall point out best practice examples and help stakeholders to improve open access repositories on different levels in germany.

materials and methods
realizing the idea of a census of open access repositories at the information management department at the berlin school of library and information science was only possible due to the diligent and (voluntary) effort of michaela voigt, jens dupski, sammy david and myself. designing the census from scratch, all kinds of questions touching technical, functional as well as structural issues were raised, collected and discussed. we soon discovered that in order to be able to answer all questions raised, we would have to do a survey interviewing repository managers. due to limited resources, however, we opted not to interview repository managers. instead, we decided to autonomously check the web pages of the open access repositories for issues that could be addressed (semi-)automatically or where citable resources offered reliable data. this approach was certainly a more laborious one, yet by doing so we did not depend on the answers of repository managers, which would very likely have resulted in a low response rate adversely affecting the data collected. even at this stage in the census we discovered that any future census should cover the following issues (not exhaustive): is the repository integrated into a cris (current research information system) or university bibliography? is the upload form of the repository open to users or is it restricted (e.g. registration, fee, etc.) or even closed (e.g. institutional affiliation)? which persistent identifier system (e.g. doi, urn, etc.) does the repository support? which licenses (creative commons, deposit licenses, etc.) does the repository support? does the repository offer "enhanced publications"? does the repository support versioning of publications? which author identifier system (orcid, pnd, etc.) does the repository support? does the institution responsible for the repository have a corresponding open access policy? in addition, we discovered the urgent need for an adequate definition of what actually makes up an open access repository. this should differentiate between institutional and subject repositories. despite extensive research, no definition suited the needs of our study to classify open access repositories.
we therefore developed the following definition: the census "[...] definition of open access repository includes repositories that are institutional, cross-institutional or disciplinary providing (in the majority of cases) full-text open access scientific publications together with descriptive metadata through a gui (with search/browse functionality). the repositories are registered with a functioning and harvestable base url in at least one of the following registries: roar, opendoar, oai, dini and base." [8] it is worth noting that we did not take into account digital collections, open access journal (aggregators) or research data repositories, despite the fact that these services are listed in open access repository registries, and despite fitting a broader interpretation of an open access repository. our decision not to include them was based on the fact that the structure, scope and content of these services are very different in character, making them difficult to compare. altogether, 293 services (including several duplicates) were found in all five registries, out of which a total of 141 (date of survey: 2012-02-14) offered a functioning and harvestable base url and at the same time suited our definition of an open access repository (data was collected from 2012-03-09 until 2012-09-20) [9]. it is worth returning to the aforementioned questions raised in the design of the census. the following aspects of an open access repository were addressed in the study:
size of open access repositories by means of amount of content
geographical distribution of open access repositories on a national level
open access repository hosting services
value-added services
language support of open access repositories
open access repository software
metadata formats
open access repository registries
support of the open access movement (e.g. open access fund)

results and discussion

size of open access repositories
to get an overview of how much content german open access repositories really provide, we looked to the most important service provider of open access resources: the bielefeld academic search engine (base), which supplied us with the number of open access items held in each german open access repository. the sizes of open access repositories not listed in base (base covered 94%, see also registries of open access repositories) or with a temporarily unavailable base url were manually checked on the websites of the respective repositories on the very same day (date of survey 2012-09-14). we preferred using the term "item" in contrast to "document" or "publication" for several reasons. first of all, despite the focus on publications there might be all kinds of data formats (e.g. audiovisual and powerpoint files) in a repository. secondly, and even more importantly, a harvesting service cannot verify whether a metadata entry has a full text or not. base states that "[a]bout 70-80% of the indexed documents in base are open access, the rest are mere metadata entries without full text or can only be accessed, if you are authorized for accessing this particular data source." [10] due to the enormous difference in size between the smallest and the largest repository, and to maintain comparability, we categorized them into three size ranges each covering a similar number of open access repositories (see table 1).
table 1: size ranges of open access repositories in germany
size range (open access items) | total number of open access repositories (%)
1 — 1,000 (small) | 57 (41%)
1,001 — 5,000 (medium) | 47 (33%)
5,001 — 50,000 (large) | 37 (26%)

repositories containing 1-1,000 open access items make up the biggest share (41%) of all 141 german open access repositories. this might result from limited (financial) resources to promote the local service, but also from the fact that the institutions running these repositories are indeed small. to identify the reasons for this correlation, a future census will have to classify the repositories according to their function (see the materials and methods section above). such a future census could categorize a responsible institution according to its type [11]. knowing the type of higher education institution or research institution (both hereinafter: "institution") could tell us which institutions actually run a repository.

figure 1: size ranges of and software used for open access repositories in germany.

taking all 141 open access repositories in germany into account, the total number of items is 704,121, resulting in an average size of 4,994 items. most of germany's open access repositories can be found in the most heavily populated länder north rhine-westphalia (27), baden-württemberg (28) and bavaria (22). a possible correlation between the number of institutions in a land and the number of repositories per land will again be assessed in a future census. among the top five largest open access repositories there are four subject based repositories (see table 2). this result supports the hypothesis that researchers — due to higher visibility in their community — are more willing to deposit their works in subject based repositories than in the repository of their home institution. however, a closer look at the big players reveals that the large number of working papers, conference proceedings, etc., seems to inflate the size of subject based repositories. this is probably due to the publishing behaviour of the researchers in the respective fields, such as economic science (econstor).

table 2: top 5 largest open access repositories in germany
repository | number of items
1. elib publikationen des dlr | 46,136
2. econstor | 45,268
3. german medical science | 41,753
4. pub universität bielefeld | 32,695
5. epic — awi | 29,480

hosting of open access repositories
an interesting aspect of hosting open access repositories is how many, and what kind of, institutions are clients of hosting services. to find out how many open access repositories are hosted, the websites of hosting services in germany were used as a reference (date of survey 2012-04-24) [12-15]. the main finding was that about one third of germany's open access repositories are hosted — this reflects the key role hosting services play. of these 53 repositories nearly all installations (51) are running the german repository software opus. this result confirms the impression that germany is an "opus country". especially the southern länder geographically reflect the history of the opus software, which originated from a research project (see the software section below). moreover, looking at the size of hosted repositories it becomes clear that about 60% of the small repositories are hosted, whereas only a minor part of the larger repositories use this service (see table 3).
table 3: share of hosted open access repositories in germany
size range (open access items) | total number of open access repositories | total number of hosted open access repositories (%)
1 — 1,000 (small) | 57 | 34 (60%)
1,001 — 5,000 (medium) | 47 | 14 (30%)
5,001 — 50,000 (large) | 37 | 5 (14%)

language support
since open access knows no borders, and since most german institutions claim to be international, we studied how many open access repositories offer a german and english gui. one in two of all open access repositories offers a bilingual user interface (76 of 141 repositories, representing 54%, see figure 2). the bigger a repository, the more likely it is to support both languages. main factors might be that smaller institutions do not have an international scope or do not have the resources to maintain all provided information (policies, faq, deposit license, etc.) in english. furthermore, it is questionable whether users searching for publications use an open access repository as a primary search entry point; if they do not, an english gui may not justify the effort needed.

figure 2: repositories running the respective software offering a german and english gui.

nonetheless, to improve the repository service for non-german researchers at their institution who deposit their publications in the repository, it is recommended that repositories and hosting services make some effort to internationalize their services.

value-added services
most value-added services that have been around for years are still not common in many german open access repositories. the present census covers basic services of an open access repository within the following parameters:
bibliographic export (at least one format, e.g. ris) is available on item or collection level.
usage statistics (e.g. downloads, views) are available for unregistered users on item level.
checksums (e.g. md5, sha1) of full-text publications are available on item level.
a functioning rss feed is available on the home or browsing page.
social bookmarking (at least one service, e.g. delicious) is available on item level.
social networking (at least one service, e.g. facebook, twitter or addthis) is available on item level.

bibliographic export
bibliographic export is supported by only 56% of all open access repositories offering standard formats like ris or bibtex (meaning that at least one format, e.g. ris, is available on item or collection level). it is noteworthy that opus is by far the best repository software to choose when looking for bibliographic export (71% of all opus repositories, see figure 3). this result is confirmed by the rather astonishing discovery that the smaller a repository, the more likely it is to support bibliographic export. giving the metadata back to the researchers who often painstakingly produced it makes bibliographic export the most important value-added service of an open access repository. repository operators should therefore consider offering bibliographic export, especially for researchers that use reference management systems.

figure 3: value-added services supported by open access repositories and repository software.
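to illustrate what item-level bibliographic export involves in practice, the sketch below renders a single item's descriptive metadata as an ris record, the kind of output a reference management system can import. the item dictionary and the field mapping are illustrative assumptions, not the export logic of opus or of any other system named above.

```python
# minimal sketch of item-level bibliographic export to the ris format.
# the item fields and their mapping to ris tags are illustrative assumptions.
def item_to_ris(item: dict) -> str:
    """render one repository item as an ris record."""
    lines = ["TY  - " + item.get("type", "GEN")]  # record type, e.g. THES, JOUR
    for author in item.get("authors", []):
        lines.append("AU  - " + author)
    lines.append("TI  - " + item.get("title", ""))
    if "year" in item:
        lines.append("PY  - " + str(item["year"]))
    if "url" in item:
        lines.append("UR  - " + item["url"])
    lines.append("ER  - ")                        # end of record
    return "\n".join(lines)


if __name__ == "__main__":
    example = {
        "type": "THES",
        "authors": ["musterfrau, erika"],          # hypothetical depositor
        "title": "an example dissertation",
        "year": 2012,
        "url": "https://repository.example.org/id/12345",  # placeholder item url
    }
    print(item_to_ris(example))
```

a bibtex export works the same way with a different serialization, which is why offering at least one such format is a comparatively cheap value-added service.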
usage statistics
interest in alternative metrics to measure the impact and importance of scientific works is growing. it thus seems rather odd to find that only one quarter of all open access repositories in germany offer usage statistics, such as the number of downloads or views, available for unregistered users on item level (see figure 3). the following is less surprising, though: the bigger a repository, the more likely it is to offer usage statistics. this may be due to the fact that larger institutions try to find new ways to measure and quantify the success of their publication output. this kind of institutional evaluation is also asked for by funding organizations. the small number of repositories offering public usage statistics for their users clearly shows the need for projects such as open access statistics that try to promote internationally comparable usage statistics in germany [16]. besides bibliographic export, providing usage statistics is another crucial value-added service that researchers can directly benefit from. in times of "publish or perish" such metrics are more than just numbers. they can influence hiring and promotion of researchers or funding of research projects. moreover, such statistics could provide new arguments for the potential impact of open access publications that once had the status of being closed access.

checksums
the provision of checksums to detect loss of integrity and authenticity is a somewhat technical service, focused on long-term archiving aspects. availability of checksums (e.g. md5, sha1) of full-text publications on item level is supported by 36% of open access repositories in germany. one in two opus installations offers checksums, making the german repository software the leading product in this category [17]. this finding is supported by the surprising correlation that the bigger a repository, the less likely it is to publicly provide a checksum. larger institutions seem to question the spirit and purpose of checksums as a basic tool for researchers when they decide not to publicly offer this information although their software supports it. repository operators should internally track the integrity and authenticity of their stored documents; however, from a researcher's point of view the relevance of checksums seems questionable.

figure 4: value-added services supported by open access repositories and repository software.

rss feed
a website that offers an rss feed provides a common service to regularly supply users with information. almost half of all german open access repositories support this tool for researchers to monitor repository content on the home or browsing page (see figure 4). when it comes to rss feeds, size doesn't matter: small, medium and large repositories are equally likely to offer this basic service. since the usual use of an rss feed is subject oriented, the need for such a service depends on the kind of repository. it is unlikely that researchers subscribe to an rss feed from their institutional repository, which already covers publications from all academic fields present at the institution. however, an rss feed of a subject repository is certainly a realistic use case that repository operators should consider.

social bookmarking
social bookmarking is an informational tool of the web 2.0 era. as with rss feeds, about one in two german open access repositories support social bookmarking, meaning that at least one service, e.g. delicious, is available on item level (see figure 4). most repositories running opus provide social bookmarking for their users, since older versions of the software offer this service out of the box. strangely enough, the bigger a repository, the less likely it is to support social bookmarking.
one reason why the majority of institutions running an open access repository do not support such a tool of information supply and sharing might be that users prefer social bookmarking plugins in their browsers to social bookmarking buttons on websites.

social networks
what is true for social bookmarking is even truer for social networks: a mere 11% of all open access repositories in germany offer at least one service, e.g. a facebook, twitter or addthis button, on item level (see figure 4). however, larger institutions use the social web in the academic sphere to enhance the visibility of the publications in their repository. a conclusion one could draw from the correlation is that the bigger a repository, the more likely it is to have integrated social network functions. overall, the census reveals that the tools examined above are to a large extent not supported as value-added services in german open access repositories. a lack of resources might be one of the main causes for small repositories. large repositories might be reluctant to offer these services because they fear the effort needed to set up and maintain them every time there is a software update. another reason might be that institutions running an open access repository (rightly) question the utility of the examined value-added services. open access repositories should aim to be an integrated part of the research and publishing process by offering basic services such as bibliographic export or usage statistics. services such as social bookmarking and social network integration could easily be provided by adding an html snippet to the repository web pages. there is still room for improvement in the realm of value-added services in most open access repositories in germany. whether this improvement is sought depends not so much on the resources available, but rather on institutions accepting that they run a repository primarily as a service to their researchers.

software
repository software is an issue that touches almost all aspects of the census: a repository can only be as good as its software. looking at the different software solutions in germany, three major players — opus, dspace and eprints — can be identified apart from several smaller proprietary developments. with 77 out of 141 repositories using opus, software developed in a joint project of several german universities funded by the german research foundation (dfg), germany can truly be called an "opus country". as mentioned before, due to hosting services, opus installations are spread over germany, though with a strong focus on southern germany. despite difficulties in the development of the latest version, opus remains the preferred software, mainly because it is already compliant with german-specific requirements such as the obligatory deposit of theses with the german national library (dnb). additionally, opus is used by 75% of all small and 55% of all medium open access repositories (see figure 1). this could either indicate that opus fits the needs of smaller institutions or simply reflect the aforementioned number of opus-hosted repositories. dspace repositories can only be found in the northwest of germany (see figure 5). only 9 out of 141 are running dspace — the most widely used repository software in the world.
this seems like a rather contradictory situation: whereas germany is one of the top three countries regarding the total number of repositories [18], the internationally leading software dspace (40.6% worldwide) is comparatively underrepresented in germany (4.6%). this might stem from the dominance of opus. eprints is used by 19 institutions, primarily from the western and southern parts of germany (see figure 5). eprints was developed in southampton, uk, and seems to be an adequate solution for larger repositories (see figure 1): five repositories in the top ten of the largest open access repositories in germany are running the software. in germany one in four institutions uses repository software other than opus, dspace or eprints to run their open access repository. there are repositories all over germany running on proprietary developments or less well-known repository software. this is also true for many large repositories, with 46% of them running "other" software.

figure 5: national distribution of repository software in germany. [these maps were created using "locator map berlin in germany.svg" by nordnordwest, wikimedia commons, licensed under cc-by-sa-3.0-de.]

the 2012 repository landscape of germany can be characterized by "the national software" opus and minor proprietary regional or local developments such as mycore [19]. furthermore, the software maps also indicate a regional concentration of dspace and eprints installations, which might stem from possible networking of institutions in the respective länder. the development of new features and sustainability are fundamental issues in the choice of repository software, not least because funding and resources for open access repositories are notoriously low, as a small survey during a dini repository management workshop in 2012 in göttingen, germany, confirmed [20]. this situation will eventually bring german repository operators to choose software that is in use internationally. first steps towards this goal can be seen in: the networking of dspace and eprints repositories using national mailing lists and workshops to spread their software and to build a community that would make new developments of features more feasible; and the fact that from early 2012 on the library service centre baden-wuerttemberg (bsz), one of the three major hosting services in germany, offers a dspace hosting service [21]. with future needs, such as implementing semantic web standards or the integration of repositories into current research information systems (cris), this concentration and evolution of repository software will continue in germany and beyond.

metadata formats
from a librarian's perspective, even more important than the size, value-added services and software of an open access repository are the supported metadata formats. studying the offered metadata formats gives a bigger picture of which metadata standards are de facto standards and which are merely theoretical. a prerequisite was that repositories offer their metadata formats via oai-pmh. the listed metadata formats ("?verb=listmetadataformats") were validated, and only de facto functioning metadata formats were taken into account (period of survey: 2012-06/07); a minimal sketch of such a check is given below.

figure 6: metadata formats supported by german open access repositories.

simple dublin core is supported by 99% (see figure 6) of all german open access repositories, which makes it the only de facto metadata standard. two reasons for the widespread dissemination of simple dc might be its simplicity and the fact that it has been around for many years (dating back to 1998) [22].
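the (semi-)automatic validation mentioned above boils down to querying each repository's oai-pmh base url and parsing the response. the sketch below is a minimal illustration using only the python standard library; the base url is a placeholder, not one of the surveyed repositories.

```python
# minimal sketch: ask an oai-pmh endpoint which metadata formats it serves
# via ?verb=ListMetadataFormats. the base url below is a placeholder.
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_metadata_formats(base_url: str) -> list[str]:
    """return the metadataPrefix values advertised by an oai-pmh endpoint."""
    with urllib.request.urlopen(base_url + "?verb=ListMetadataFormats", timeout=30) as response:
        tree = ET.parse(response)
    return [el.text for el in tree.iter(OAI_NS + "metadataPrefix") if el.text]


if __name__ == "__main__":
    # hypothetical base url of a repository registered in one of the registries
    print(list_metadata_formats("https://repository.example.org/oai"))
    # expected output is a list such as ['oai_dc', 'xmetadissplus']
```

a fuller check would also harvest a sample record in each advertised format to verify that the format is not merely listed but actually delivered.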
the simplicity of simple dublin core is also a reason why more complex metadata formats were developed in the past. however, with a dissemination of merely 4%, qualified dublin core does not seem to have found its way into germany's open access repositories. although compatible with dublin core, xmetadiss is supported by less than one third of all german open access repositories (see figure 6). xmetadiss is a german metadata format that was introduced in the late 1990s as a national standard for the submission of doctoral dissertations to the german national library [23]. its successor xmetadissplus did not become a national standard either (43% support xmetadissplus, see figure 6). these figures suggest that either open access repositories use other methods to submit their thesis metadata to the national library or that these repositories do not contain theses and thus do not need this metadata format.

figure 7: linked open data supported by german open access repositories.

when it comes to linked data, only a few german open access repositories support semantic web standards such as the resource description framework, rdf [24] (7%), or open archives initiative object reuse and exchange, ore [25] (2%, see figure 7). this suggests that there is still a long journey ahead to reach the realm of the semantic web. in sum, the dissemination of metadata standards in german open access repositories is less a result of a format's number of years of existence than of its supposed complexity. hence, simple dublin core via oai-pmh is the general rule today, and linked data the exception.

registries of open access repositories
when it comes to visibility of open access repositories, registries play an essential role. however, the coverage ranges from 56% in the oai data provider list to 94% in base (see figure 8, date of survey: 2012-04-04). in addition to the small number of registered open access repositories, the number of duplicates and outdated base urls confirms the impression that some of these registries are obsolete. for example, the following repositories have duplicates in roar: "hochschulschriftenserver der katholischen universität eichstätt-ingolstadt" is indexed three times (id: 958, 2391, 3126), the "gkss publication database for open access full text documents" twice (id: 3406, 579), and "goescholar" twice (id: 3904, 590). the base url of the "volltextserver der fachhochschule würzburg-schweinfurt" is outdated in roar (id 974) and in the oai data provider list (last checked on 2013-03-14).

figure 8: coverage of all 141 german repositories in repository registries.

as a "living" registry depending heavily on the reliability of its sources, the meta search engine base is the only registry covering almost all german open access repositories. in order to be listed in the census, a repository had to be registered in at least one of the five registries. bearing that in mind, it is surprising that only 40% of germany's open access repositories were registered in all five registries.

open access
the final section of the census deals with selected issues of the open access movement in germany. not the repositories as such but the institutions running an open access repository were examined according to the following measures:
signing the berlin declaration on open access to knowledge in the sciences and humanities. (if an umbrella organization, e.g. the
helmholtz association of national research centers, signed the berlin declaration, all subsidiary institutions running an open access repository were counted; date of survey: 2012-04-27.) [26]
offering an open access publication fund supported by the german research foundation dfg, a special fund for articles in fully gold open access journals. (institutions taking part in the funding program "open access publizieren"; date of survey: 2012-04-26.) [27]
being a member of the confederation of open access repositories (coar). (if an umbrella organization, e.g. the helmholtz association of national research centers, is a member of coar, all subsidiary institutions running an open access repository were counted; date of survey: 2012-04-26.) [28]

constituting a reasonable share, more than one quarter of all institutions running open access repositories took at least one of the three measures. institutions running large repositories are in the majority among those in all three categories. this might be due to the fact that larger institutions either have enough staff or resources to support open access at their institution in different ways (e.g. with an open access publication fund) or that their administration regards open access as important enough to support it ideologically (e.g. by signing the berlin declaration).

figure 9: dissemination of dfg supported open access publication funds in germany. [this map was created using "locator map berlin in germany.svg" by nordnordwest, wikimedia commons, licensed under cc-by-sa-3.0-de]

looking at the geographical distribution of open access publication funds supported by the deutsche forschungsgemeinschaft (german research foundation, dfg), one might think that the former inner german border still existed (see figure 9). it is surprising that by spring 2012 not one institution running an open access repository came from the new länder, i.e., the former east. however, since spring 2012 tu dresden, tu chemnitz and fu berlin have offered a dfg-supported open access publication fund. one reason might be that open access plays a minor role and stakeholders in institutions of the german east thus set other priorities than applying for dfg funding. however, it is more likely that most institutions from this region have difficulty meeting the application requirements for dfg funding (e.g. a university bibliography documenting the institutional publication output). these requirements particularly exclude smaller institutions with limited resources from dfg funding. this funding strategy leaves researchers at smaller institutions behind, and benefits researchers working at institutions that are probably already "open access players." a remedy could be found if the dfg provided open access funds on länder or even national level so that all researchers in germany have equal chances to receive funding for their open access publications. yet the strength of germany's open access movement cannot be measured using the present criteria alone. there is massive support for open access at many german institutions that goes beyond simply being signatories of the berlin declaration and providing open access funds. with networks such as the deutsche initiative für netzwerkinformation e.v.
(german initiative for network information, dini) [29], the schwerpunktinitiative "digitale information" der allianz der wissenschaftsorganisationen (priority initiative "digital information") [30], and the aktionsbündnis urheberrecht für bildung und wissenschaft (coalition for action "copyright for education and research") [31], the german community is at the global forefront of open access. funding organizations such as the dfg promote open access by funding projects that support the setup of repositories and other services. open access repositories are a vibrant part of this movement, with ever-changing roles ranging from being solely a thesis server to being an integrated part of library retrieval or research information systems.   conclusions the 2012 census of open access repositories in germany, which touches on several essential aspects of running an open access repository, leaves us with the following seven key findings: (1) opus is the most used repository software in germany; however, a trend towards globally used software such as dspace and eprints is visible. (2) simple dublin core is the only metadata format supported by german open access repositories that deserves to be called a standard. (3) in comparison to all other repository registries, the bielefeld academic search engine (base) has the best coverage of german open access repositories. (4) more than one quarter of institutions running a repository support open access in a variety of ways. (5) most small open access repositories in germany are hosted. (6) one in two open access repositories in germany has a bilingual gui supporting german and english. (7) there is a need for more basic value-added services in germany's open access repositories. to sum up, the 2012 census of open access repositories in germany represents an unprecedented snapshot of germany's repository landscape, providing the community with substantial information about crucial issues of repository management. the census revealed shortcomings and strengths of repository software, supporting institutions in future decisions concerning the development of their repositories. furthermore, the census indicated a strong and growing open access movement in germany. ultimately, the census points out that most open access repositories in germany today lack basic functions such as bibliographic export, usage statistics, social media integration or multilingual support. the present study provided us with unprecedented findings but left many questions unanswered. the information management department at the berlin school of library and information science will conduct a future census of open access repositories in germany in the course of a seminar that will attempt to address unresolved issues and critically assess the census's structure and scope [32]. the ascent of the green road to open access depends heavily on the visibility, features and functionalities that lead to the acceptance of open access repositories. if the uploading, searching and exporting of references are integrated into the everyday life of researchers, the promotion and uptake of open access will be facilitated. however, the advent of open access in research depends not only on the repository itself but also on the enduring will of repository managers, funders and other stakeholders to tackle the issues presented in this article.   acknowledgements michaela voigt, jens dupski and sammy david, humboldt-universität zu berlin, conceived and performed the study.
credit also goes to mathias lösch, university library bielefeld, germany providing essential data from the bielefeld academic search engine (base) and maxi kindling, prof. dr. peter schirmbacher and najla rettberg for their extensive feedback.   references [1] gerritsma w (2012) a census of open access repositories in the netherlands. wow! wouter on the web: comments on the library and information science world. [2] narcis: the gateway to scholarly information in the netherlands (2013). narcis national academic research and collaborations information system. [3] rieh sy, markey k, st jean b, yakel e, kim j (2007) census of institutional repositories in the us: a comparison across institutions at different stages of ir development. d-lib magazine 13: 4. http://doi.org/10.1045/november2007-rieh [4] markey k, rieh sy, st. jean b, kim j, yakel e (2007) census of institutional repositories in the united states: miracle project research findings. clir publication no. 140. council on library and information resources. [5] herb u (2012) die open-access-community: harmonie, fehlende reibung und die vorstellung des goldenen open access. telepolis science news. [6] windisch n (2009) repositorien an wissenschaftlichen einrichtungen: bestandsaufnahme und ausblick berlin: humboldt-universität zu berlin. [7] offhaus n (2012) institutionelle repositorien und universitätsbibliotheken — entwicklungsstand und perspektiven.köln: fachhochschule, institut füür informationswissenschaft. [8] vierkant p, voigt m, dupski j, david s, lösch m (2012) 2012 census of open access repositories [9] vierkant p, voigt m, david s, lösch m, dupski j (2013) 2012 census of open access repositories in germany. figshare. http://doi.org/10.6084/m9.figshare.677099. [10] bielefeld university library (2013) faq — bielefeld academic search engine. base. [11] german rectors' conference (2013) higher education institution. hochschulkompass. [12] bsz (2013) opus im bibliotheksservice-zentrum baden-württemberg. [13] kobv (2013) opus & archivierung — referenzen/teilnehmer. [14] hbz (2011) hbz — opus. [15] open repository (2013) customers | open repository registered dspace provider. [16] dini e.v. (2013) why oa-statistics?. [17] kobv (2013) opus 4: überblick. [18] opendoar (2013) proportion of repositories by country worldwide. [19] mycore (2013) mycore. [20] dini e.v. (2012) dini-workshop: repositorymanagement 2012. [21] bsz (2013) dspace. [22] dublin core metadata initiative (1998) dublin core metadata element set, version 1.0: reference description. [23] deutsche nationalbibliothek (2005) xmetadiss eine xml-basierte struktur für das metadatenformat metadiss. [24] w3c — rdf working group (2004) resource description framework (rdf). [25] open archives initiative (2013) object reuse and exchange. [26] max planck society (2013) oa mpg signatories. [27] dfg (2013) gepris geförderte projekte informationssystem. [28] coar (2012) members and partners by country. [29] deutsche initiative für netzwerkinformation e. v. (2013) about dini. [30] allianz der deutschen wissenschaftsorganisationen (2013) schwerpunktinitiative "digitale information". [31] aktionsbündnis "urheberrecht für bildung und wissenschaft" (2013) coalition for action "copyright for education and research". [32] schirmbacher p, kindling m (2013) "die digitale forschungswelt" als gegenstand der forschung. lehrstuhl informationsmanagement. information — wissenschaft & praxis 64: 127—136.   
about the author paul vierkant works at the berlin school of library and information science of the humboldt-universität zu berlin, germany, in the re3data.org project, which is creating a global registry of research data repositories. prior to his current position he was project manager at the university of konstanz, building a research data repository and an institutional repository.   copyright © 2013 paul vierkant   digital library as network and community center: a successful model for contribution and use   d-lib magazine december 2006 volume 12 number 12 issn 1082-9873   cathryn a. manduca, sean fox, and ellen r. iverson science education resource center carleton college {cmanduca, sfox, eiverson}@carleton.edu educational digital libraries (dls) are grounded in the premise that educators can do a better job with less effort if they can easily find and reuse materials created by others (1, 2, 3, 4). in this model, success is predicated on the ability of the dl to engage its community in building relevant and useful collections and in using these resources to enhance teaching. in this article we highlight some key approaches that have been successful for the science education resource center at carleton college (serc) in engaging geoscience educators, both as contributors to and users of the teach the earth library. educational dl efforts have faced two major challenges in recent years. first, creation of digital libraries that add value to the resources they index has proven to be labor intensive (5). much of this added value lies in providing context for resources – for example, links between related resources, information that describes how and when resources can be used effectively, or collections of resources that address specific educational issues (6, 7, 8, 9, 10, 11). creating this context requires professional knowledge of the resources and of the target user population; it is still largely a human endeavor and, as such, is labor intensive. second, educators, en masse, have been slow to adopt the use of digital libraries (12). faculty and teachers do not routinely go to the web to find the materials developed by their colleagues for adaptation and use in their teaching. while such behavior could arguably be more efficient and improve instruction by spreading the use of the highest-quality resources, a culture supporting this behavior, particularly at the undergraduate level, is not yet in place (13). the teach the earth digital library for geoscience education arose from our work collaborating with a number of professional development projects to enhance their efforts through the use of the web. by integrating the use of dl tools and services into this work, we were able to bring the expertise and efforts of this professional development community to bear directly on the challenges facing education digital libraries. our partners have been particularly effective in three areas: creating a dl rich in contextual information, creating processes for community contribution and enhancement of these resources (14, 15, 16, 17), and integrating the use of the library into the professional development of geoscience faculty and teachers. creating a library from the work of a group of partners naturally led to a networked, non-centralized approach to search and discovery of resources.
resources of interest within our library can be identified from many dispersed starting search points, yet once a single resource in the library is accessed, the associated context and framework for that resource and all associated resources are apparent. as shown in figure 1, the resulting library design emphasizes the relationships between resources as well as tight integration between library resources and the context needed for effective use. this approach explicitly attempts to align with the diversity of users' information seeking strategies: use of commercial search engines, trusted sources, and links (18). thus we avoid the issues associated with relying on a centralized portal to bring users to the site while addressing the important need for context to support the effective use of our library resources in geoscience education (6, 7). figure 1. bold blue boxes and arrows represent areas where dl value is added as users navigate to content (thin black boxes). in the traditional educational dl, only users who actively decide to use the dl portal benefit. in the networked model, much of the dl value is integrated into the content and presentation of the sites themselves, and so impacts a wider range of users. this approach carries with it some major challenges. bringing together the work of individual projects, each with its own leadership, goals, identity and passions, requires flexibility. it is more difficult to impose uniform standards, to require review using a specific process, or to mandate the breadth of library holdings. creating an integrated network from these resources requires knowledge of the breadth of resources and the breadth of user communities. to date, we have worked within a community we know well, with a limited number of partners whose resources we know well. as the size of our library increases, the number of partners grows, and the diversity of users expands, our methods must be developed into scalable models. our network model for navigation, while promoting exploration, can sometimes leave users feeling lost in the woods. as we move forward, we will need new tools to better understand our users' movement through the library and scalable guidelines for creating a network that supports and leads rather than confuses. we have only begun to address these challenges of a mature digital library. below we describe our work in implementing this community dl model through a set of services that enabled geoscience education projects to collectively build the teach the earth educational digital library. we focus on three aspects of this work: 1) facilitating community publishing, 2) creating a navigational and organizational framework that integrates the work of all included projects into a dl, and 3) identifying the ways in which the network-centric dl that results from these efforts meets users' needs by complementing their natural search behaviors. created by partners – the value proposition a successful educational digital library is as much a social process as a technical problem. it requires creation of a culture that fosters contribution to and use of the library (5, 8, 9, 11). we have addressed creation of this culture by working with nsf-funded projects focused on the professional development of geoscience faculty as teachers. each of these projects partnered with serc to create its project website.
they seek two primary services in this partnership: 1) tools, resources and experts that assist them in creating high quality project websites and 2) placement of their resources in a network that enhances dissemination and use of their work. we created a win-win situation that yields rapid production of content for the library and facilitates use, by allowing our partners the flexibility to meet their own project goals while contributing to the overarching digital library. while the contributing projects all have different goals and strategies, they share a focus on creating resources that improve geoscience education and bringing them into widespread use by faculty and teachers. two strategies have been particularly effective in creating content for the library. first, the projects, which bring substantial professional expertise, contribute resources central to their work in geoscience education and professional development to the library. for example, the on the cutting edge project (19) used the content of a set of very popular workshop notebooks refined over five years of program development to create library contributions on preparing for an academic career, managing your early career, and designing effective courses. these resources are now in widespread use within their workshop series, bringing users to the library as an integral part of their professional development. second, a major goal of many of our partnering projects is to draw on the expertise distributed throughout the geoscience community and convert it to forms that are shareable through the digital library. geoscience faculty rely heavily on informal sharing of information between colleagues to inform their teaching (20). using workshops, editorships, professional society sessions and collaborations, our partners have engaged more than 400 scientists and educators in writing materials for their project websites. this work is inculcating the use of the web as a vehicle for more widespread sharing among colleagues. by aggregating and integrating these contributions, the library provides a robust source of community expertise, and these contributors become agents for promoting use of the library by their colleagues. our work with partnering projects has resulted in robust collections of resources that are rich in the contextual information now recognized as central to the value of educational dls (10, 11, 20). developed as tools for professional development aimed at addressing specific issues in geoscience education, our partners have developed a wide variety of mechanisms for linking resources with the context required for their effective use including: tutorials that embed resource references (21, 22), topical collections of resources for specific geoscience courses (23), and modules that link information on pedagogic methods to teaching activities that use these methods (24). partnering with projects focused on improving education provides a scalable model for addressing the social aspects of library building. in less than five years, through the work of our partners, teach the earth has grown to include 3,600 pages of content, over 1,000 teaching activities and 6,000 cataloged resources. a poll of geoscience faculty, a primary target audience, indicates that use extends beyond the pool of early adopters to approximately 1/4 of geoscience faculty in the us (25). international use is also strong with approximately 1/3 of use coming from time zones that do not cover the united states. 
creating a library from project contributions a digital library is more than a collection of project websites on a common topic. we use two evolving strategies to create a digital library from the individual work of our partner projects: a set of common formats for resource elements that cross projects (e.g. teaching activities) and digital library tools that support referencing and reuse of resources across the projects. this work is facilitated by the common use, by all projects, of our web-based editing system (26) which allows projects to create websites with consistent design and coherent information architectures. common formats for contributing projects many of our partner projects are engaged in sharing teaching materials: examples of different ways of teaching a particular concept; data sets and activities that use them; and course syllabi. to facilitate sharing, we designed single page templates that summarize the information needed to use each of these types of resource effectively: activitysheets (27), datasheets (28), coursesheets (29), and fieldtripsheets (30). these sheets include information about learning goals, context for use, teaching tips, and assessment as well as other information specific to the particular type of resource. sheets go beyond traditional descriptive metadata by including rich contextual information (e.g., how and why to use the resource) as well as links to web resources or downloadable files that enable full implementation of the activity. these sheet formats provide many of the benefits that we associate with traditional metadata due to their structure and consistency. this approach has a number of advantages: the common format allows users to efficiently scan through a large collection of resources after gaining some initial familiarity with the standard structure. this benefit multiplies as a format is adopted by multiple collections and projects. the particular choice of required fields in each template has evolved from experience to include the information users find most valuable and to reflect educational best practice. as authors put their materials online via a sheet format, they are guided by the format itself toward reaping the benefits of insights into user needs gained by other projects and from educational research. for example, activitysheets and coursesheets both emphasize goal setting, an important part of the educational design process (31). sheets link the information needed for use of a resource to the resource itself, acting as a vehicle for sharing individual experiences with the larger community and supporting educational reuse (32, 33, 11). sheets serve as a natural locus around which to link data, activities, references, and other information relevant to the use of a specific resource. they facilitate the creation of resource-centric, rather than metadata-centric, collections (6). adoption of the sheet format by our contributing projects enables automated generation of searchable collections of activities, data sets, and courses across the entire teach the earth library, all of which share a common look and feel (34, 35). these shared formats lie at the heart of creating, what users view as, an integrated library without compromising the need of individual contributing projects to maintain their own visual identity. equally exciting is the ease with which new vocabularies can be attached to specific sheets to create new collections addressing specific topics in teaching. 
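to make the sheet idea concrete, the sketch below models an activitysheet as a small structured record with a vocabulary attached; the field names, vocabulary terms and example content are illustrative assumptions, not serc's actual schema.

```python
# Illustrative sketch of an "activitysheet" as a structured record. The field
# names and vocabulary terms are assumptions for illustration, not the actual
# SERC schema; the point is that consistent required fields plus attachable
# vocabularies let topical collections be generated automatically.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActivitySheet:
    title: str
    learning_goals: List[str]          # emphasizes goal setting
    context_for_use: str               # course level, setting, time needed
    teaching_tips: str
    assessment: str
    resource_urls: List[str] = field(default_factory=list)
    vocab_terms: List[str] = field(default_factory=list)  # e.g. a "quantitative skills" vocabulary

def build_collection(sheets, term):
    """Generate a topical collection by filtering sheets on an attached vocabulary term."""
    return [s for s in sheets if term in s.vocab_terms]

sheets = [
    ActivitySheet(
        title="graphing stream discharge data",
        learning_goals=["read hydrographs", "estimate flood recurrence"],
        context_for_use="introductory geoscience lab, about 50 minutes",
        teaching_tips="have students sketch predictions before plotting",
        assessment="short write-up interpreting one storm event",
        vocab_terms=["quantitative skills", "hydrology"],
    ),
]
print([s.title for s in build_collection(sheets, "quantitative skills")])
```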
for example, this strategy has been used to develop an activity collection focusing on teaching quantitative skills in the geosciences (36). by supporting this creative reuse, sheets move the library beyond a discovery service to a community resource for the generation of new ideas, resources, and knowledge. digital library tools promoting reuse and referencing a primary goal of the library is to foster synergistic relationships between contributors. just as authors of new books and journal articles learn from and reference the work of their predecessors, the library includes a number of tools to promote the ability of new contributions to capitalize on previous ones. these tools help contributors produce quality scholarly materials for their projects and at the same time promote connections between resources in the library in ways that add value for users. contributors have access to a full collection of metadata records describing websites and print references that have been contributed by previous projects (including the full digital library for earth system education catalog). these records can be selected and organized using either of two tools: a resource list tool and a faceted search tool (26). faceted search is driven by a set of overarching vocabularies as well as customized vocabularies that can be created by individual projects to address specific needs of highly targeted user communities (37, 26). this strategy has been used effectively to create collections addressing geology and human health, and geoscience and public policy (23, 38). the catalog record for each resource displays the location of references to it across the library (26). thus a user who finds a resource in one context can quickly move to the other contexts within the library where this resource is referenced (39). similarly, a contributor can understand the previous use of the reference and the implications of that work on the project at hand. contributors can incorporate modules of material created by other projects directly in their project. the information in the module is 'wrapped' with the look and feel of their website and then embedded within that site with conventional links. the service integrates provenance information into the content, attributing authorship with the original source of the module. projects are now employing this service to reuse modules describing teaching methods in association with discipline specific content. for example, a module on teaching with the socratic method was developed by the starting point project for geoscience faculty. this module has been reused by the microbial life educational resources project as a resource for biologists (40). a related links tool enables contributors to make connections between their work and related work in other projects. this tool can be used to make associations between pages at any level of the website. for example, a project top page can reference other project top pages (41); a teaching example can reference a variety of pedagogic modules (42); or the browse interface to a collection can reference other related collections (43). this tool creates a standard format for related links across the library, again facilitating an integrated experience for users. the resulting web of resources complements traditional search and browse and the overall site architecture as a discovery mechanism for users, moving them laterally between resources. 
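the "referenced by" display described above implies an inverted index over the library's link data. the sketch below shows the idea with hypothetical page names and record identifiers; it is an illustration of the concept, not serc's implementation.

```python
# Sketch of a reverse-reference ("referenced by") index: given which catalog
# records each page cites, invert the mapping so a record's display can list
# every context across the library that points to it. Page names and record
# IDs are hypothetical.
from collections import defaultdict

# page -> catalog records it references
references = {
    "cutting_edge/course_design/goals": ["rec:how_people_learn", "rec:understanding_by_design"],
    "starting_point/interactive_lectures": ["rec:how_people_learn"],
    "quantitative_skills/activities/stream_discharge": ["rec:understanding_by_design"],
}

def build_referenced_by(refs):
    """Invert page->records into record->pages (the 'referenced by' listing)."""
    index = defaultdict(list)
    for page, records in refs.items():
        for record in records:
            index[record].append(page)
    return index

referenced_by = build_referenced_by(references)
# A catalog record's page can now offer lateral moves to every context that cites it.
print(referenced_by["rec:how_people_learn"])
# ['cutting_edge/course_design/goals', 'starting_point/interactive_lectures']
```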
as contributors use these digital library tools to their advantage, they enable the development of a networked inter-project digital library where connections, structure and value are embedded at all scales, from individual resources up to the overarching portal. thus, the library has a different look and feel than is traditional, focusing on creating a web of resources. while an overarching portal complements this network, usage data indicate that the portal plays a secondary role in use and discovery through the library. of 900,000 users in 2005, only 3% visited the primary portal. in contrast, interviews with users indicate that cross-site movement initiated from a point within the library where they are comfortable is an important strategy for exploring new resources. a web of resources: taking advantage of users' reliance on google. research on information seeking behavior shows that users explore information on the web using a complex and sophisticated set of interrelated strategies (18, 44). the goal is to maximize the value of the library in supporting this exploration. as emphasized in figure 1, users find added value throughout the library in the form of contextualizing information and connections among resources. figure 2 shows three different paths that users with varying needs, backgrounds, and styles of information seeking behavior might take through a specific slice of resources in the library. arrows represent connections that are facilitated by a range of tools and features in the library design, including faceted searching, highly contextualized topical collections within sites, related links tags, and embedded links. figure 2. there are three primary advantages to this type of library design in supporting users in information seeking behavior: existing user search behavior that uses google as a touchstone is supported. the dl community continues to discuss whether dls can offer better alternatives to commercial internet search engines such as google (45). a resource web acknowledges the powerful and popular role that google and other commercial search engines play in discovering resources and creates strategies that capitalize on this behavior. teach the earth is designed for entry via commercial search engines. the goal is to maximize the ability of users to explore the network of resources from wherever they enter. authors and reviewers are encouraged to make sure that each page can function effectively as a first page, and care is taken to ensure that a user can understand the local content and organization no matter where they enter the site. the resource network enables exploratory searching and supports finding resources more than once. users can explore the site using a variety of navigation elements: library-wide search and browse, project-specific navigation including search and browse across topical collections, related links tags, and embedded links. from any point within the library, users can find a rich set of links for further exploration on multiple axes. in this way they can continually guide their exploration toward resources that are aligned with their interests and expectations (18). interviews indicate that satisfaction with the site reflects confidence that users can find and refind highly relevant resources on the site. this confidence arises from the ability to navigate to the same resource via multiple pathways built from redundant navigational elements.
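one way to picture such redundancy, purely as an illustration, is as a property of the site's link graph: every resource should be reachable from more than one place. the sketch below uses an invented graph and is not a description of serc's actual tooling.

```python
# Hypothetical audit of navigational redundancy: from a site link graph, flag
# pages that can be reached from fewer than two other pages, since such pages
# depend on a single pathway. The graph below is invented for illustration.
from collections import defaultdict

links = {  # source page -> pages it links to
    "portal": ["collection_a", "collection_b"],
    "collection_a": ["activity_1", "activity_2"],
    "collection_b": ["activity_2", "module_x"],
    "module_x": ["activity_1"],
    "activity_1": [],
    "activity_2": [],
}

inbound = defaultdict(set)
for source, targets in links.items():
    for target in targets:
        inbound[target].add(source)

for page in links:
    if page != "portal" and len(inbound[page]) < 2:
        print(f"{page}: only reachable via {sorted(inbound[page])}")
# Here collection_a, collection_b and module_x each have a single inbound link.
```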
the primary challenge of this design is ensuring that users are not overwhelmed or confused. in particular, users note that they have a sense that they cannot comprehend the full scope of the site and they lack confidence that they have found all resources on the site that might be of interest. users seeking a highly structured, linear experience, particularly those who would like a definitive list of related items such as a ranked search return, express frustration with the open ended navigation process. the typical limits of users' search behavior are circumvented, leading to increased exploration on the site. when using discovery interfaces that present resources as a list of isolated, decontextualized search returns, users may have a tendency to stop as soon as they reach a resource that seems good enough rather than digging further (46, 47). this well-documented, satisficing behavior is heavily colored by users' expectations of what might be available. the network dl model focuses on smaller specialized collections. the choices users make in moving through the site guide them to the right general vicinity where they can find small collections well aligned with their particular interests. by the time that users search and obtain a return, they have already identified the group of resources that is most likely to answer their needs. rather than settling for the one or two resources that a broader search engine brings to the top, the small, focused collections help users gain a picture of the most relevant resources. in sum, a networked digital library offers high value to users because: they repeatedly find resources in the library in the course of their normal searching with a commercial search engine. value is added by assisting them from their point of entry to other related resources of high interest. the ability of the network to support rich exploratory searching gives users the confidence that they can find resources that are aligned with their needs and interests more than one time. thus the library becomes a trusted source and a starting point for exploration. users are enticed deep into the library collections to find the best resources for their purpose. they learn through their exploration of the library and leave with resources that are closely aligned with their needs. conclusion teach the earth evolved as a digital library from the collective work of a number of nsf-funded projects focused on professional development in geoscience education. while the resulting library has features that are typical of a digital library (a metadata repository that supports a centralized portal with search and browse capacity) its unique history led to a focus on authoring tools, tools that support reuse and sharing, and a networked structure that promotes the work of individual projects while integrating them into a larger whole. this structure has been advantageous in building a value proposition for contributors who gain valued services through their participation in the library and for users who can find the integrated resources of leaders in geoscience education in ways that effectively integrate with their use of commercial search engines. as a result, teach the earth is sought out as a partner by new projects for the services it provides, and it is used by a significant fraction of the target community (geoscience faculty). 
while challenges of scalability will face the library as it matures, it provides an example of the ways in which a digital library can create an intellectual commons that is a central part of the work of an educational community. acknowledgements the authors are grateful to michelle hall, mike mayhew and dave mogk who provided thoughtful reviews of the manuscript and who have been influential in our thinking about digital libraries. we would also like to thank all of the partners who have collaborated with serc as part of the building of teach the earth. this article reports results from work funded by the national science foundation including that taking place under grants due-0127141, ear-0304762 due 0226243, due 0226199, and due 0532875. any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the national science foundation. references 1. zia, l. (2000). "the nsf national science, mathematics, engineering, and technology education digital library (nsdl) program: a progress report." d-lib magazine, 6(10) . 2. manduca, c. a., f. mcmartin and d. mogk (2001). "pathways to progress: vision and plans for developing the nsdl: report to the national science foundation," . 3. marchioni, g., c. plaisant, et al. (2003). "the people in digital libraries: multifacted approaches to assessing needs and impact." digital library use: social practice in design and evaluation. a. peterson, n. bishop and b. van house. cambridge, massachusetts, the mit press: 119-160. 4. borgman, c. l., l. j. smart, k.a. millwood, j.r. finley, l. champeny, a.j. guilliland-swetland, and g.h. leazer (2005). "comparing faculty information seeking in teaching and research: implications for the design of digital libraries." journal of the american society for information science and technology, 56(6): 636 657. 5. lagoze, c., d. b. krafft, t. cornwall, n. dushay, d. eckstrom, and j. saylor (2006). "metadata aggregation and "automated digital libraries": a retrospective on the nsdl experience." proceedings of the 6th acs/ieee-cs joint conference on digital libraries, acm press. 6. lagoze, c., d. b. kraffti, s. payette, and s. jesuroga (2005). "what is a digital library anymore, anyway? beyond search and access in the nsdl." d-lib magazine, 11(11), . 7. borgman, c. l. (1999). "what are digital libraries? competing visions." information & management, 35(3): 227-243. 8. arms, w. (2000). "automated digital libraries: how effectively can computers be used for skilled professional librarianship?" d-lib magazine, 6(7/8), . 9. manduca, c. a. and d. mogk (2000). "dlese (digital library for earth system education): a community plan," university of oklahoma: 46 p. 10. sumner, t. and m. marlino (2004). "digital libraries and educational practice: a case for new models," proceedings of the 4th acm/ieee-cs joint conference on digital libraries. 11. sumner, t. and m. dawe (2001). "looking at digital library usability from a reuse perspective." international conference on digital libraries. proceedings of the 1st acm/ieee-cs joint conference on digital libraries, roanoke, virginia, acm press. 12. zemsky, r. and w. massey (2004). "why the e-learning boom went bust." the chronicle of higher education, 50(44). 13. hutchings, p. (2001-2002). "reflections on the scholarship of teaching and learning." essays on teaching excellence, 13(5). 14. manduca, c.a., h. macdonald, d. mogk, and b. tewksbury (2006). 
"on the cutting edge: evolving themes, enduring impact." northfield, mn: science education resource center . 15. ledley, t.s., l. dahlman, c. mcauliffe, n. haddad, c.a. manduca, s. fox, d. blaha, r. freuder, and r. downs (2003). "the earth exploration toolbook: facilitating the use of data to teach earth science processes." geological society of america abstracts with programs, 35(6), september 2003, p. 115 . 16. mccaffrey, m. and t. weston (2005). "the climate change collection: a case study on digital library collection review and the integration of research, education and evaluation." proceedings of the 5th acm/ieee-cs joint conference on digital libraries, . 17. manduca, c.a., l. guertin, r.h. macdonald, r. teed (2004). "starting point: resources for faculty teaching entry level geoscience." geological society of america abstracts with programs, 36(5): 276. . 18. marchioni, g. (2006). "exploratory search: from finding to understanding." communications, 49(4): 41 46. 19. for more information about the on the cutting edge faculty development program, see . 20. manduca, c. a., e. iverson, s. fox and f. mcmartin (2005). "influencing user behavior through digital library design: an example from the geosciences." d-lib magazine, 11(5). . 21. see for example, the on the cutting edge module on designing effective online educational resources in the geosciences, . 22. see for example, the earth exploration toolbook website, . 23. see for example, the on the cutting edge module on geology and human health, . 24. see for example, the starting point module on interactive lectures, . 27. examples of activitysheets, . 29. example of coursesheets, . 30. example of fieldtripsheets, . 31. wiggins, g. and j. mctighe (2000). understanding by design, prentice hall. 32. robson, r., g. collier and b. muramatsu (2004). "reusability guidelines." covallis, or, reusable learning, . 33. mcdaris, j., s. fox and c. a. manduca (2004). "supporting reusability." tucson, az, 2004 joint conference on digital libraries, june 2004. 34. see for example, the teach the earth search the site page, . 35. see for example, the using data in the classroom activity browse page, . 36. see for example, the teaching quantitative skills in the geosciences activity browse page, . 37. iverson, e., c. manduca and s. fox (2005). "evaluating vocabulary as a component of faceted search." 2005 nsdl annual meeting, denver co, . 38. see for example, the on the cutting edge module on teaching public policy in the earth sciences, . 39. see for example, the digital catalog record for how people learn: brain, mind, experience, and school published by national academy press, . 40. see for example, the microbial life educational resources teaching activities page, . 41. see for example, the teaching quantitative skills in the geosciences website, . 42. see for example, the starting point module on teaching with models, . 43. see for example, on the cutting edge module on teaching hydrogeology in the 21st century activity browse page, . 44. borgman, c. l., l. j. smart, k. a. millwood, j. r. finley, l. champeny, a. j. gilliland-swetland and g. h. leazer (2005). "comparing faculty information seeking in teaching and research: implications for the design of digital libraries." journal of the american society for information science and technology, 56(6): 636-657. 45. lossau, n. (2004). "search engine technology and digital libraries: libraries need to discover the academic internet." d-lib magazine, 10(6). . 46. zach, l. (2004). "when is enough enough? 
modeling the information-seeking and stopping behavior of senior arts administrators." journal of the american society for information science and technology, 56(1): 23-35. 47. mcmartin, f., e. iverson, et al. (2006). "factors motivating use of digital libraries." proceedings of the 6th acm/ieee-cs joint conference on digital libraries, acm press. copyright © 2006 cathryn a. manduca, sean fox, ellen r. iverson doi:10.1045/december2006-manduca   the core: digital library education in library and information science programs   d-lib magazine november 2006 volume 12 number 11 issn 1082-9873   jeffrey pomerantz1, 2, sanghee oh1, seungwon yang3, edward a. fox3 and barbara m. wildemuth1 1school of information and library science, university of north carolina chapel hill, chapel hill, nc 27599-3360; 2+1 919-962-8366; 3department of computer science, virginia tech, blacksburg, va 24061 abstract this paper identifies the "state of the art" in digital library education in library and information science programs, by identifying the readings that are assigned in digital library courses and the topics of these readings. the most frequently assigned readings are identified at multiple units of analysis, as are the topics on which readings are most frequently assigned. while no core set of readings emerged, there was significant consensus on the authors to be included in digital library course reading assignments, as well as on the topics to be covered. implications for the range of readings and topics assigned in digital library education in lis programs are discussed. introduction hundreds of millions of dollars have been invested in digital library (dl) research since the early 1990s. much of this research has addressed how dls can aid education, but there has been no parallel investment in supporting teaching and learning about dls. such research investment is of ongoing importance in the united states and other nations (e.g., australia, china, india, japan, and many european nations) where significant dl development is being undertaken. without investment in education related to dls, we face a future with many digital libraries, but few digital librarians to ensure their success. we run the risk of developers of digital library systems building software that is seriously flawed, since they will not be aware of crucial requirements, efficient and effective techniques for implementation, or key ingredients of success. end users already face a confusing situation in which their ability to work with useful information is limited by failures of usability and interoperability. sponsors of some early digital libraries now wonder about their sustainability, or are concerned about their long-term viability with regard to digital preservation. those involved in requirements analysis, design, development, management, and utilization of many types of related advanced information systems also face similar problems, which might be avoided with the help of those who have had formal training regarding dls. there are, however, currently no formal degree programs in digital librarianship.
a few library and information science (lis) and computer science (cs) programs offer specific courses on dls, and a small number of lis programs have begun offering certificate programs in digital librarianship. there is, however, little agreement as to the content and scope of these courses and programs, and little coordination between institutions. while the computing curriculum 2001 (joint task force, 2001), a collaborative effort of acm and ieee-cs to define curricula for cs and related programs, includes dls as one of fourteen knowledge modules under information management, no further work has been supported to develop a dl curriculum for cs beyond the brief cc2001 description. research question this paper is an attempt to identify the state of the art in dl education in lis programs. the authors are currently working on a similar investigation of cs programs, and plan a future paper on that topic and comparisons between lis and cs curricula. the present paper, however, is the first step in an effort to identify how the topic of dls is being taught in lis programs. this investigation was guided by the following specific research questions: what readings are assigned in courses on digital libraries in library and information science programs? is there a core group of readings? what is the distribution of readings among the various topics in these courses? previous work library and information science has always been a field concerned with the education of future librarians, and, like many professions, with the integration of research with practice in the field and in the classroom. consequently, there has been a small but steady stream of studies of lis curricula over the years. many of these studies are concerned with the subtopics addressed in lis courses on various subjects. many subject areas in lis programs have been studied in this way: bibliographic instruction (larson and meltzer, 1987), business information (white, 2004), information technology (he, 1999), the economics of information (weech, 1994), and popular culture (moran, 1985), to name a few, as well as programs' core courses (irwin, 2002). none of the existing studies of lis curricula or syllabi, however, have been concerned with courses on dls. most of these studies simply list the topics addressed in courses. this is useful information for identifying the scope of what is taught about a subject in lis, but does not provide more specific information about which topics may be considered to be more or less important. only white (2004) goes farther, and provides the number of courses in which topics are addressed. further, none of these studies reports on the specific readings that are assigned in these courses to address these topics. only a few studies achieve this level of detail: joudrey (2002) reports on the textbooks and chan (1987) reports on both the textbooks and supplementary readings used in courses that address cataloging and classification. nicholson (2005) reports on the authors and readings assigned in courses on generalized search skills. larsen (1979) reports on the reference sources introduced in basic reference courses, though not the textbooks or the supplementary readings assigned in these courses. methodology identification of courses and collection of syllabi the methods for this study were derived from those used by joudrey (2002) and nicholson (2005). 
the lists of course offerings in lis programs were viewed on the open web, and courses on the topic of dls were identified based on their titles and descriptions. syllabi were collected from courses in which the phrase "digital library" or "digital libraries" was used in either the course title or the short catalog description. syllabi were collected from the open web, as many programs' websites have links to course syllabi and many instructors have links to the syllabi of courses that they teach. where syllabi were not available on the web, the most recent instructor of the course was contacted by email or telephone and asked for a copy of the syllabus. where no instructor was listed on the program's website, the program's main office was contacted. only the syllabus from the most recent semester in which the course was offered was collected; thus, only one syllabus per course was considered. all courses in lis programs address dl-related topics to a certain extent, as dls are at the intersection of most or arguably all of the topics addressed in modern library science education. for example, classification and information retrieval are both critical to digital libraries, but no syllabi were collected for courses specifically on these topics. limiting the collected syllabi to those courses that explicitly use the phrase "digital library" or "digital libraries" was a somewhat arbitrary decision, but it was necessary so that this study did not expand into an analysis of entire lis curricula. while some lis programs offer undergraduate majors or minors, syllabi were collected only from programs and courses at the graduate level. limiting the scope of the study to graduate programs enabled the authors to use a pre-existing list of graduate programs of acknowledged quality: the american library association's (ala) list of accredited master's programs in library and information studies [1], which contained 56 programs as of this writing. analysis of the assigned readings all readings (i.e., books, book chapters, and articles) listed on the collected syllabi were entered into a citation management database. the way in which readings are listed on syllabi varies greatly: some instructors require students to purchase specific books, and some simply recommend a set of books that students may purchase if they so desire. some instructors assign a fixed set of readings for each class session and some do not. some instructors assign entire book chapters and some assign sections of chapters. as a result, this step in the methodology required making some assumptions about the appropriate unit of analysis for identifying readings. in the end, multiple units of analysis were employed: an author, an entire book, an entire journal, and a single journal article. where an instructor assigned only part of a book chapter or part of a journal article, that assignment was "rounded up," as it were, to the larger unit.
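as an illustration of counting at these multiple units of analysis, the sketch below tallies a few hypothetical reading records; the field names are stand-ins for whatever the citation management database exports, and co-authored works contribute one count to each author, as described in the results below.

```python
# Sketch of tallying assigned readings at several units of analysis (author,
# book, journal). Records and field names are hypothetical stand-ins for the
# citation management database; a partial chapter is assumed to have already
# been "rounded up" to its chapter or book.
from collections import Counter

readings = [  # one entry per reading per syllabus
    {"authors": ["witten, i. h.", "bainbridge, d."], "book": "how to build a digital library", "journal": None},
    {"authors": ["arms, w. y."], "book": "digital libraries", "journal": None},
    {"authors": ["borgman, c. l."], "book": None, "journal": "information processing & management"},
    {"authors": ["arms, w. y."], "book": None, "journal": "d-lib magazine"},
]

author_counts = Counter(a for r in readings for a in r["authors"])   # every co-author counted once
book_counts = Counter(r["book"] for r in readings if r["book"])
journal_counts = Counter(r["journal"] for r in readings if r["journal"])

print(author_counts.most_common(3))
print(book_counts.most_common(1))
print(journal_counts.most_common(2))
```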
finally, every reading from every syllabus was classified by topic. these topics were those identified by the current authors in their earlier analysis of the computing curriculum 2001 (joint task force, 2001). the cc2001 discussion regarding the field of information management (joint task force, 2001, p. 140) provided the starting point for our analysis, with a focus on three core areas (information models and systems, database systems, and data modeling), as well as the four elective areas most related to library and information science (information storage and retrieval, hypertext and hypermedia, multimedia information and systems, and digital libraries). cc2001 lists a set of topics under each of these areas. the topics suggested by cc2001 were validated through an examination of published papers from dl-related venues (the acm conference on digital libraries, jcdl, and d-lib magazine) (pomerantz et al., 2006). together, these papers represent a significant portion of the published literature on dls. thus the topics into which readings were classified represent the current state of the art and recommended best practices in dl research and education. the result of these preliminary analyses was the list of topics in figure 1.   figure 1: topics in digital libraries. the methodology for the current analysis of course readings is similar to that performed by pomerantz et al. (2006) in their analysis of published papers. every reading in the corpus was assigned to the single topic, from the list of core topics in figure 1, most closely related to the reading's content focus. in the process of the topic classification, the topics assigned by the course instructors to the class sessions in which these readings were used were considered, and the original materials were also analyzed to enhance the accuracy of the subject classification. this assignment was performed using methods employed by librarians assigning subject headings: reading the title and abstract, and skimming the actual paper to achieve an understanding of its content. two members of the research team independently assigned each reading to a topic. these assignments were subsequently reviewed and any differences in categorization were resolved through discussion. this analysis was inductive in the sense that the topic classification scheme was elaborated as new subtopics were identified from the data. results most frequently assigned readings of the 56 ala-accredited lis master's programs, 29 offer dl-related courses; thus, 52% of accredited lis programs offer courses on dls. in these 29 programs, 40 dl-related courses were offered in recent years, between 2003 and 2006. some of these programs offer two or even three dl-related courses. in programs where multiple dl courses are offered, generally one course is a broad-based introductory dl course, and the other courses are "special topics" or clinical courses on specific technologies or services for dls. we were able to collect syllabi and their reading lists from 33 courses in 23 programs. the most recent syllabus collected was from the spring semester 2006, while the oldest was from the fall semester 2003. as mentioned above, syllabi were collected only for the most recent semester in which the course was offered; of the 29 programs, none offered a dl-related course every semester, and 7 offered a course once a year. the rest of the programs did not have a regular dl course offering, and many new courses were scheduled to be offered in 2005 and 2006. of the 40 courses identified, syllabi for 26 (65%) were available on the open web. of these 26 syllabi, 6 did not provide the course reading lists. thus complete course materials were maintained online for only 20 syllabi (50% of courses identified).
this percentage is somewhat disappointing, given the courses' focus on digital libraries and the fact that 100% of the lis programs on the ala list of accredited master's programs maintain websites. to obtain those syllabi and reading lists that were not available on the open web, we contacted the instructors of these courses individually via email or telephone. a total of 1,777 titles for readings were identified in the collected syllabi, where a reading was defined as a book, book chapter, journal, journal article, report, or online source. of these 1,777 titles, 80 were excluded from this analysis, for a total of 1,697 titles. the 80 titles that were excluded were those books from which only specific chapters were assigned; those individual chapters were included in the analysis. many readings were assigned in only one course, thus giving the frequency distribution of readings a very long tail. for this reason, only the top few readings of each type are shown in the tables below. the full sets of readings are available at: . table 1 shows the top 5 most frequently assigned book titles in dl courses. these are single- or multiply-authored books, not edited compilations. the second column shows the number of courses in which the book was assigned. students in these courses did not necessarily read these books cover to cover, however; most often individual chapters were assigned for individual class sessions. further, these books were not always required for these courses.
table 1: most frequently assigned books (columns: book | # of assignments | required in # courses | recommended in # courses | # of unique chapters assigned)
witten, i. h., & bainbridge, d. (2003). how to build a digital library. san francisco, ca: morgan kaufmann publishers. | 13 | 9 | 4 | 9 of 9
arms, w. y. (2000). digital libraries. cambridge, ma: the mit press. | 12 | 9 | 3 | 12 of 14
borgman, c. l. (2000). from gutenberg to the global information infrastructure. cambridge, ma: the mit press. | 9 | 6 | 3 | 9 of 9
lesk, m. (2004). understanding digital libraries (second ed.). san francisco, ca: morgan kaufmann publishers. | 8 | 7 | 0 | 14 of 14
chowdhury, g. g., & chowdhury, s. (2003). introduction to digital libraries. london: facet. | 7 | 5 | 2 | 15 of 15
of course, what qualifies as a book is less clear than it once may have been. for example, arms' book, digital libraries, was first published in print, and is now available in full text in several locations online. clearly the print version is a book, but should the online versions also be considered to be books? more confusingly, the work introduction to metadata: pathways to digital information [2], edited by murtha baca, assigned in several dl courses, is dual-published by the j. paul getty trust, both in print and online. for the purposes of this study, both of these examples were categorized as books. table 2 shows the top 13 most frequently assigned journal articles in dl courses. there appear to be two distinct sets of articles represented among those most frequently assigned: overview articles, such as borgman (1999) and schwartz (2000), and articles on specific topics, such as arms, blanchi & overly (1997) and lynch (2005).
table 2: most frequently assigned journal articles (columns: article | # of assignments)
borgman, c. l. (1999). what are digital libraries? competing visions. information processing & management, 35(3), 227-243. | 9
bush, v. (1945). as we may think. the atlantic monthly, 101-108. | 8
schwartz, c. (2000). digital libraries: an overview. journal of academic librarianship, 26(6), 385-394. | 7
choudhury, g. s., hobbs, b., lorie, m., & flores, n. e. (2002). a framework for evaluating digital library services. d-lib magazine, 8(7/8). | 5
arms, w. y., blanchi, c., & overly, e. a. (1997). an architecture for information in digital libraries. d-lib magazine, 3(2). | 5
levy, d. m. (2000). digital libraries and the problem of purpose. d-lib magazine, 6(1). | 5
mccray, a. t., & gallagher, m. e. (2001). principles for digital library development. communications of the acm, 44(5), 48-54. | 5
lynch, c. (2005). where do we go from here? the next decade for digital libraries. d-lib magazine, 11(7/8). | 5
arms, w. y., hillmann, d., lagoze, c., krafft, d., marisa, c., saylor, j., et al. (2002). a spectrum of interoperability: the site for science prototype for the nsdl. d-lib magazine, 8(1). | 4
hill, l. l., carver, l., larsgaard, m., dolin, r., smith, t. r., & frew, j. (2000). alexandria digital library: user evaluation studies and system design. journal of the american society for information science, 51(3), 246-259. | 4
lynch, c. (2002). digital collections, digital libraries and the digitization of cultural heritage information. first monday, 7(5). | 4
lossau, n. (2004). search engine technology and digital libraries: libraries need to discover the academic internet. d-lib magazine, 10(6). | 4
van house, n., butler, m. h., ogle, v., & schiff, l. (1996). user-centered iterative design for digital libraries. d-lib magazine. | 4
table 3 shows the top 13 most frequently assigned journals in dl courses. a total of 121 journals were identified in the collected syllabi; journals that changed names (e.g., the journal of the american society for information science and the journal of the american society for information science & technology) were considered to be a single journal. the middle column of table 3 shows the number of courses in which an article from the journal was assigned, and the righthand column shows the number of unique articles from these journals assigned across all courses. these data include two special issues on the topic of digital libraries: communications of the acm vol. 44 no. 5, may 2001; and journal of the american society for information science & technology vol. 49 no. 11, 1998. the entirety of each of these special issues (i.e., all the individual articles in the special issue) was assigned in one of the courses included in this study; thus, each article in these special issues was counted in the righthand column.
table 3: most frequently assigned journals (columns: journal name | # of assignments | # of unique articles assigned)
d-lib magazine | 154 | 94
communications of the acm | 27 | 23
journal of the american society for information science (& technology) | 25 | 19
first monday | 21 | 15
library trends | 19 | 13
information processing & management | 15 | 8
ariadne | 14 | 11
computers in libraries | 14 | 14
library journal | 12 | 12
information today | 11 | 11
online | 11 | 11
journal of academic librarianship | 10 | 4
library hi tech | 10 | 8
table 4 shows the top 10 most frequently assigned authors in dl courses. an author was counted once for each reading on a syllabus on which his or her name appeared. authors of all genres of reading were included: authors of books, book chapters, journal articles, etc. this includes multiply-authored works, for which every author was counted once. a total of 949 authors were identified.
as in table 3, the middle column of table 4 shows the number of courses in which a work by an author was assigned, and the right-hand column shows the number of unique works by these authors assigned across all courses.

table 4: most frequently assigned authors
author | # of assignments | # of unique works assigned
arms, william y. | 36 | 10
borgman, christine l. | 27 | 10
bainbridge, david | 26 | 8
witten, ian h. | 26 | 8
lesk, michael | 18 | 6
lynch, clifford | 18 | 9
lagoze, carl | 15 | 7
marchionini, gary | 15 | 12
chowdhury, sudatta | 14 | 5
smith, abby | 14 | 8

works with multiple authors present an interesting contrast when looking at authors versus the works themselves. note that both ian witten and david bainbridge are among the most frequently assigned authors. witten and bainbridge are frequent co-authors: bainbridge is a co-author of 7 of the 8 publications bearing witten's name that were identified on dl syllabi, and witten is a co-author of 7 of the 8 identified publications bearing bainbridge's name. note also that witten and bainbridge's book, how to build a digital library, is the most-assigned book on dl syllabi.

classification by topic

the distribution of readings from dl course syllabi is illustrated in figure 2. readings on project management and on dl architectures dominated the syllabi. other topics that were common across many of the syllabi included collection development, information/knowledge organization, and overviews of the area.

figure 2: distribution of readings across topics

undoubtedly all of the courses examined address some dl-related topics that are not addressed in readings, through lectures, class discussions, assignments, or some other means. it is assumed, however, that readings are assigned to address the most important topics in each course and in each class session. thus, while an analysis of the readings from courses does not provide a complete view of all of the topics addressed in these courses, it does provide a view of the topical highlights and of the topics the instructors consider most important.

discussion

this study found results similar to nicholson's (2005) findings: there is no core set of readings assigned in dl courses, but there is a core set of authors whose works are assigned. this study also found that there is a core set of journals from which readings are assigned. the authors hypothesize that an analysis of readings in many topical areas of lis would show a core set of authors and journals, but no core set of readings. nicholson suggests two possible explanations for this phenomenon: first, that as syllabi get updated, more recent works by authors are substituted for older works, creating a patchwork when looking across syllabi; and second, that since lis is a new field, core works have not yet emerged. we suggest a third possible explanation: within lis, there may be no agreement about which specific works are seminal, but there is agreement that certain topical areas are central to the field. this would seem to be the case within the arena of dls: certain topics are agreed upon as falling under the broad umbrella of dls, though there is no agreement as to precisely which works are best read to introduce those topics in a course. this lack of agreement manifests itself to varying degrees in the consistency of the readings assigned in courses by a specific author, from a specific journal, or from a specific book.
for example, works by arms are frequently assigned, and which of arms' works are assigned is fairly consistent across courses (10 unique works in 36 assignments). on the other hand, works by marchionini are also frequently assigned, but there is little consistency in these assignments across courses (12 unique works in 15 assignments). as dl curricula continue to evolve, particularly if supported by funded development and dissemination projects (such as the unc-vt project described in pomerantz et al., 2006), we may see increases in the consistency of assigned readings across courses; that is, a core set of readings may begin to emerge. it is in the nature of teaching that courses and course reading lists change over time, both when a course is offered by the same instructor across semesters and when a course is offered by a new instructor. thus, the set of readings analyzed in this study is simply a snapshot of the state of digital library education in the spring 2006 semester. it is possible that different readings could emerge as most frequently assigned if this same study were conducted with syllabi from different semesters. we doubt, however, that this would be the case. we expect that, if this study were replicated, the same authors, books, and journal articles would emerge as the most frequently assigned. the order of the top few most-assigned authors, books, or articles might vary, but the set would likely be consistent. whether this expectation is fulfilled or not, it is clear that certain authors, books, and articles at present have a great impact on the teaching of digital library courses in lis programs, and they may well continue to do so in the future.

in this vein, an interesting issue arises with regard to the use of books in courses, and whether or not a book is "required" for a course. as is made clear above, whole books and chapters from books are used in many courses in which those books are not required: that is, the instructor does not require that students in the course purchase the book. an instructor may not wish to require that students purchase a book because it is expensive, or an instructor may have created a course pack; there may be any number of reasons why students are not required to purchase a book for a course. further, even if a book is required in a course, a student still may not purchase it, but instead may take it out of a library. this is especially true for students in lis programs: what student knows better than a library school student to borrow a book from a library instead of purchasing it? thus a book may be widely used in courses, but still not be widely purchased, at least not by students. and, as students must certainly be a large percentage of the market for the books named above, it may appear to the publishers that these books are not commercial successes, when in fact they are very widely used. lesk's (2004) book, understanding digital libraries, for example, was enough of a commercial success that the publisher issued a second edition, but not all publishers may be so enlightened. we encourage publishers of books that are widely used in courses to use metrics other than sales to evaluate the success of their books.

as noted earlier, the articles assigned in dl courses tended to fall into two groups: overview articles and articles on specific topics.
the overview articles, such as borgman (1999) and schwartz (2000), tended to be assigned near the beginning of the semester, and were presumably assigned as a way for the instructor to introduce the broad topic of dls. articles on specific topics, such as arms, blanchi & overly (1997) and lynch (2005), were assigned, presumably, as a way for the instructor to introduce that specific topic. other articles on that topic may exist, but the instructor was more likely to select a familiar article or an article that was likely to provide fodder for class discussion. as courses on dl topics evolve, we are likely to see more convergence on the selection of the overview articles, while there may continue to be variety in those articles used to introduce particular topics.

while publications and conference papers on the topic of architecture were the most frequently identified in pomerantz et al. (2006), readings on the topic of project management were the most frequently assigned in this analysis. project management is followed by architecture and collection development. this also differs from the findings in pomerantz et al., where the topic of services contained the second greatest number of papers. to a certain extent this is due to the fact that some of the readings that are classified here as project management and as archiving and preservation integrity would have been classified as services in the scheme used in the earlier paper. perhaps a more important factor, however, is that it would be difficult, if not impossible, to teach a dl course without addressing the topics of architecture and collection development, but papers on these topics are not frequently presented at jcdl or published in d-lib magazine. indeed, the topic of project management did not even crop up in pomerantz et al.'s analysis of published papers, but it is another important topic to address in a dl course, particularly in any course that involves a practical assignment to build a dl.

digital library-related topics are central to much of the curriculum in lis programs, and are addressed in many courses across the curriculum. this raises the question of whether, in the long run, courses specifically dedicated to digital libraries have a future. there is, for example, no course in lis programs on physical libraries. from a certain point of view, digital libraries are simply environments in which many principles and tools from across the lis curriculum come together. time will tell whether courses specifically dedicated to dls will persist in lis curricula, or if the topics addressed in dl courses will be integrated into other, more topically focused courses across lis curricula. given the spread of dls over the past decade and a half, however, the authors predict that dl courses will become more rather than less common.

acknowledgements

many thanks are due to kristina spurgin, for performing the rather tedious task of collecting most of the syllabi and entering readings into a database. thanks to ian witten for raising the issue of a book's use in courses versus its commercial success. this material is based upon work supported by the national science foundation under grant nos. iis-0535057 (vt) and iis-0535060 (unc-ch).

notes

[1] .
[2] .

references

arms, w. y. (2000). digital libraries. cambridge, ma: the mit press.
arms, w. y., blanchi, c., & overly, e. a. (1997). an architecture for information in digital libraries. d-lib magazine, 3(2).
arms, w. y., hillmann, d., lagoze, c., krafft, d., marisa, c., saylor, j., et al. (2002). a spectrum of interoperability: the site for science prototype for the nsdl. d-lib magazine, 8(1).
borgman, c. l. (1999). what are digital libraries? competing visions. information processing & management, 35(3), 227-243.
borgman, c. l. (2000). from gutenberg to the global information infrastructure. cambridge, ma: the mit press.
bush, v. (1945). as we may think. the atlantic monthly, 101-108.
chan, l. m. (1987). instructional materials used in teaching cataloging and classification. cataloging and classification quarterly, 7(4), 131-144.
chowdhury, g. g., & chowdhury, s. (2003). introduction to digital libraries. london: facet.
he, s. (1999). a content analysis of it-oriented courses in lis curricula: distribution, changing ratio, curricula topics and consensus among four american lis schools. journal of information, communication and library science, 6(2), 1-9.
hill, l. l., carver, l., larsgaard, m., dolin, r., smith, t. r., & frew, j. (2000). alexandria digital library: user evaluation studies and system design. journal of the american society for information science, 51(3), 246-259.
irwin, r. (2002). characterizing the core: what catalog descriptions of mandatory courses reveal about lis schools and librarianship. journal of education for library and information science, 43(2), 175-184.
joint task force on computing curricula, ieee computer society [and] association for computing machinery. (2001). computing curricula 2001: computer science. final report.
joudrey, d. n. (2002). textbooks used in bibliographic control education courses. cataloging and classification quarterly, 34(1/2), 103-120.
larsen, j. c. (1979). information sources currently studied in general reference courses. rq, 18(4), 341-348.
larson, m. e., & meltzer, e. (1987). education for bibliographic instruction. journal of education for library and information science, 28(1), 9-16.
lesk, m. (2004). understanding digital libraries (second ed.). san francisco, ca: morgan kaufman publishers.
levy, d. m. (2000). digital libraries and the problem of purpose. d-lib magazine, 6(1).
lossau, n. (2004). search engine technology and digital libraries: libraries need to discover the academic internet. d-lib magazine, 10(6).
lynch, c. (2002). digital collections, digital libraries and the digitization of cultural heritage information. first monday, 7(5).
lynch, c. (2005). where do we go from here? the next decade for digital libraries. d-lib magazine, 11(7/8).
mccray, a. t., & gallagher, m. e. (2001). principles for digital library development. communications of the acm, 44(5), 48-54.
moran, b. b. (1985). popular culture and library education. journal of education for library and information science, 26(1), 25-32.
nicholson, s. (2005). understanding the foundation: the state of generalist search education in library schools as related to the needs of expert searchers in medical libraries. journal of the medical library association, 93(1), 58-65.
pomerantz, j., wildemuth, b., fox, e. a., & yang, s. (2006). curriculum development for digital libraries. in proceedings of the 6th acm/ieee-cs joint conference on digital libraries (pp. 175-184). new york: association for computing machinery.
schwartz, c. (2000). digital libraries: an overview. journal of academic librarianship, 26(6), 385-394.
weech, t. l. (1994). the teaching of economics of information in schools of library and information science in the us: a preliminary analysis. in b. maxian (ed.), proceedings of the seventh annual meeting of the american society for information science, alexandria, va, 17-20 october 1994 (pp. 70-75). medford, nj: learned information, inc., for the american society for information science.
white, g. w. (2004). business information courses in lis programs: a content analysis. journal of business and finance librarianship, 10(2), 3-15.
witten, i. h., & bainbridge, d. (2003). how to build a digital library. san francisco, ca: morgan kaufman publishers.

appendix

citations for all books, journal articles, journals, and authors assigned in syllabi collected for this study are available at: .

copyright © 2006 jeffrey pomerantz, sanghee oh, seungwon yang, edward a. fox, and barbara m. wildemuth
doi:10.1045/november2006-pomerantz

where do we go from here? the next decade for digital libraries
d-lib magazine, july/august 2005, volume 11 number 7/8, issn 1082-9873
clifford lynch, coalition for networked information

the field of digital libraries has always been poorly defined, a "discipline" of amorphous borders and crossroads, but also of atavistic resonance and unreasonable inspiration. "digital libraries": this oxymoronic phrase has attracted dreamers and engineers, visionaries and entrepreneurs, a diversity of social scientists, lawyers, scientists and technicians. and even, ironically, librarians – though some would argue that digital libraries have very little to do with libraries as institutions or the practice of librarianship. others would argue that the issue of the future of libraries as social, cultural and community institutions, along with related questions about the character and treatment of what we have come to call "intellectual property" in our society, forms perhaps the most central of the core questions within the discipline of digital libraries – and that these questions are too important to be left to librarians, who should be seen as nothing more than one group among a broad array of stakeholders. it is debatable (in the true sense of the word) whether this very broad tent full of diverse interests and inherent contradictions will continue to be helpful going into the future, and my purpose here is to briefly consider some of the arguments that might be brought to such a debate, both in terms of where digital libraries have been and where they may choose to go in future. this issue of d-lib is intended to be a retrospective on the past decade of digital library efforts. there are many ways to think about this decade, and i'll mention some of them shortly. but before doing that, let us recognize that digital libraries did not simply emerge from nowhere in the mid-1990s. this is a field with an incredibly rich and, as yet, poorly chronicled pre-history and early history. there is a stream of work and ideas that reaches back to at least the turn of the 20th century, and includes such thinkers as h.g. wells and paul otlet; later contributors to the pre-history of visions of new, technologically-enabled means of knowledge organization, access and distribution also include vannevar bush and j.c.r.
licklider. (indeed, licklider's 1965 book libraries of the future might well be seen as marking one of the transition points between pre-history and the actual history of digital libraries.) up until the mid-1990s, as the internet made the jump from the research and education community into the broad public consciousness, we repeatedly encounter the utopian visions of a hitherto unimaginable cornucopia of knowledge and human creativity accessible worldwide through the network; while such aspirations were at least deflated if not largely shattered by the onslaught of commercial and government interests, the censors, snoops and copyright maximalists, they still run not far under the surface of the public mind, as witnessed by the reactions to google's recent announcement that it would digitize the contents of several major research libraries. the technical and engineering basis for digital libraries also reaches back several decades, to the 1960s, and includes on-line research and commercial information services, library automation systems, document structuring and manipulation systems, human-computer interface work and a wealth of other efforts. technologies like distributed search (for example, z39.50) were well established by the late 1980s; it is easy today to forget that kahn and cerf's seminal integrative paper "the digital library project volume 1: the world of knowbots" was written in 1987-1988. indeed, by the mid-1980s there were systems both in the commercial sector (consider lexis-nexis) and the research world (bruce schatz's telesophy, for example) that might reasonably be considered digital libraries at least by some definitions. very substantial digital library systems were developed prior to the world wide web. one way to characterize the period from about 1994-2004 is that it represents the first time that digital library research could really get substantial programmatic funding from the major research funding agencies in the united states (and abroad as well, though that is another story i can't cover here). the u.s. national science foundation, in collaboration variously with darpa, nasa, the national library of medicine and the national institutes of health, the library of congress, the national endowment for the humanities, the institute of museum and library services (and no doubt other entities that i have offended by omitting here) established two major competitive funding programs – the digital libraries initiative and dli-2 – through which researchers in higher education (along with partners beyond the higher education world) systematically engaged in the construction and analysis of digital library prototypes and research in both the underlying technologies and social implications surrounding these systems. this funding legitimized digital libraries as a field of research (including providing support for a cadre of graduate students). and it captured the attention of scientists, scholars, educators, political figures, and the general public as these investments were played out against the public discovery of the internet, the web, and all things digital and networked.
we should be clear that the new element was the programmatic funding and community creation: nsf, nih and other government agencies certainly invested in systems that we might today retrospectively characterize as early digital libraries at least as early as the 1960s (think of dialog and brs and chemical abstracts); what they didn't do was systematically create a funding program that dealt with digital libraries research in a general way or that strongly encouraged cross-disciplinary collaboration, including collaboration among computer and information scientists, engineers, librarians and social scientists. in 1992, darpa funded the computer science technical reports project, which involved five universities and the corporation for national research initiatives; while this was important in terms of digital library technology, it also had some explicit community-building goals, and might be viewed as seeding the larger-scale community building from the full-scale digital libraries initiatives that began a few years later. also, there was a huge amount of digital library activity happening by the 1990s in the commercial, government, higher education and cultural memory sectors, often with a more operational than research orientation, as well as in research projects that weren't part of the nsf-led initiatives but drew funding from other sources. ironically, particularly during the first half of the decade, one would be much more likely to find actual libraries involved in these other efforts – the lion's share of the nsf funding went to computer science groups, with libraries often being only peripherally involved, if at all. but the singular and really stunning accomplishment of nsf and its collaborator institutions during this period was to successfully form a community, convened physically through "all-hands meetings" and conferences like the joint acm/ieee conference on digital libraries (and its predecessors and international peers) or the imls web-wise meetings, and virtually through venues like d-lib magazine. other organizations helped to advance this community-building work as well; the contributions of the andrew w. mellon foundation and the coalition for networked information (my own organization) were certainly valuable here, for example. indeed, the investment that nsf led in community-building immediately benefited from all of the other investments in operational digital libraries, in related technologies, and in digital library research mentioned above; nsf and its collaborators wisely chose an inclusive rather than exclusionary posture in engaging the sectors of the digital library world beyond those that they were directly funding. the unprecedentedly diverse multi-disciplinary (and international) community which nsf and its collaborators assembled quickly generated enormous leverage in promoting the advancement, organization and dissemination of knowledge and ideas about digital libraries. the achievement here has been so significant, and so effective, in my view, that it demands careful consideration as strategies are developed in future for advancing other innovative multi-disciplinary scientific and engineering research initiatives. as of 2005, it seems a virtual certainty that substantial programmatic us government funding of digital libraries research in terms of the construction of prototype systems is at an end, at least for the near future.
the novelty of constructing digital libraries as a research end in itself has run its course; additionally, government budget contractions and shifts in funding priorities make it difficult to establish any new research initiatives. exceptions are mostly in areas like defense, intelligence and homeland security, and the digital library community has certainly repositioned projects and refocused research to respond to funding solicitations from these sectors. certainly relevant enabling digital library technologies can and will compete with other technologies involved in basic information systems and technology and computer science for funding under more general funding programs from nsf and other agencies. the investigation of the role of cultural memory institutions in the digital world and related policy problems in intellectual property never really got much funding to begin with, except in terms of some studies done by the national research council such as the digital dilemma report; it's hard to structure academic research around these questions, at least within a framework that is comfortable and comprehensible to science and engineering funding agencies. (individual speculation and deep reflection seem to be a better fit for grants from private foundations, such as the macarthur "genius" awards). the overall sense of a digital library community, at least for now, continues, supported by modest investments from imls, nsf and various other sources. where might the concepts, technologies, engineering know-how, and even issues developed by this amorphous "digital library" community over the past decade or more find other new and perhaps unexpected roles in the coming years? one obvious place is production systems, and commercial products that can be used to help construct and operate these production systems. the thinking about cyberinfrastructure and e-research in the united states and beyond clearly maps out a place for digital libraries as part of the infrastructure to support research, and maps out a set of research issues in the applications of digital library technology to support various scholarly, scientific and engineering disciplines. there are roles for social scientists as well as technical and disciplinary specialists in exploring these research issues. the move to e-research has also highlighted a related set of efforts that involve what is now being termed "data curation" or even "data science" – the management and curation of large, complex scientific information resources. the digital library community has much to contribute to this work. but a good deal of what will happen here i would characterize as advanced technology deployment in production systems rather than pure research, much like what takes place in the deployment of high performance computer communications networks to support the research and education community. it is work that combines production engineering and research in complex and delicate balances. beyond the e-research and cyberinfrastructure programs, we see a great deal of investment across the higher education, cultural memory, and government and commercial sectors in systems and services like digital asset management, digital collection creation and management, and institutional repositories. all of these use the technological tools of digital libraries, and many of them draw upon the social tools and insights as well. and search technologies of various kinds, both at the enterprise and internet-wide (e.g. 
google, yahoo, et al.) levels also draw heavily on digital library technologies. in a real sense, then, we can view digital libraries as offering a relatively mature set of tools, engineering approaches, and technologies that are now ready to be harnessed in the service of many organizations and many purposes. much of the further research will occur within the context of those organizations and purposes, but certainly within these contexts there's an enormous amount of research yet to be done, particularly when one includes the curation and preservation issues. more broadly speaking, digital preservation is going to be an enormous issue – a very fundamental societal problem at all levels from the nation-state to the individual. in my view, it's going to attract increasing commercial interest, as well as growing unease and concern from the general public, over the next decade. this is a hard area to do compelling research in: without the digital analogs of physical accelerated ageing test beds, most research is either about tools, about identifying approaches that don't work, or is highly speculative in nature – how do you prove that your approach in fact will preserve data for a thousand years without having to wait that long? (you can prove it will fail in much less time, of course!). digital libraries have made some contributions to this area, but limited ones. we now have national programs like ndiipp at the library of congress trying to deal with these areas on a more operational basis; the operational programs are deeply bound up with legal, economic and public policy conundrums. my feeling is that we need to fund this area with all the research dollars it can usefully employ – from a variety of sources, including the research agencies and operational programs like ndiipp – but that the amount that can be usefully expended on real research probably isn't terribly large. prototypes, pilot projects and operational system and service launches are another matter entirely, and are likely to be much more expensive – but they generally also need long-term funding, or some other kind of economic sustainability framework that comes into operation to support them if they are going to really accomplish much. note that there's a set of research questions about stewardship more broadly in the digital age; these are related to preservation but go far beyond preservation, and move into cultural, public policy, and ethical questions about how and what we remember and forget, about when and how it is appropriate to invest in ensuring the survival of memory. these questions are, in my view, of central importance, but they are not purely (or even, perhaps, predominantly) scientific and engineering questions, and are at best adjacent to the concerns that have characterized digital libraries, even in the context of preservation. yet we must not lose sight of them, even though it's unclear where the responsibility for engaging them lies other than in academia broadly, largely outside of existing funding frameworks. finally, there are numerous areas of research related to the historic interests of the digital library community that are at the crossroads of technology and social science and which will demand investment and attention in the coming years; many of these are natural extensions and elaborations of the collaborations initiated by the past decade of digital library research programs. i would hope to see some of them become part of specific research funding initiatives.
at the same time, however, recognize that many of them have characteristics that make them unattractive as traditional three- to five-year funded research programs. let me mention just a handful of areas here that i find particularly compelling.

personal information management. as more and more of the activities in our lives are captured, represented and stored in digital form, the questions of how we organize, manage, share, and preserve these digital representations will become increasingly crucial. among the trends lending urgency to this research area are the development of digital medical records (in the broadest sense), e-portfolios in the education environment, the overall shift of communications to email, and the amassing of very large personal collections of digital content (text, images, video, sound recordings, etc.).

long-term relationships between humans and information collections and systems. this is related to personal information management, but also considers evolutionary characteristics of behavior, systems that learn, personalization, system-to-system migration across generations of technologies, and similar questions. this is connected to human-computer interface studies and also to studies of how individuals and groups seek, discover, use and share information, but goes beyond the typical concerns of both to take a very long time horizon perspective.

role of digital libraries, digital collections and other information services in supporting teaching, learning, and human development. the analysis here needs to be done not on a relatively transactional basis (i.e. how can a given system support achievement of a specific curricular goal in seventh grade mathematics) but in terms of how information resources and services can be partners over development and learning that spans an entire human lifetime, from early childhood to old age.

active environments for computer-supported collaborative work offer the starting point for another research program. these environments are called for, under the term "collaboratories", by the various cyberinfrastructure and e-science programs, but have much more general applicability for collaboration and social interactions. from one perspective, these environments are natural extensions of digital library environments, but at least some sectors of the digital library community have always found active work environments to be an uncomfortable fit with the rather passive tradition of libraries; perhaps here the baggage of "digital libraries" as the disciplinary frame is less than helpful. but there is a rich research agenda that connects literatures and evidence with authoring, analysis and re-use in a much more comprehensive way than we have done to date; this would consider, for example, the interactions between the practices of scholarly authoring and communication on one hand, and on the other, the shifting practices of scholarship that are being recognized and accelerated by investments in e-science and e-research.

perhaps the overarching theme here, and it is one that may point to a major direction for research that follows on the last decade of progress in digital libraries, is connecting and integrating digital libraries with broader individual, group and societal activities, and doing this across meaningful time horizons that recognize digital libraries and related constructs as an integral and permanent part of the evolving information environment.
the next decade for digital libraries may well be characterized most profoundly by the transition from technologies and prototypes to the ubiquitous, immersive, and pervasive deployment of digital library technologies and services in the broader information and information technology landscape.

copyright © 2005 clifford lynch
doi:10.1045/july2005-lynch

indigenous cultural and intellectual property rights: a digital library context
d-lib magazine, may 2002, volume 8 number 5, issn 1082-9873
robert sullivan, te tumu herenga / the university of auckland library, aotearoa / new zealand <[email protected]>

introduction

recently a watershed moment occurred in the world of intellectual and cultural property rights in aotearoa, new zealand, when the "toi iho" trademark [1] was launched at the auckland city art gallery. elders and leaders of the art world and of the tangata whenua — the indigenous local tribe — gathered to celebrate the physical reality of an idea/passion/signifier that had been discussed in various guises by maori for decades. the toi iho trademark asserts authenticity in the creative arts and provides a cultural context for works that have a maori lineage or whakapapa. the toi iho trademark signifies an ethos of ownership, respect, and active engagement with the maori people from which the culture sprang. within this context, i begin a discussion about the digitization of the creative works of our ancestors. works cauled in the times of the gods — when tane separated his parents, the heavens and the earth [2], when ruaumoko [3], the foetal earthquake-god, kicked the belly of the earth mother to create the ravines and mountains of aotearoa / new zealand, and when maui [4] hauled his great fish — the north island — out of the domain of tangaroa, the ocean. this created wisdom has been handed down by the ancestors since kupe [5] first discovered aotearoa near the end of the first millennium.

digitizing cultural materials

anything can be digitized: any story, legend, map, chart, blueprint, or equation. any storyteller recorded in video or sound format can be transformed into a digital rendition for access on local or global networks. when digitizing cultural materials, the important questions are: how do we send a message that strengthens the holistic context of each cultural item and collection? how do we ensure that both indigenous and non-indigenous peoples receive the message? how do we digitize material taking into account its metaphysical as well as its digital life? in august 2001, i was fortunate to participate in the hilo, hawaii, meeting on "digital collectives in indigenous cultures and communities" [6]. this collaboration of indigenous, technical, financial, and library experts created a vision that needs to be embraced and driven by indigenous communities themselves: "building a global indigenous library" (a suggested project in the hilo meeting report). various technical digital library models were articulated at the meeting. it was agreed that the challenge in building a successful indigenous digital library model is winning the trust of the people the library aims to serve.
trust is won through the governance and administration of the digital library, and the way that flow-on economic benefits are distributed to the people providing the content. for trust to exist there must be a mutual ethic of reciprocity. many communities want training and employment opportunities. building a global digital library requires first people and infrastructure, and then content. various technical protocols and standards must be met to ensure that the resource is accessible — and accessible in the manner intended. dealing with these technical issues alone provides an enormous employment opportunity. consequently, there is potential for material returns from sharing the cultures of indigenous communities for the benefit of the world. a necessary component of dl infrastructure is the equipment to deliver and receive information. many communities have narrow-width access to the internet — if they have telecommunications or computers at all. how to resource communities so that they can access the world wide web is an issue for everyone — an issue with even more resonance for indigenous groups. dl infrastructure also includes administrative structures. in the d-lib magazine "special issue on digital technology and indigenous communities" [7], professor loriene roy articulates a model of governance that could be transferred to the indigenous digital library (idl). suffice it to say, governance is an important issue that will impact the idl's effectiveness in working with communities to gather information. indigenous cultural and intellectual property (icip) management must be articulated from the start of any idl project. barambah and kukoyi [8] advocate the development of cultural protocols that overcome legislative deficiencies and difficulties. a key legislative deficiency affecting idl projects, in particular, is the temporary and individualistic protection that copyright offers to creators of icip. copyright expires after a defined term. copyright is assigned to individuals. therefore, the collective nature and enduring guardianship — care, development and preservation — with which indigenous communities imbue their cultural and intellectual property, cannot be addressed by copyright alone. the new zealand initiative of the toi iho trademark ameliorates this situation for the maori artistic community. a similar protocol initiative, to set up systems and procedures to ensure that local indigenous customs are maintained in regard to their information, would ensure the integrity of idls. the new zealand initiative also enables cross-cultural partnerships. the toi iho™ maori made trademark has two companion trademarks: toi iho™ mainly maori and toi iho™ maori co-production. the first two trademarks cannot be used by businesses, only by individual artists. however, the toi iho™ maori co-production mark can include use by business entities. the trademark can also be used to authenticate exhibitions and may provide an interesting application for digital repositories. the protocols articulated by indigenous communities serve many purposes. they satisfy communities that their information will be contextualized in a manner that acknowledges and maintains everything the communities hold to be significant. protocols perform an educating function for the library and information community. thus, they potentially reduce infringement in other information spheres. the australian aboriginal article by barambah et al.
[8] raises issues of relevance to new zealand maori: "the issues involved include: who can speak for what? who has the authority for what? whose custom? whose heritage? whose culture? and whose identity? all these questions are extremely important." (p.33) these issues are also transferable to the global indigenous sphere.

authenticating material already held by institutions of memory

in the maori context, and indeed the polynesian context, much information has already been collected and systematically catalogued in institutions of memory such as the alexander turnbull library (http://www.natlib.govt.nz/en/collections/turnbull/index.html). most of this information has been gathered using nineteenth century methods — some involving payments, which encouraged false information to be given and published [9]. if a digital collection were to be created using such material, out of legal copyright, the digitizing institution would have the responsibility of placing the material in its appropriate context, by liaising with the communities where the information originated. one such case involves the department for courts of new zealand, which holds copies of the maori land court minute books of evidence [10] given to establish legal title to most of the maori land in new zealand. "maori are compelled by statute to deal with the maori land court if they want to transact business over their land" (p.11). the historical purpose of the court was to prepare the land for purchase by settlers. some of the evidence is contentious, and covers historical battles and family feuds. the department for courts is completing a project involving the digitization of all the minute books, covering 12.2 million pages (and growing 5 percent annually). prior to the project, a series of consultative meetings were held throughout the country. from these meetings, a number of principles emerged:
the information must be unable to be changed or altered.
sacred, genealogical information should only be accessed by individuals after they have consulted with the relevant tribes.
institutions of social memory must be informed that genealogical information contained in copies of the land court records that they hold is restricted information.
it must be ensured that sacred information "…is not used in a manner contrary to maori cultural values, or for commercial purposes".
maori assert ownership of the record.
the principal recommendation of the report is that a group be established, with representation from interested tribes, to formulate policies on access to the record — in both documentary and digital formats.

authenticating contemporary material

the process for digitizing contemporary cultural information from indigenous communities is less complex than for older materials, and has been amply covered by the "digital collectives" collaboration report [11]. some of the guiding principles from the report pertaining to icip invite digitizing groups to:
affirm indigenous communities as equal partners in future collaborations.
uphold cultural and intellectual property rights of communities.
ensure cultural integrity.
interpret, analyze, and synthesize information for general audiences.
require that "digital libraries should be developed and controlled by indigenous peoples and self-determined" (p.6).
understand the "importance of community-based guides [to digitization] that express [sic] tribal values" (p.8).
two other informational documents pertaining to icip are: "the mataatua declaration on cultural and intellectual property rights of indigenous peoples" [12], ratified by over 150 indigenous representatives from 60 un member states [13], and the draft declaration on the rights of indigenous peoples. the former document asserts cultural ownership of indigenous knowledge — this includes development, promotion and protection, as well as that content creators must be first beneficiaries. articles 12 and 29 of the 1993 draft declaration on the rights of indigenous peoples [14] (un commission on human rights) also highlight the need for "full ownership, control and protection" of icip. the first international indigenous librarians' forum proceedings [15] endorses both the mataatua declaration and the un draft declaration.

conclusion

a cornerstone of an indigenous digital library is that the indigenous communities themselves control the rights management of their cultural intellectual property. local cultural protocols need to be documented and followed prior to the creation of digital content, and communities must be consulted with regard to the digitization of content already gathered by institutions of social memory. as noted in the hilo meeting report, indigenous leaders should gather to plan and confirm the path ahead.

notes and references

[1] toi iho, a registered trademark of authenticity and quality for maori arts and crafts. <http://www.toiiho.com/> (last accessed may 8, 2002).
[2] national library of new zealand, (last accessed may 8, 2002). this site opens with a summary of the story.
[3] weka, hana. ruamoko, (last accessed may 8, 2002).
[4] orbell, margaret. the illustrated encyclopedia of maori myth and legend. (christchurch: canterbury university press, 1995). 114-117.
[5] ibid. 92-94.
[6] digital collectives in indigenous cultures and communities meeting, hilo, hawaii, august 10-12, 2001, (last accessed may 8, 2002).
[7] special issue on digital technology and indigenous communities, d-lib magazine, march 2002, (last accessed may 8, 2002).
[8] barambah, maroochy and ade kukoyi, "protocols for the use of indigenous cultural material" in going digital 2000: legal issues for e-commerce, software and the internet, (nsw, australia: prospect media, 2000) anne fitzgerald ed.
[9] walker, ranginui. "intellectual property" in nga pepa o ranginui, (auckland: penguin, 1996).
[10] department for courts, information management team. "maori land court information management team report: access to and archiving of maori land court records after imaging". 28 july 1999. for a guide to the maori land court, see and (last accessed may 8, 2002).
[11] holland, maurita and digital collectives in indigenous cultures and communities. we come from around the world and share similar visions! (ann arbor: school of information, university of michigan, 2002).
[12] "appendix e," the mataatua declaration on cultural and intellectual property rights of indigenous peoples (last accessed may 8, 2002).
[13] biodiversity and maori (wellington: te puni kokiri, 1994), 17.
[14] draft declaration on the rights of indigenous peoples, (last accessed may 8, 2002).
[15] sullivan, robert. ed. proceedings (auckland: international indigenous librarians forum, 2001).

further reading

cultural and intellectual property rights: economics, politics and colonization. vol. 2. (auckland: moko productions/iri, 1997).
copyright © robert sullivan
doi: 10.1045/may2002-sullivan

participatory cultural heritage: a tale of two institutions' use of social media
d-lib magazine, march/april 2014, volume 20, number 3/4
chern li liew, victoria university of wellington, new zealand, chernli.liew@vuw.ac.nz
doi:10.1045/march2014-liew

abstract

the purpose of this study is to examine how and to what extent cultural heritage institutions (chis) are currently using social media to create a culture of participation around their digital collections and services. an environmental scan of new zealand chis with a social media initiative was conducted, and four cases with considerable activities, participatory communication and user-generated contents were investigated. two of these case studies are reported in this paper. the two sites were chosen, first, on the basis of their having levels of participatory activity significant enough to merit in-depth analyses; and, second, on their ability to provide contrasting examples of different approaches and practices. the purpose of the comparison is to highlight the different nature and extent of participatory culture and user generated/contributed contents. while one of the sites belongs to a major national institution, the other represents a regional, community-level initiative. further, while one site employs a self-hosted web 2.0 platform, the other utilises a third-party platform. finally, while one is aimed primarily at displaying and promoting images from an archival collection while enabling user commenting, the other actively seeks contributions to share and co-construct local history stories.

1.0 introduction

the ability of an individual or a community to comment on, create, upload and share digital cultural content demonstrates a growing demand for creative expression, the exploration of identity, and cultural participation. a growing number of chis including archives, libraries and museums have responded to the challenge of providing authentic and authoritative information within an increasingly participatory online environment. many chis now manage sites where cultural heritage contents are examined and where users and communities of interest collaborate in the 'making of meaning' and co-construction of memories (terras, 2011; freeman, 2010; falk, 2006; gurian, 2006), rather than places where cultural authority is asserted. these projects represent a shift in how chis act as trusted cultural heritage guardians and facilitate access to cultural contents. according to russo et al. (2008), "the challenges that social media bring demonstrate an enhancement of the traditional one-to-many information transfer model with a more genuinely interactive many-to-many communication model, in which institutions use their own voice and authority to encourage participatory communication with individuals and communities of interest or practice." (p.28).
by using social media as part of their curatorial practice or communication with users and stakeholders, chis can open up their previously 'guarded' collections and communications to privilege engagement and participation by users and communities of interest. sharing cultural heritage contents through web 2.0 spaces expands opportunities for institutions and their communities of interest to actively use and reuse these contents. it also provides opportunities for counteracting the silo effect of limiting access to these contents to institutional websites and repositories (zorich, waibel and erway, 2008) and for building cross-institutional collections (palmer, zavalina and fenlon, 2010). within a few years, a number of pioneering chi web 2.0 projects were underway, accompanied by a growing body of professional and academic literature that both documented and advocated on behalf of such initiatives (van den akker, et al., 2011; cocciolo, 2010; nogueira, 2010; daines and nimer, 2009; samouelian, 2009; huvila, 2008; krause and yakel, 2007; chad and miller, 2005). web 2.0 implementation has become sufficiently well established to require attentive consideration from all who work within the cultural heritage field. however, despite its current high profile, it is important to remember that social media use by chis is still only within its first decade. most effort consequently remains highly exploratory in character. the outlines of the prevailing forms it will acquire and the overall impact it will have on the profession are yet to settle into definitive patterns (theimer, 2011). theimer alludes to the fluid and experimental nature of current activity within this field and to the necessity of chis undergoing a learning process in order to find out how their institution might best adjust to the new opportunities and challenges represented by web 2.0. palmer and stevenson (2011, p.2) describe the present moment as "a period of flux" in which chi professionals are faced with "discovering new ways of engag[ing] with end-users and [working within] new spaces"; of "negotiating issues of control, authority, voice, and trust"; of "defining and targeting an audience, establishing clear objectives, measuring successes, and delineating the personal from the professional".

2.0 social media and participatory culture

if there is one word that highlights the particular quality of social media, it would probably be 'participation'. unlike the mass media before it, social media is fundamentally designed as a participative medium. cultural theorist jenkins (2006) observes that with the emergence of web 2.0, a paradigm shift has occurred in the way media content is produced, with audiences empowered to participate in the culture. one of the consequences of the shifts in media paradigms from the 20th century 'packaged' media to the 21st century 'conversational' media is that notions of authorship, creativity and collaboration have become part of everyday culture rather than remaining in the hands of the authoritative institutions. this changes the culture into a participatory one where ordinary citizens express themselves and share their opinions with others. for some critics like jenkins (2006), social media is part of the rise of participatory culture which empowers users to generate and produce content, moving from the mode of action characteristic of 'audiencing' to the mode characteristic of producing.
bruns (2009) believes users move between these modes and calls these users the 'produsers', i.e. users playing an active role in producing contents rather than just interpreting them. the main picture drawn is that web 2.0 has become a medium for sharing and for conversations, rather than for dissemination and control. of all the opportunities made possible by social media, perhaps the most advantageous to chis is, indeed, the ability to foster participant engagement between an institution and its users and communities of interest. some have likened the potential transformation to "a transition from acropolis — that inaccessible treasury on the fortified hill — to agora, a marketplace of ideas offering space for conversation, a forum for civic engagement and debate, and opportunity for a variety of encounters" (proctor, 2010). it is noteworthy that for some, social media is part of broader structural affordances of a capitalist economy (andrejevic, 2011) in which users' free labour is exploited for the benefit of the corporations (andrejevic, 2011; lovink, 2012). however, it is beyond the scope of this paper to engage in a philosophical discussion about social media and the complex social processes it engages with. the concept of participatory culture is being used here as a way of distinguishing the focus of the case studies from the more traditional modes of participation that took place 'behind closed doors' (e.g. prototyping new projects with focus groups and consultation with relevant community groups when designing an exhibit).

2.1 rationale for participatory culture in chis

preserving a community's cultural heritage has been among the responsibilities of chis. with the emergence of various web 2.0 applications, many are now facing a demand, in the era of participatory culture, for members of the community to be involved. community members expect to be able to participate through, for instance, sharing and contributing documents and contextual information (e.g. photos, personal diary entries) that have the potential to enrich and add value to the histories, or through participating in the co-curation of memory collections. a number of chis in new zealand have responded to this demand, and it is the aim of this paper to present two of the case studies conducted to highlight current practices. enthusiastic digitisation and 'curating' of information by amateurs can be a potentially rich source of cultural heritage content that documents areas not covered by traditional chis. these amateur collections might form useful complements to institutional collections. linking a stand-alone chi website into websites such as flickr, which have an in-built audience and a platform that encourages the general public to contribute relevant materials to institutional digital collections, may provide a way to increase the use of digitised heritage content. terras (2011) discusses examples of how flickr is used as a platform for generating amateur cultural heritage content. by acknowledging and integrating such user-contributed contents into their own collections, chis may be able not only to invigorate their online presence and extend the use of their collections, but also to enrich their heritage collections. three non-exclusive categories of motivation have been generally identified. the first relates to the perception that implementing some form of web 2.0 was a practical "business" necessity, given users' and stakeholders' expectations within the present operating environment (daines and nimer, 2009; chad and miller, 2005).
the second involves the belief that participatory platforms will help further the pursuit of core cultural heritage goals by creating wider educational opportunities and strengthening the information base through leveraging users' knowledge (oomen and aroyo, 2011; proctor, 2010). the third category of aims has the potential to be transformational and relates to the idea that participatory platforms will enable chis to move beyond the seemingly 'elitist' aspects of their traditional practice (flinn, 2010).   3.0 research design by drawing on analyses of activities documented on the chis' social media platforms (including user-generated and contributed content), this study investigates the extent of participatory culture in chis. the chosen approach takes its cue from an observation by yakel (2011), who argues that in order to gain a better understanding of "the dynamics of peer production" actually occurring on these sites, there is a need for "[m]ore research showing the nature of use, the volume of the comments, and the resulting conversational threads" (p.97). this might be contrasted with the prevailing character of current research in this field, which tends to consist of either general surveys (samouelian, 2009) or practitioners' reports focused on the process of implementation (cocciolo, 2010). while such studies may provide examples of user-generated content and some indication of its volume, they usually do not subject that content to systematic analysis. as yakel (2011) indicates, undertaking an in-depth analysis of activities and content generated via social media can help provide insights into the specific character and extent of user participation and interaction actually occurring on a site. that in turn may help point to the steps that still need to be taken in order to better encourage the emergence of the kind of sustainable online community engaged around heritage content that has often been posited as an objective for initiatives of this kind. one of the main points of focus in the content analyses undertaken in this study, therefore, is the extent to which each chi social media initiative might be seen to be making progress towards this goal of developing a sustainable community of users engaged in active participation and interaction with each other around the cultural content.   3.1 selection of case studies an environmental scan of web 2.0 implementation by new zealand chis was undertaken, with the focus on social media applications specifically used in conjunction with the presentation and co-creation of heritage content. this is to distinguish them from the general, popular use of web 2.0 by chis, such as institutional blogs, facebook pages and twitter feeds, which are largely about presenting news and events, discussing professional activity and promotion. in other words, they are generally not used for the presentation and discussion of heritage content. these kinds of web 2.0 applications have therefore been excluded from the content analysis. the criteria used to select the case study institutions included a variety of factors, most notably that the institutions have (i) a significant number of heritage items that are web 2.0-enabled and (ii) a significant amount of user activity (i.e. evidence that a significant number of different users had posted comments on a significant number of different items). the appendix lists the chis examined in the preliminary stage of the study. 
in the second stage, four in-depth case studies were conducted: nlnz on the commons (nlnz commons), the prow, nz history online and kete horowhenua (kh). the cases were selected on the basis that they showed sufficient evidence of user participation via web 2.0 applications to merit further investigation. in this paper, two of these case studies, representing two different types of initiative, are discussed. the two sites were chosen, first, on the basis of their having levels of participatory activity significant enough to merit investigation and, second, for their ability to provide contrasting examples of different approaches and practices. the purpose of the comparison is to highlight the different approaches taken and the nature and extent of participatory culture and user-generated/contributed content. while one of the sites (nlnz commons) belongs to a major national institution, the other (kh) represents a regional, community-level initiative. further, while one site employs a self-hosted web 2.0 platform (kh), the other utilises a third-party platform (nlnz commons). finally, while one is aimed primarily at displaying and promoting images from an archival collection while enabling user commenting (nlnz commons), the other actively seeks contributions to share and co-construct local history stories (kh).   3.2 content analysis process the following types of user-generated and contributed content were considered: (i) situations where users were able to upload and share their own heritage-related images, stories, etc. on a site, and (ii) situations where users were able to post comments about a heritage-related item on the site. the content was subjected to a detailed analysis which included coding posts in terms of various categories (e.g. those that asked a question, those that offered an appreciation, those that added (contextual) information, those that added a personal reminiscence, those relating to family history enquiries, those related to school projects, etc.). site administrators' comments were also analysed and coded in a similar fashion (e.g. those responding to a question, those providing a response to an appreciation or reminiscence, etc.). (an illustrative sketch of this coding procedure is provided at the end of section 4.) patterns of inter-relationship between posts were also analysed in order to identify any signs of on-going interaction or sustained discussion between users, or between users and administrators, that might be interpreted as evidence of participatory communication. where available, supporting documentation such as policy statements and published writings was also examined to gain insight into each chi's rationale, underlying motivations and approach to its social media practices.   4.0 case studies   4.1 nlnz on the commons background and character of site the national library of new zealand (nlnz), which incorporates the alexander turnbull library heritage collections, is one of new zealand's pre-eminent chis. while the nation's other two main state heritage institutions, the museum of new zealand te papa tongarewa and archives new zealand, have made some use of social media, this has been largely confined to formats like blogs, facebook pages and twitter feeds. the focus is mainly on event promotion and professional activities, rather than on providing access to their cultural heritage collections. 
nlnz is, as yet, the only national chi to display a significant amount of its heritage material within a web 2.0 format that permits users to directly annotate and comment on collection items. notably, it has chosen to do so via a third-party platform, in this case the photo-sharing website flickr. this is a significant point of difference between it and the other case study (kh) examined in this paper. it is also in contrast with its trans-tasman counterpart, the national library of australia's trove, which supports a number of web 2.0 features on its own platform. the platform nlnz specifically makes use of is the commons, a subsection of flickr intended to deal with the particular needs of pictures sourced from heritage repositories. launched in 2008 as a pilot project with the library of congress, the commons is based around the idea of using a special "no known copyright restrictions" license to facilitate the release of chi collection material online. the site's own homepage describes the commons project as having two principal objectives. the first is to show users the "hidden treasures in the world's public photography archives"; the second, to demonstrate how their "input and knowledge can help make these collections even richer." (the commons on flickr) while the former goal is primarily about providing better access to these images, the latter situates the commons within an interactive crowdsourcing model. a flickr blog post celebrating the 4th anniversary of the launch of the commons also highlights its participatory culture potential by noting how "[t]here have been instances where flickr members contributed context and story-telling around a photo which was then verified by the institutions and even added to the official records of that photo." (miller, 2012) published reports from chi participants in the commons have been generally enthusiastic, especially with regard to the benefits of being able to access the enormous traffic generated by "the global reach of flickr and its active international user base" (chan, 2008). many individual examples have also been cited of users productively engaging with chi content via the commons (bernstein, 2008). there is, as yet, however, little independent research on the overall extent, character and quality of the participatory activity on the site. one preliminary study suggests that this may currently be quite small in relation to the site's overall numbers. out of a sample of 106,352 images, henman (2011) found that only 3,218 photos (3%) had comments attached, only 236 (0.22%) had notes, and only 17% of tags had been added by users not affiliated with the contributing chi. a more widely expressed concern is the risk involved in chis entrusting their social media presence to commercial vendors driven by business motives, who, consequently, might not be around long-term (tennant, 2010). flickr's own statistics indicate that, as of december 2013, the commons had over 200,000 images contributed by 56 participating chis, with more than 130,000 comments added by users. figure 1: the national library of new zealand on the commons nlnz became a partner in the commons in november 2008, although it had been trialling the use of a flickr photostream for about a year prior to that (johnston, 2008). as of december 2013, nlnz on the commons contained 26 sets of photos (figure 1). a "set" is a themed grouping of images with either the same creator, subject matter, location or medium in common. 
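as an aside for readers who may wish to replicate the kind of per-set image tallies reported in the next paragraph, figures of this sort can, in principle, be gathered programmatically from flickr's public api rather than counted by hand. the following is a minimal python sketch and is not part of this study's actual method; the api key and account identifier are placeholders, and the json field names are assumptions based on flickr's documented response format.

```python
# illustrative only: tallying the number of photos in each set of a flickr
# photostream via the public flickr api. the api key and user nsid are
# placeholders, and the json field names ("photosets", "photoset", "photos",
# "title"/"_content") are assumed from flickr's documented response format.
import requests

API_URL = "https://api.flickr.com/services/rest/"
API_KEY = "YOUR_FLICKR_API_KEY"      # placeholder
USER_NSID = "TARGET_ACCOUNT_NSID"    # placeholder, e.g. the account being studied


def list_sets(api_key: str, user_nsid: str) -> list[dict]:
    """return the photosets (with per-set photo counts) for a flickr account."""
    params = {
        "method": "flickr.photosets.getList",
        "api_key": api_key,
        "user_id": user_nsid,
        "format": "json",
        "nojsoncallback": 1,
    }
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["photosets"]["photoset"]


if __name__ == "__main__":
    total = 0
    for photoset in list_sets(API_KEY, USER_NSID):
        count = int(photoset["photos"])
        total += count
        print(f"{photoset['title']['_content']}: {count} photos")
    # a photo can belong to more than one set, so summing per-set counts
    # may overstate the true number of distinct images.
    print(f"estimated total across sets (duplicates not excluded): {total}")
```

a script along these lines would be subject to the same caveat noted below: because an image can appear in more than one set, the summed total does not preclude duplicates.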
of the nlnz sets on the commons at the time this research was conducted, three were related to printed ephemera, four to books, and the remaining nineteen were photograph collections. most of these sets contained between 20 and 50 pictures, with the smallest set containing 7 photos and the largest, 104 images. no figure was provided for the overall number of images in nlnz's flickr photostream; however, a tally of the numbers for individual sets (which does not preclude duplicates) gave an estimated total of around 800. some useful insights into the motivations and expectations behind nlnz's decision to join the commons are provided in a 2008 nlnz blog post written by its then web manager (johnston, 2008). this identifies the two-fold purpose behind its year-long trialling of a flickr photostream as being, first, to "[a]ttract users who do not know about our collections, or haven't thought about visiting our subsites"; and, second, to "[o]bserve the tagging and commenting behaviours" of users. the latter objective was to see what issues or benefits might arise for nlnz from enabling these forms of participation. johnston observed that while visitation numbers had been pleasingly high for this pilot photostream, there had been much less user input than hoped for. she also noted that most of the comments posted had been "of the 'great photo' variety", with a small number providing more information about an image, and an even smaller number asking questions. statistics also showed that, in terms of attracting traffic from flickr back to nlnz's own websites, there had been a click-through ratio of one click per collection image. however, as johnston rightly noted, the lack as yet of any established benchmarks for appraising chi social media use makes it difficult to judge whether the above figures represent a reasonable degree of success or not. overall, johnston (2008) considered flickr "a good way of dipping a toe in the social media water", especially in terms of the relatively low amount of "time and energy" nlnz was required to invest in it. recent events appear to confirm the largely exploratory approach nlnz has taken to its participation in flickr so far. sometime around may 2012, nlnz reduced the number of sets on its flickr photostream from 26 to 10, and the corresponding number of images from around 800 to around 200. in response to an enquiry as to why this was done,1 nlnz explained that it was the result of the organization's decision not to renew, at that time, its flickr pro account. without this, flickr members are restricted to having no more than 200 images on view at a time. nlnz stressed, however, that this was only a temporary state of affairs and did not entail the loss of any user-generated content previously posted on the site. the reason behind nlnz's decision to let its flickr pro subscription temporarily lapse was given as the organization's wish for time out in order to develop a better "sense of how we wanted to use the site, and how we could make it an active part of our interaction with the public." it was felt that nlnz on the commons suffered from a "lack of a clear direction, and sense of how [it] would be supported (both with fresh items and dedicated community engagement)". what was required therefore was "time to build a proper social media/web service plan". it was also noted that this situation had been compounded by nlnz having, at that time, had its hands full with both a major refurbishment of its premises and the launch of a new website. 
this severely limited the time and energy it had been able to provide for its flickr sets. it was nlnz's belief that it would be better able to achieve its "overarching goal" of "connect[ing] new zealanders with the collections (and services) that matter to them" if it took time to identify how its presence on the commons could play "a defined role, with fresh and regular uploads, genuine engagement with other users, and dedicated support inside the library."1 it appears (as of december 2013) that nlnz has reinstated the 26 sets of photos. provision for user-generated/contributed content adding comments about an image, which others can respond to, is one of the principal ways users can participate within nlnz on the commons. they can also select an image as a favourite, which bookmarks it on their own photostream, enabling any visitors there to also view it. users can 'tag' a picture by adding a keyword of some kind to improve its findability. they can add 'notes' that appear directly on a particular section of an image whenever it is hovered over. while users cannot directly contribute their own images to the nlnz sets, they can add pictures via the comments application. this opens up the possibility of users posting photographs that relate in some way to the image uploaded by nlnz, or that provide examples of a mash-up or re-use of that picture. another form of re-use available to users is to set up a separate themed "group" photostream. this enables users to re-present images found elsewhere on flickr in terms of a particular interest. in order to do this, the administrator of a group needs to post a comment on a picture requesting nlnz's permission to re-post it within their group. overview of user-generated/contributed content an analysis of comments for all 26 sets showed that most were no more than a sentence in length and had been posted within two months of the item being uploaded. examination of the content of these comments revealed that the majority were simply very short appreciations of the image on display. typical examples were: "excellent", "stunning", "awesome!!!", "wow!", "nice!", "perfect shot!" a number of cases were also noted where the same user had posted the same phrase (e.g. "great shot!") on several different photos in several different sets, seemingly deploying it as a kind of standard signature of appreciation. brief comments or notes of a jocular nature were also common (e.g. in relation to a set of antarctic exploration photographs: "brrrrr!", "i love those mittens!"). these types of posts might all be categorised as what chan (2008) refers to as "social commenting". chan describes this as "the 'social glue' that binds the communities that play the 'flickr game' together." according to chan, social commenting is "really about leaving a linkback-ed mark of a visit", a networked way of saying "i was here". by posting these comments, users not only express their personal enthusiasm for an image, they also draw attention to their own photostreams and provide other users with a link to them. in some cases, the self-promotional aspect of these kinds of posts is quite overt, with users including large (and otherwise irrelevant) images within an item's comments thread that advertise their own photostreams. many of the appreciations posted appeared to come from photography enthusiasts and were directed at the technical and aesthetic quality of the image ("wonderful portrait with so great light and sharpness!", "this is stunning. classical, reserved and intense."). 
a few of these even seemed to address nlnz as if it were itself the photographer ("congratulations! this is a wonderful shot!"). others, however, showed an awareness of the special character of this photostream as a means of providing access to items held within a chi collection. several of these users praised nlnz for making its heritage images available in this way: "i really like what the national library has done with flickr keep more pics coming, good work"; "great photos! thank you national library nz for your progressive attitude on sharing photos with the world." also evident was the strongly international character of the comments, with many users identifying themselves as not being from new zealand. a number of posts were also in non-english languages (a few were in chinese or arabic). another very prevalent form of post was the formal request to include one of nlnz's images elsewhere on flickr as part of a themed "group". these requests came in a very standardized form ("hi, i'm an admin for a group called [xxx], and we'd love to have this added to our group!") and appeared to be routinely added to the comments threads for all potential items of interest to that group. while a minority overall, a significant number of comments provided additional information about an item. in some cases these offered further context for a picture by describing some personal connection to it. a set on the 1915 gallipoli campaign, for instance, received comments from users briefly recounting an ancestor's service there, or their own recent visit to that area. a few users also posted photographs in the comments thread for a picture that showed something in that historical image as it appeared today (e.g. a hill in gallipoli, a pond in new zealand, an antarctic expedition ship). one example of creative re-use of an image was found where a user had posted their own colourized version of a black-and-white portrait of an antarctic explorer that nlnz had uploaded. a few contributions drew on a user's own expertise in a particular area to add information not otherwise given with an image. one user, for instance, provided valuable background about the german colonial circumstances in which a group of samoan photographs had been taken; without this, much of the significance of these images was lost. he also responded to another user's question posted on this thread several months later, indicating that he had maintained an ongoing interest in comments users made about these pictures. another contributor usefully cut and pasted excerpts from various online sources as a way of adding more information to items. interestingly, this included information drawn from nlnz's own websites which nlnz itself had not supplied for the flickr versions of those images. a good example of several users contributing information about a topic and interacting with one another around it was found with a set of photos depicting merchant and naval ships. a sign of the potential benefits to be had from tapping into the knowledge of a specialist interest community, these threads showed users adding details absent from nlnz's own metadata about the possible identity, location, or itinerary of the vessels shown. in some cases, this entailed responding or adding to suggestions offered by others (a relatively rare instance of exchanges between users on this site). these examples offer evidence of the potential for a form of cultural heritage-related participatory culture to develop on nlnz on the commons. 
such instances, however, were comparatively rare and widely dispersed throughout the site. nlnz does maintain an institutional presence on flickr by posting comments in response to those added by users. however, these posts are relatively rare. most were added by nlnz's then web master, who personally signed her posts. written in an informal, chatty style characteristic of social media exchanges, these responses largely consist of conveying thanks and encouragement to users for their comments: "you're welcome [xxx] glad you're enjoying this photo"; "aww, shucks, [xxx] thanks!" in some cases, the opportunity has been taken to re-direct traffic back to nlnz's home sites by letting appreciative users know they can view a larger image and access more information there. particular care appears to have been taken to acknowledge the efforts of users who have added information of some kind: "wow! thanks for bringing these images together ... ". nevertheless, only a minority of contributors have been responded to. similarly, while only a few users asked questions or suggested corrections to an item's metadata, many of these received no reply. even for those that did, there was little evidence of significant curatorial input or of user-contributed information being incorporated into nlnz's own records. in one case where a user questioned the date provided for a picture, they were told that this would be referred to curatorial staff. it was only seven months later, after another user repeated the same question, that a reply was finally posted acknowledging that the date had, in fact, been incorrect (although no thanks were offered to the users for pointing this out). a subsequent post from another user observed that, given this date, the photographer to whom nlnz was still attributing the picture must therefore also be wrong. ten months on, that comment had not received a response. analysis in many ways nlnz's participation in the commons can be judged a successful example of a chi's use of a web 2.0 application. it has enabled an interesting selection of its heritage images, encompassing a variety of themes and styles, to be made accessible to an international audience. it has also enabled this audience to engage with these images by providing feedback. while the number of comments posted is dwarfed by the impressively large number of views the site receives, they nevertheless represent a significant amount of user input. the feedback they provide is also overwhelmingly positive, with a considerable amount of praise being directed at nlnz itself, both for the quality of its collection and for its progressive attitude in making these items available on flickr in this way. if nlnz's main objectives were to find a relatively low-cost avenue for publicizing its pictorial collections, making this content more widely accessible, raising its profile internationally and promoting itself as cutting edge, then these goals can be considered to have been achieved. if, however, nlnz on the commons is to be judged by the degree to which it helps facilitate the emergence of a sustainable heritage-related participatory culture around its collections, then its achievements have to be regarded as more limited. some of the issues relate to nlnz's decision to rely on a third-party social media platform. 
considered in terms of the aforementioned strategies yakel (2011) identifies as the principal ways in which chis maintain authority control within web 2.0 environments, nlnz on the commons can be seen to be almost entirely dependent on separation techniques. to some extent, this containment takes place within the flickr photostream itself, where user-generated and contributed content remains confined to the comments thread. far more significantly, it appears to be occurring through nlnz's decision to isolate interaction with its heritage materials within a space located well apart from the home sites where its databases reside. this risks relegating any user participation that occurs here to nothing more than an entertaining sideshow: something to be kept at some distance from the main event within the official venue, and where the roles of producer and consumer are kept distinctly separate and their relationship exclusively one-way. nlnz indicates its own awareness of this arrangement in the earlier-cited blog post marking the occasion of its first year on flickr (johnston, 2008). in response to a reader's comment voicing concern about what procedures were in place to ensure "the quality and value of 'crowdsourced' data", the web master replied that this was not an issue because user-generated content like comments and tags was "not being sucked back into the library's own databases"; instead "all interactions remain on flickr". if any information was provided of the kind nlnz might consider incorporating into its catalogue records, this would be referred to curatorial staff to assess (although it was noted that so far "only a handful of comments" had been received that might qualify in this regard). otherwise, given that nlnz's official records were securely insulated from whatever users might add to its flickr photostream, the web master felt little need to moderate, verify or provide disclaimers for what was posted there, since freedom of discussion was "part of the point (and joy) of the site." as this last remark indicates, flickr does support a certain free-form mode of participatory culture about which there is much to applaud. it remains another question, however, whether it also provides a suitable space for fostering the type of sustainable online user community, adding value through thoughtful engagement with heritage content, that serves as the focus of the present study. to some extent, a chi's decision to use a third-party platform like flickr requires it to accept the need to work within the kind of web culture that already prevails there. this limiting of its ability to shape the way users interact with its content is part of the trade-off a chi makes for the lower costs and higher levels of traffic a large social media site provides. in the case of flickr, this web culture is dominated by the practice of 'social commenting', the sheer volume of which partly derives from the vested interest users have in commenting as often and widely as possible if they wish to attract more visitors to their own photostreams. while not without value, comments of this type do tend to be fleeting, superficial and at times delivered in a routine, mechanical way. their presence does not prevent users from making other kinds of contributions, such as adding information about an item. it may, however, impede comments and discussions of this sort from flourishing by swamping and dispersing them through lengthy threads dominated by "awesome!" 
and "great pic!" as a populist social media site, flickr also tends to favor a certain 'house style' in which text speak, emoticons, typos, exclamations and a 'jokey' attitude predominate. along with the inclusion of cryptic handles, eye-catching avatars and large promotional images within the comment threads, the prevailing sense is one of busy clutter and dashed-off, stand-alone remarks. this lends itself more to being quickly skimmed through for whatever catches the eye rather than engaged with, in the kind of prolonged, attentive way required to initiate and sustain discussion. the photography-centric character of flickr's web culture might also be seen to create some potential for concern, in the sense that it might divert users' attention away from considering the historical dimensions of an image towards simply appreciating its aesthetic appeal. while not wrong in itself, this nevertheless does little to further the heritage education commitments inherent in a chi's mission. compounding this situation is the general paucity of descriptive information that nlnz has provided for its flickr sets. in part, this would seem to stem from nlnz's evident wish to restrict the main space in which users connect with its collections to its own websites. as a consequence, it appears to have kept both the metadata and the image size on its flickr site deliberately insufficient in the hope that this will serve to entice visitors back to its home sites, where they might then become regular users. the risk, however is that if this does not occur, then nlnz is in the position of disseminating its images in a manner that strips them of adequate historical and archival context. all the above concerns could be ameliorated if nlnz maintained a more regular and interventionist presence on its flickr photostream. site administrators would then be able to use both the item's description space and its comments thread to actively encourage users to think about, ask questions about, add information about and discuss the cultural heritage-related aspects of these photos. however, apart from a few early and commendable efforts at building rapport through responding to users, the administrators of nlnz on the commons come across as only paying brief and sporadic attention to the comments posted here. in particular, they appear to show insufficient interest in using these contributions to crowdsource further context and metadata about their collection items despite this being an important part of the commons' raison d'être. there is for instance, little to assure users that any information they provide here is being assessed for possible inclusion as a permanent part of nlnz's own records. nor are all users' direct questions and suggested corrections responded to. responsibility for engaging with users around content appears to have been left almost entirely with the web master when, in many cases, input from curatorial staff with specialist knowledge in the broad array of subject matter covered seems to have been required. of course, any arrangement of this kind would require a significant commitment of institutional resources and a more formal set of policies and guidelines. it therefore needs to be viewed within the context of the chi's other obligations. at a time when the organization has had its hands full with two major infrastructure projects1, it is easy to understand how maintaining an effective administrative presence on flickr might have become a low priority. 
as also acknowledged in that email, the degree of planning and resources nlnz has committed so far to managing its site on the commons has not been adequate to the task of establishing and achieving clear objectives for being there. again, this is understandable given that, as elaborated in the introduction, chi use of web 2.0 remains mostly in an experimentation phase, with many organizations having required time to gain a better understanding of what web 2.0 applications can do for their institutions and communities of interest, and what is needed to make them work effectively. nlnz's partnership in the commons has so far had an exploratory character about it. it seems a positive sign, then, that it recognizes the need to dedicate more thought and resources to this exercise in user participation if it is going to take it forward. what seems less positive, though, is the fact that it felt it necessary, as part of this development process, to temporarily remove the majority of user-generated content already posted on the site. this 'pulling back' could potentially have damaged the image of the institution and its social media practice. if, however, the now-reinstated photostreams are better resourced and more closely managed, there is potential for nlnz on the commons to build on the signs of incipient user communities and crowdsourcing activity already present, and to create an effective participatory culture around its heritage content.   4.2 kete horowhenua background and character of site kete horowhenua (kh) is an example of a community-level social media initiative. launched in 2007, it is managed by horowhenua district libraries and was developed with funding from the community partnership fund (cpf), part of the new zealand government's national digital strategy. this funding programme was specifically aimed at supporting ict projects that addressed the needs of local communities and incorporated their participation. with a total population of just over 30,000, the district that kh is primarily intended to serve represents a relatively small target audience. consequently, this is also the smallest potential contributor base among the cases looked at (see the appendix). figure 2: kete horowhenua kh is the first and best known of the community digital archives using the kete open source wiki application. describing itself as "a knowledge basket of images, audio, video and documents which are collected and catalogued by the community", kh is essentially a wiki-style digital library of cultural heritage resources. as its project manager explained in a 2008 conference paper (ransom, 2008), the idea for it came about from the horowhenua library trust's wish to find a way of supporting the local heritage sector, which was struggling due to a lack of the resources and premises required for preserving large collections of historical records and making these readily accessible to the public. it was feared that, as a result, large amounts of the district's history were at risk of being lost. going digital was seen as a solution to the lack of space and resourcing required for preserving a physical collection in a centralized location, while going interactive opened up the possibility of crowdsourcing as an affordable way of creating and processing content. it is worth noting that achieving a high level of local community involvement in the creation and development of this site was always as central a goal and justification for kh as heritage preservation. 
it was likewise a condition of the cpf funding the project received. in the words of its project manager: "we wanted [kh] to be self-managing and monitoring as far as possible, with no layer of library expertise needed. 'by the people for the people' was our mantra. our community would decide what content they wanted to include and would be able to upload material in any common file format and describe it with common language. it had to facilitate the building and strengthening of relationships, not just between items in kete, but between people as well." (ransom, 2008) as this suggests, the character of content on this site is deliberately kept flexible and includes such things as brief encyclopedia-style stories, small local history publications and magazine articles, personal memoirs, collections of photographs or documents, audio and video clips, genealogical entries and reports on events. these items are organized and arranged under associated clusters known as 'topics'. in terms of subject matter, the need for an online repository of heritage materials was the initial driver of this project, and local heritage societies provided the bulk of the original seed collection. however, kh's brief was broadened at an early stage to also encompass arts and cultural activities, it having been recognized that these sectors were as constrained by the lack of suitable display space as the local heritage sector. as a consequence, the site now also hosts 'virtual exhibitions' by local artists and craftspeople, which often include price lists advertising items available for sale. in addition, many other local organizations (e.g. floral societies, music groups, seniors groups, model railway clubs) use kh as a sort of community forum for promoting and reporting activities. while contributions of this kind might be considered 'cultural heritage' defined in a very broad and inclusive way, they fall outside the emphasis on historical records and accounts of the past that serves as the focus of this study. consequently, they have for the most part been excluded from the following analysis. provision for user-generated/contributed content given kh's strong identification of itself as community-built-and-run, there is a sense in which all its content might be considered user-generated. it is true that distinctions can be drawn between, for instance, already-existing public collections from local heritage institutions which have been put up on the site, and personal memories and items from private collections contributed by families and individuals. the former, however, is not in any way presented as if it were the 'official' content and in many cases was actually uploaded to the site, as well as arranged and described, through the work of community volunteers. similarly, while many items are credited as being created by someone identifiable as a kh administrator, there is nothing to formally distinguish these entries as constituting an institutional collection or as being more authoritative than others. some of these items involve the horowhenua library trust reporting on its own activities just like any other community organization. in many others, the administrator appears to have created the item on behalf of, and in association with, groups or individuals within the community. in others still, the administrator plays the role of facilitator, initiating a topic page of some kind (e.g. a list of local street names and their origins) in a way that invites users to help further develop it. 
overall, the emphasis remains on the idea that this digital repository is "community-built", with all content accorded equal status. anyone is able to upload content to the site by registering as a member; all that is required is a name and email address. aliases are permitted but seem to be rarely used. kete members also have the option of adding a user image to further personalize their posts. as well as contributing their own content, any registered member can also edit material contributed by others. a section on each page identifies both who originally uploaded it and by whom it was most recently edited (one can also view a history log of this process). using a wiki format in this way serves to further democratize the character of the site, blurring the line between user- and institutional content by allowing anyone to have a say in how almost every item is described and arranged. kh does make provision for some topics to be sequestered into "locked baskets" that can only be edited by authorised individuals from the contributing group. however, it is a sign of kh's commitment to a participatory ethos that it emphasizes that permission for such locked baskets is rare and only granted to organizations intending to build large or unique collections (see kh help on locked baskets). as befits the wish to develop a self-managing participatory culture, kh provides a very extensive set of instructions and guidelines, including a series of online tutorials. this is by far the most extensive user-education material found in any of the cases looked at. a comprehensive kete handbook of training material is available that can be downloaded as pdf files. this not only provides step-by-step guides and exemplars for using the site features, but also advice on how to be different and innovative, thereby encouraging an open-ended approach to the site's development. as a constraining influence, there is a list of house rules designed to "help everyone enjoy participating in the kete website". these include the requirements to be tolerant towards others whose opinions and memories differ from one's own; not to be offensive or defamatory; and to respect people's privacy as well as copyright laws (see kh house rules). kh also enables users to add comments about an item. the contribution mechanism here appears to be specifically designed to reinforce a sense of a conversation taking place amongst individuals. there is an invitation to "discuss this topic" and, where comments have already been added, to "join this discussion." users are also given some say in moderating other users' comments by being able to flag these as inaccurate, inappropriate, duplicates or containing typos, enabling these issues to be drawn to the attention of site administrators (see figure 3). figure 3: soliciting information and commenting in kete horowhenua one very important aspect of kh's user participation, not necessarily apparent from the site itself but often referenced in papers and publicity materials, is the degree to which this involvement has often taken place in a physical, face-to-face environment rather than a virtual one (ransom, 2008). for instance, the initial digitization and uploading of material to the site was achieved by advertising for volunteers to come to the main library one night a week to help with this work. 
the response is reported to have been very enthusiastic, with large numbers of community members from a range of ages and experiences participating in 'weekly working bees' that eventually ran for nine months. as the project manager at the time noted (ransom, 2008), because kh is web-based, volunteers could have performed many of these tasks from home. however, the "social element" of working together with others within a shared physical space appears to have proven attractive to many. the experience of working side-by-side in a collaborative manner was also found to be productive in terms of strengthening connections between library staff and the community. overview of user-generated/contributed content there are a few 'virtual exhibitions' by artists to which people have added comments, and it is noticeable that in these cases they are all of the 'fan mail' type (e.g. "great work!"). by contrast, the cultural heritage topics with comments attached are overwhelmingly of the family history/genealogy kind, where users ask for and share information about someone, often a relative. a sizeable number of these are substantial discussion threads (i.e. three or more inter-related posts). sometimes commenters simply note a personal connection; usually, though, there is some sharing of, or request for, information, with several of the threads having the character of a collective investigation. such cases serve as good examples of the way collaborative web 2.0 activity can be used to create or enhance a resource by uncovering and interlinking information that would otherwise be lost or kept dispersed. while relatively few in number on the site, these 'productive' discussion threads can nevertheless be regarded as significant additions for a small community heritage resource of this kind. many of these same threads also give evidence of a sense of a networked community forming around an online collection. there are several examples of friendly personal exchanges between people, with the identification of family connections and shared discussion concerning a common ancestor. email addresses or other contact information is likewise often provided (with several of the commenters identifying themselves as living outside new zealand). however, this kind of participatory culture activity does seem to be largely limited to people's interest in their own family histories. it does not appear to extend all that much to interactivity around the kete resource itself, or horowhenua heritage as a whole (i.e. those commenting largely restrict their contributions to the one particular topic). ransom (2008), though, provides anecdotal evidence attesting to a significant offline participatory culture forming around kh through the activities of numerous local volunteers meeting up at venues like the library to source and upload content. running tallies of statistics available on the site (on 3 december 2013) indicate that there are currently 2,183 topics on kete, with 27,714 images, 106 audio files, 92 video files, 244 weblinks, 2,676 documents, and 368 contributions to discussion threads. these figures appear to indicate a very healthy amount of content being added for a web 2.0 repository serving a local community of this size. however, as already noted, it is difficult to differentiate what counts as user-generated from institutional content. 
for instance, many of the items credited as being created by "kete administration" appear to be, in fact, community contributions that have merely been uploaded either by staff or by community volunteers using that logon while working at the library. of the material most clearly identifiable as user-generated, a large proportion is not directly heritage-related but rather exhibitions by craftspeople or activity reports from local cultural groups. there are some indications that this type of material is beginning to dominate the site. historic photographs comprise much of the heritage content on kh. most have only a brief line or two of description, usually restricted to identifying the individuals or locations depicted, without any further explanation of historical context. quite a few of these photographs are without dates. they also generally lack any metadata indicating provenance, other than the historical society that collected them. many of the private collections of clippings and photographs found on this site are similarly under-described. for instance, one family album uploaded in 2007 has blank space left around its images for the subsequent adding of explanatory text; six years on, this has yet to be completed. in some cases, these photographs are indicated to be part of a "topic" that links them to other material in kh. in other cases, though, a system of informal and not always consistent tags provides the only principle of arrangement and connection with other content. one example of a more systematically organized heritage topic is a dossier of individual items recounting the achievement of a local 15-year-old who, in 1989, became the youngest person to swim across cook strait. this consists of 32 images and 26 documents (e.g. press releases and official letters of congratulations) that can be scrolled through as primary material relating to this story. again, however, apart from a single line of introduction and some brief captions, there is no accompanying text. another example containing more contextual information is a set of photographs and diagrams depicting the early 20th century construction of a local hydro-electricity scheme. this appears to have been put together by a local historian and may possibly be part of a book, although this is not made clear. a number of items are digitized versions of printed publications (e.g. a 1988 booklet produced for a local school centenary). these publications have generally not been reproduced as pdfs; hence the text is searchable but the original look is not maintained. in some cases, such as with the school centenary booklet, there has been an effort to indicate the way content was arranged on individual pages and the order of these in relation to one another. however, the integrity of the original publication is somewhat compromised by the fact that additional images and information have also been included and packaged together as a single item. in other cases, such as newspaper supplements, different pieces of text and images have been reproduced as independent snippets, in ways that make the relationships and continuities between items found in the originals difficult to trace. more straightforward is the presentation of shorter published material, examples of which include a story about a local pioneer aviator originally printed in an aviation magazine and what appears to be a newspaper article about an early 20th century maori politician. 
in the latter case, however, neither the date nor the original source of the article has been provided. other examples of the wide variety of heritage content found on this site include a brief anecdote from someone who worked in the area as a taxi driver in the 1960s; an illustrated book-length memoir by a local farmer and public servant; and two well-written, well-referenced encyclopedia-style pieces about early 19th century maori conflict in the region, contributed by a local historian. one potentially very useful set of primary resources digitized on this site is a set of native land court minutes from 1866 to 1868, which is also viewable as a pdf. there is a very noticeable presence by the kete administrators on the comments threads. some of these posts are credited to "kete administration"; others to individuals who identify themselves as representatives of kh. of all the case study sites, kh definitely counts as the one that is most actively facilitated by administrators. this administrative input involves not only answering direct questions but also regularly thanking users for their contributions, actively encouraging people to add more comments, suggesting avenues of further discussion and generally participating in the conversations. there is likewise evidence of administrators putting people in contact with one another 'behind the scenes' (e.g. through facilitating the exchange of contact details). given kh's very strong emphasis on encouraging community involvement in its site, it does not seem all that surprising to find its administrators playing a much larger facilitative role than was the case with nlnz. analysis of all the case studies examined, kh is the one that is most strongly centered around participatory culture ideals, in terms of the extent to which it seeks to facilitate not only user-generated content but also user involvement in the development and display of its collections. kh employs a wiki format that invites anyone to have a say in how an item is edited or formatted. in terms of managing authority control issues, kh and its associated websites are the most institutionally self-effacing while, significantly, also being the most administratively involved of the chi sites examined. to some degree kh can be considered the poster child of user-centric, community-based chi websites in new zealand. it has won awards and received much attention both in new zealand and overseas, including the 2007 3m award for innovation in libraries and the 2007 world summit awards: special mention (north america and oceania) in the e-inclusion section. there are two obvious aspects in which kh does not fare well in comparison with the other cases looked at. one is the overall quality of the writing and presentation of its content: one could argue that the lack of 'polish' or 'professionalism' in the writing might make the site unappealing as a resource for serious historical researchers. the other is the very small size of the image files used (often of relatively poor quality), which might render them generally unsuitable for downloading and re-use. such views are of course subjective, and those involved with kh would not necessarily dispute them. ransom (2008), for instance, puts a lot of emphasis on "informality" and the need for the horowhenua community to be able to express themselves on the site in their own way and in their own words, indicating that the site is primarily intended to serve the needs of locals rather than outsiders.   
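as flagged in section 3.2, the following is a minimal illustrative sketch of the kind of category coding and tallying applied to the comment threads in both case studies. it is a simplified approximation only: the category names, keyword lists and sample comments below are hypothetical assumptions introduced for illustration, not the actual coding instrument used in this study.

```python
# illustrative only: a minimal keyword-based coder for user comments,
# approximating the coding categories described in section 3.2.
# the codebook, category names and sample data below are hypothetical.
from collections import Counter

CODEBOOK = {
    "appreciation": ["great", "awesome", "stunning", "wow", "nice", "perfect shot"],
    "question": ["?", "does anyone", "do you know", "can you tell"],
    "added information": ["taken in", "the date is", "this shows", "according to"],
    "personal reminiscence": ["my grandfather", "my family", "my ancestor", "i remember"],
    "group invitation": ["i'm an admin for a group", "added to our group"],
}


def code_comment(text: str) -> list[str]:
    """return every category whose keywords appear in the (lower-cased) comment."""
    text = text.lower()
    matches = [category for category, keywords in CODEBOOK.items()
               if any(keyword in text for keyword in keywords)]
    return matches or ["other"]


def tally(comments: list[str]) -> Counter:
    """aggregate category counts across a collection of comments."""
    counts = Counter()
    for comment in comments:
        counts.update(code_comment(comment))
    return counts


if __name__ == "__main__":
    sample_comments = [
        "awesome!! great shot!",
        "my grandfather served at gallipoli in 1915.",
        "hi, i'm an admin for a group called ships, and we'd love to have this added to our group!",
        "does anyone know when this photo was taken?",
    ]
    for category, count in tally(sample_comments).most_common():
        print(f"{category}: {count}")
```

in practice, the coding in this study was done by close reading rather than by keyword matching alone; a script along these lines would at best provide a first-pass tally for a human coder to verify.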
5.0 summary despite the high level of interest in the potential usefulness of web 2.0 for chis evident from the professional and academic literature, a scan of current new zealand chi websites indicates relatively low levels of web 2.0 implementation of the kind that facilitates participatory culture. instead, the social media applications most used by chis take the form of blogs, twitter feeds and facebook pages that are primarily aimed at promotional activities rather than at fostering user-contributed content and a sense of online community. in short, most chis are still using social media for less sophisticated forms of participation. there is evidence, nevertheless, that a considerable number of chis in new zealand are using social media and, in a handful of cases, appear to be exploring the use of these media to facilitate participatory communication amongst their communities of interest. in this paper, two case studies were presented to highlight the differences between the less and more sophisticated forms of participation. the content analyses reveal notable differences between the cases as to the nature and extent of activities and user comments, especially in terms of whether these could be considered to add value by providing information or context missing from the original cultural content. it needs to be noted that this study is primarily descriptive, not prescriptive, and any initial recommendations offered are prefaced with a call for further empirical research and theoretical insights on this matter. overall, within the cases examined, only a very small proportion of the items that are comments-enabled have comments attached; most of these have only one or two posts; and there are few signs of interaction between or amongst users. this does not necessarily mean there is a lack of public interest in using such features, but it does indicate that the chis are yet to fulfil their participatory culture potential. the largest of the case studies, nlnz on the commons, is overwhelmingly comprised of brief 'fan mail' style appreciations (of a kind common on flickr) that appear to offer little from a participatory culture or heritage engagement perspective. it is notable that nlnz on the commons is hosted on a third-party social media platform that allows users to set up free accounts outside the regular constraints of approval processes through the institution's own it department. social media use in this case is seemingly very much still in an early phase of informal experimentation, occurring outside the existing norms and standards of the institution's ict use policies. this effort seems to fall within what mergel and bretschneider (2013) call 'decentralised, informal early experimentation', characterised by the use of social media as a means of representing the agency within the channels its users and stakeholders already occupy, while still testing out different approaches, operating in a grey area, and with little coordination of social media activities and user-generated content in line with the institution's mission. overall, the direct ongoing involvement of site administrators appears comparatively low in terms of aspects such as providing instructional scaffolding, actively facilitating discussion, answering users' questions, responding to contributed information, etc. this is especially so with the national chi websites (nlnz on the commons and other sites examined in the preliminary phase, e.g. 
nzhistory online), where the use of web 2.0 comes across as a minor, sideline initiative that is under-resourced and poorly coordinated with the official institutional mission. while the commons on flickr promotes itself as a crowdsourcing resource for chis to enhance their collections, nlnz's use of it appears to be more about widening the exposure of items and raising its institutional profile. these are in themselves important goals, but this nevertheless indicates a gap between participatory culture rhetoric and reality. it highlights that a 'build it and they will come' approach, i.e. simply putting a web 2.0 platform in place, does not guarantee activity, let alone participation. so the questions 'if they come, will they participate?' and 'if they participate, will we (the institution concerned) do the same?' will also need to be considered. by contrast, the regional, community-initiated chis appear more committed to fostering community involvement in their collections. kh, for instance, shows promising signs of a more coordinated effort to leverage community knowledge and foster participatory culture around its heritage content. it is an example demonstrating how social media can be used to shift curatorial communication from one-to-many to many-to-many communication, whereby curatorial knowledge acts as a hub around which an online community of interest can build. by promoting user-contributed/generated content around its collection, kh enables cultural participants to be not just consumers, but also potential critics and creators of digital culture. however, for small organisations, under-resourcing appears to be a major obstacle to sustaining social media efforts. on 27 november 2013, ransom addressed the new zealand national digital forum to draw attention to a 'crisis': kete needs an upgrade to avoid security risks to its digital assets and community-contributed content (including those generated on kh), but there is no sustainable financial model to support this. this was unfortunate but not surprising news. indeed, as rewarding as engagement with web 2.0 communities can be, many chis (especially those at the local, community level) lack the dedicated funding, human resources and infrastructure needed not only to get such initiatives off the ground but to sustain them once developed (jett et al., 2010).   6.0 conclusion and indication of further research in the conclusion to her book, theimer (2011) re-emphasised that archive 2.0 should not be thought of simply as the sum of 'archive + web 2.0'. she proposes that the transformation of archival practice in the age of social media should be about facilitating a mind-set amongst archivists focused on openness, user-centeredness and flexibility, rather than simply about deploying a technology. this would apply to other chis as well. most chis are still a long way from reinventing themselves through new technologies and media, in making their collections open to the public to use, repurpose and re-contextualise in a way that promotes interactivity, engagement and knowledge co-construction and exchange. nevertheless, an increasing number of these memory institutions are moving in that direction, even though, at the moment, the overall use of web 2.0 applications by chis to foster participatory culture appears to be still very much in an informal, experimental phase. 
on-going changes in technology, as well as a tentative, exploratory approach by chis unsure of the degree to which they wish to commit to participatory culture and how this might impact on their practice, mean that there is at present an ever-evolving variety of ways and purposes of engaging in this area. similarly, while there is a rapidly growing body of literature around the topic of participatory culture, this is spread relatively thinly across a large number of disciplines with different interests and orientations. consequently, there seems at the moment to be insufficient coherence in terms of focused definitions and discussion of best practice. the case studies conducted in this study have proven useful in providing a detailed, informative snapshot of what is presently happening in new zealand chis' social media initiatives, with a particular focus on the extent of participatory culture established. further, by taking a methodological approach that provides a close analysis of the actual character of user-generated and contributed content, as well as the patterns of links between comments posted by users and between site administrators and users, these case studies address what has been identified as a significant research need by other researchers (yakel, 2011). the study also highlights the need to address many remaining questions. there is no doubt that social media presents its own set of issues and challenges as well. this paper concludes by drawing attention to a number of issues needing investigation. the fact that more people can now produce and disseminate content (including cultural objects) rather than just interpret it means that institutions and individuals who have traditionally been part of these production processes are having to adapt to this new mode of production. we are now seeing 'ordinary citizens and users' being included in cultural processes that were once solely the domain of collection managers, curators and other experts, often attached to institutions. this is not a challenge unique to the heritage discipline. other examples that illustrate this shift and the need to adapt include citizen journalism, online activism and art production, where deviantart claims to be "the world's largest online art community". in addition to operational barriers such as limited funding and resources, introducing participatory processes into chi practices has also raised some concerns around authority, liability and credibility (yakel, 2011). the relationship between experts and non-experts and the processes through which chis use and make sense of user-generated and contributed content have yet to be resolved. the separation of folksonomy from the 'official finding aid', for instance, is ubiquitous in archives 2.0, and this separation appears to be a way for institutions to 'protect' the authority of the finding aids (and perhaps of the institutions themselves). this appears to be the case with the nlnz on the commons examined in this study. for munster and murphie (2009), while the '2.0' characteristics of 'participatory', 'dynamic' and 'user-centred' are presented as positives that sell web 2.0 to general users and excite us with the possibilities they promise, these features are often offered within the framework of business and marketing. they argue, therefore, that while on one hand web 2.0 promises user empowerment, on the other hand it threatens control and exploitation.
it may also be possible that an institution's willingness to engage in participatory communication via social media, or lack thereof, is already established and embedded within its organisational culture or the associated occupational culture. along with 'radical trust' (fichter, 2006), a concept that is gaining currency within the social media debate, it would be useful to study the influence of organisational and occupational cultures on institutions' social media practices and an institution's inclination to 'control and exploit' or to 'trust and include' its community.
acknowledgements i would like to acknowledge the victoria university of wellington — university research fund 107851 for the undertaking of this research and the research assistance provided by reid perkins.
notes 1 email correspondence from reuben schrader, web editor of the national library of new zealand, on 5 december 2012.
references
[1] andrejevic, m. (2011). social network exploitation. in papacharissi, z. (ed.) a networked self: identity, community, and culture on social network sites. new york: routledge, (pp.82—102).
[2] bernstein, s. (2008). top 10 reasons the commons on flickr is awesome.
[3] bruns, a. (2009). from prosumer to produser: understanding user-led content creation. paper presented at transforming audiences, london, 3-4 september, 2009.
[4] chad, k. and miller, p. (2005). do libraries matter? the rise of library 2.0.
[5] chan, s. (2008). commons on flickr — a report, some concepts and a faq — the first 3 months from the powerhouse museum, fresh & new(er).
[6] cocciolo, a. (2010). can web 2.0 enhance community participation in an institutional repository? the case of pocketknowledge at teachers college, columbia university. the journal of academic librarianship, 36(4), 304-312.
[7] daines, j. g. and nimer, c. l. (2009). web 2.0 and archives. the interactive archivist: case studies in utilizing web 2.0 to improve the archival experience.
[8] falk, j. h. (2006). an identity-centered approach to understanding museum learning. curator: the museum journal, 49, 151—166. http://doi.org/10.1111/j.2151-6952.2006.tb00209.x
[9] fichter, d. (2006). web 2.0, library 2.0 and radical trust: a first take.
[10] flinn, a. (2010). an attack on professionalism and scholarship?: democratising archives and the production of knowledge. ariadne, 62.
[11] freeman, c. g. (2010). photosharing on flickr: intangible heritage and emergent publics. international journal of heritage studies, 16(4), 352—368. http://doi.org/10.1080/13527251003775695
[12] gurian, e. h. (2006). civilizing the museum: the collected writings of elaine heumann gurian. london: routledge.
[13] henman, e. (2011). analyzing flickr the commons (part 2).
[14] huvila, i. (2008). participatory archive: towards decentralized curation, radical user orientation, and broader contextualization of records management. archival science, 8(1), 15-36. http://doi.org/10.1007/s10502-008-9071-0
[15] jenkins, h. (2006). fans, bloggers, and gamers: exploring participatory culture. new york: new york university press.
[16] jett, j., palmer, c.l., fenlon, k., & chao, z. (2010). extending the reach of our collective cultural heritage: the imls dcc flickr feasibility study. proceedings of the annual meeting of the american society for information science & technology, 47.
[17] johnston, c. (2008). awesome photo — thanks!! or, what i've learnt from our flickr pilot in librarytechnz. (blog).
[18] krause, m. and yakel, e. (2007).
interaction in virtual archives: the polar bear expedition digital collections next generation finding aid. the american archivist, 70(2), 282-314.
[19] lovink, g. (2012). networks without a cause: a critique of social media. cambridge, uk: polity press.
[20] mergel, i. and bretschneider, s. i. (2013). a three-stage adoption process for social media use in government. public administration review, 73, 390—400. http://doi.org/10.1111/puar.12021
[21] miller, f. (2012). time flies! celebrating 4 years of the commons on flickr. (blog).
[22] munster, a. and murphie, a. (2009). editorial—web 2.0: before, during and after the event, fibreculture journal, 14.
[23] nogueira, m. (2010). archives in web 2.0: new opportunities. ariadne, 63.
[24] oomen, j. and aroyo, l. (2011). crowdsourcing in the cultural heritage domain: opportunities and challenges. http://doi.org/10.1145/2103354.2103373
[25] palmer, j. and stevenson, j. (2011). something worth sitting still for? some implications of web 2.0 for outreach. in theimer, k. (ed.) (2011). a different kind of web: new connections between archives and our users. chicago: society of american archivists, (p.2).
[26] palmer, c.l., zavalina, o., & fenlon, k. (2010). beyond size and search: building contextual mass in digital aggregations for scholarly use. proceedings of the annual meeting of the american society for information science & technology, 47.
[27] proctor, n. (2010). digital: museum as platform, curator as champion, in the age of social media. curator: the museum journal, 53(1), 35-43. http://doi.org/10.1111/j.2151-6952.2009.00006.x
[28] ransom, j. (2008). kete horowhenua: the story of the district as told by its people. presented at the 2008 vala conference, melbourne, 5-7 february 2008.
[29] russo, a., watkins, j., kelly, l. and chan, s. (2008). participatory communication with social media. curator: the museum journal, 51, 21—31. http://doi.org/10.1111/j.2151-6952.2008.tb00292.x
[30] samouelian, m. (2009). embracing web 2.0: archives and the newest generation of web applications. the american archivist, 72, 42-71.
[31] shilton, k. and srinivasan, r. (2007). participatory appraisal and arrangement for multicultural archival collections. archivaria, 63, 87-101.
[32] tennant, r. (2010). tragedy of the (flickr) commons? the digital shift.
[33] terras, m. (2011). the digital wunderkammer: flickr as a platform for amateur cultural and heritage content. library trends, 59(4), 686—706. http://doi.org/10.1353/lib.2011.0022
[34] theimer, k. (ed.). (2011). a different kind of web: new connections between archives and our users. chicago: society of american archivists.
[35] van den akker, c., et al. (2011). digital hermeneutics: agora and the online understanding of cultural heritage.
[36] yakel, e. (2011). balancing archival authority with encouraging authentic voices to engage with records. in theimer, k. (ed.) (2011). a different kind of web: new connections between archives and our users. chicago: society of american archivists, (pp. 75—101).
[37] zorich, d., waibel, g. & erway, r. (2008). beyond the silos of the lams: collaboration among libraries, archives and museums. report produced by oclc programs and research.
http://www.oclc.org/programs/reports/2008-05.pdf
appendix chis examined in the preliminary stage of this study
national-level
archives new zealand
digital new zealand
the national library of new zealand/alexander turnbull library
the new zealand film archive
nzmuseums (te papa-initiated)
nz on screen
te ara — the encyclopedia of new zealand
te papa tongarewa (national museum)
local-/community-level
ceismic: canterbury earthquake digital archive
hocken digital collections
kete christchurch
kete horowhenua
he kete kōrero — taranaki maori digital archive
the prow
tauranga memories kete
about the author chern li liew is a senior lecturer in the school of information management, victoria university of wellington (new zealand). she holds a phd in information studies from nanyang technological university (singapore) and an msc from loughborough university (uk). her research interests center on digital information services as socio-technical environments and she has published in the areas of digital libraries and digital cultural heritage: http://www.victoria.ac.nz/sim/about/staff/chernli-liew. she is on the editorial advisory boards of "international journal of digital library systems" and "online information review". she is also a member of the executive committee of the consortium of ischools asia pacific.
copyright © 2014 chern li liew

d-lib magazine january 2002 volume 8 number 1 issn 1082-9873 preservation risk management for web resources virtual remote control in cornell's project prism   anne r. kenney ([email protected]) nancy y. mcgovern ([email protected]) peter botticelli ([email protected]) richard entlich ([email protected]) carl lagoze ([email protected]) sandra payette ([email protected]) cornell university   actuaries spend their careers figuring out what benefits a company should offer, at what price, and for how long. their job is to make sense of all the empirical and statistical evidence of age, gender, health, heredity, life styles, physical habits, and living and working conditions that serve as indicators of longevity, productivity, and obligation. how well they do their job depends on how good their evidence is, how skilled they are at reading it, and how risk tolerant their customers are. archivists and research librarians interested in preserving web resources face a similar challenge. libraries increasingly depend on digital assets they neither own nor manage. this article describes current web preservation efforts by libraries and archives and suggests how a new preservation strategy could use a risk management methodology. cornell's project prism is exploring technologies and tools to assess the lifestyle and habits of the web that research libraries and other entities can monitor and use to develop retention policies for online resources. project prism's approach begins with characterizing the nature of preservation risks in the web environment, develops a risk management methodology for establishing a preservation monitoring and evaluation program, and leads to the creation of management tools and policies for virtual remote control.
the approach will demonstrate how web crawlers and other automated tools and utilities can be used to identify and quantify risks; to implement appropriate and effective measures to prevent, mitigate, recover from damage to and loss of web-based assets; and to support post-event remediation. project prism is producing a framework for developing an ongoing comprehensive monitoring program that is scalable, extensible, and cost effective. growing dependence on web-accessible resources academic libraries have dramatically increased their offerings of online resources. a survey of the 21 members of the digital library federation revealed that 40% of their costs for digital libraries in 2000 went for commercial content.1 the big-ticket items were electronic scholarly journals that libraries license rather than own. yet little direct evidence shows that publishers have developed full-scale digital preservation capabilities to protect this material, and research libraries continue to purchase the print versions for preservation purposes. however, none appears ready to forgo access to the licensed content just because its long-term accessibility might be in question. research libraries are also including in their catalogs and gateways more open-access web resources that are not covered by licenses or other formal arrangements. a spring 2001 survey of cornell's and michigan's making of america collections revealed that nearly 250 academic institutions link directly to the moa collections, although neither university has committed to provide other entities with long-term access. similarly, a review of the holdings of several research library gateways over the past few years indicates growth in the number of links to open-access web resources that are managed with varying degrees of control.2 approximately 65% of the electronic resources on cornell's gateway are unrestricted, and additional open resources are included in aggregated sets that are available only to the campus community.3 one of the links is to the university of california, berkeley's cpu info center. this resource is notable because tom burd, the site manager, has done several things to advance its preservation, including establishing a mirror site, documenting changes, and providing a checksum in the source page. a recent note posted on this site, however, demonstrates how fragile such resources can be: "i am no longer affiliated with u.c. berkeley, and it has become very difficult to maintain this site. with the state of the web now, as compared to when i started this site in 1994, i'm not sure if it even warrants continuing on in light of many other online resources. as such, i will probably bring this site to a close in the coming weeks. if someone wanted to take over maintaining this site, i would be happy to tar up all the files and hand them over. please drop me a line if you are interested…" 4 estimates put the average life expectancy of a web page between 44 days and two years and a significant proportion of those that survive undergo some change in content within a year. since 1998, oclc's web characterization project has tracked trends in growth and content of the publicly available web space. one of the more revealing statistics, ip address volatility, identifies the percent of extant ip addresses from one year to the next. in a fairly consistent trend since 1998, slightly over half (55-56%) the ip addresses identified in one year are still available the next. within two years, a little over a third (35-37%) remain. 
four years later only 25% of the sample 1998 ip addresses could be located.5 oclc's annual review points to the instability of web resources; it doesn't indicate whether those resources still exist elsewhere on the web or whether the content has changed. while some resources disappear, others become unfindable due to the well-known problem that urls change.6 a recent preservation review of the 75 smithsonian institution web sites noted that an exhaustive search could not locate a copy of the first smithsonian web site, created in 1995. a url may persist while content changes wildly: the editors of rlg diginews discovered that links in several past issues pointed to lapsed domain names that had been converted by others into porn sites. much attention has been paid to unstable urls and to creating administrative/preservation metadata, but to date no evidence suggests that research libraries are privileging open access sites that utilize some form of urn or that document content change. even if such precautions were fully implemented, web resources are particularly vulnerable to external attacks. this past year, the internet was hit hard by the nimda worm, which took down 150,000 computers, and the code red virus, which struck more than 12,000 web sites in the u.s. in june 2001, microsoft had to issue a patch to protect its attack-vulnerable internet information server (iis) software, which is used by approximately 16 million web sites.7 current web preservation efforts by libraries and archives with the growing dependence on external digital assets, libraries and archives are undertaking some measures to protect their continued use of these resources. efforts can be grouped into three areas: collaborating with publishers to preserve licensed content, developing policies and guidelines for creating and maintaining web sites, and assuming archival custody for web resources of interest. licensed content publishers and librarians alike are grappling with how best to preserve licensed content. publishers are developing their own preservation strategies as they realize the commercial benefits of creating deep content databases. elsevier science among others has committed to building an electronic back file collection for all its publications and intends to maintain these electronically archived copies "forever." in 1999 the company developed a formal archiving policy (updated in 2001) that has been added to all licenses for sciencedirect.8 a number of publishers are also working with third parties to back up, store, and refresh digital content. oclc recently announced the formation of the digital and preservation resources division to provide integrated solutions for creating, accessing, and preserving digital collections. with planning grants from the andrew w. mellon foundation, seven research libraries and key commercial and scholarly publishers are exploring formal archiving arrangements for e-journals. in 2002, mellon intends to fund up to four continuing projects to gain practical experience with the functions and costs of constructing and operating e-journal archives for several years.9 policies and guidelines for creating and maintaining web sites recommendations for building web sites have addressed digital preservation indirectly. 
the world wide web consortium's (w3c) web content accessibility guidelines, techniques, and checklist provide some recommendations for good resource management (e.g., use of standard formats and backward compatible software) and have had a major impact on the development of web materials worldwide, especially for institutions affected by legislative requirements to meet the needs of disabled users, as outlined in such legislation as the americans with disabilities act and the rehabilitation act (section 508).10 however, the w3c guidelines do not expressly address content stability, documentation of change, or good database management. in fact preservation and records management issues are noticeably absent. in the united states, web preservation is more directly supported through government policies and guidelines to promote accountability, spurred in part by such legislation as the paperwork reduction act.11 for at least the past five years, charles r. mcclure and j. timothy sprehe have investigated policies and guidelines affecting state and federal agency web sites.12 their 2000 study, performance measures for federal agency websites, evaluated federal policies and defined criteria and performance measures—including those pertaining to record keeping—for assessing agency compliance with those policies.13 governments are also promulgating specific policies and recommendations for preserving government-supported web content. in january 2001, the u.s. national commission on libraries and information science published "a comprehensive assessment of public information dissemination," which recommends legislation that would "formally recognize and affirm the concept that public information is a strategic national resource."14 another recommendation is to "partner broadly, in and outside of government, to ensure permanent public availability of public information resources." the archivist's perspective has been quite influential, as arguments are advanced to treat web sites as important records in their own right.15 national archives in many countries are developing policies and guidelines. the u.s. federal records act, as amended, requires that agencies identify and transfer web site records to agency record keeping systems, including the national archives and records administration (nara), for permanent retention.16 nara has issued several bulletins on the disposition of electronic records that include web sites.17 it has also slowly begun to respond to this new form of record keeping and has appraised at least one federal web site as a permanent record. in late 2000, nara established an initiative to capture a snapshot of all federal web sites at the end of the clinton administration.18 nara has also contracted with the san diego supercomputer center for a project to investigate the preservation of presidential web sites.19 the national library of australia (nla) has been a world leader in promulgating guidelines for preservation. 
in december 2000 the nla issued safeguarding australia's web resources, which provides advice on creating, describing, naming, and managing web resources.20 the council on library and information resources funded nla's safekeeping project, which targets 170 key items accessible through padi (preserving access to digital information).21 nla staff wrote to the resource managers encouraging them to voluntarily preserve these materials and outlined nine strategies for long-term access.22 responses have been received from 116 resource owners and to date, safekeeping arrangements have been made for 77 items. negotiations are in progress for an additional 33 resources. eight resource owners lacked the appropriate infrastructures to comply with the recommendations. alternative "safekeepers" have been approached for four of these. by the end of 2001, 54 resource owners had not responded.23 assuming archival custody the third major focus of web preservation has been to identify and ingest web content into digital repositories. the best-known example is the internet archive, a not-for-profit organization associated with alexa internet, which has been automatically collecting all open access html pages since 1996.24 also in 1996, the national library of australia's pandora adapted web crawling to archive selected australian online publications.25 that same year, the royal library of sweden launched kulturarw3 to collect, preserve, and make accessible swedish electronic documents published online.26 for pandora, ingest includes manual creation and/or clean up of metadata and the establishment of content boundaries. this approach may be cost effective for a few highly valuable documents, but may be prohibitively expensive for large collections. important web archiving projects continue throughout the world.27 on october 24, 2001, the internet archive released the wayback machine, which lets users view snapshots of web sites as they appeared at various points in the past. with over 10 billion web pages exceeding 100 terabytes of data and growing at a rate of 12 terabytes a month, the internet archive provides the best view of the early web as well as a panoramic record of its rapid evolution over the past five years. it provides an invaluable tool for documenting change and filling some of the void in record keeping in the web's early days. we owe a debt of gratitude to the founders of the internet archive for the foresight and plain boldness of such an imposing task. nevertheless, it would be a mistake to conclude that the challenge has been met and the rest of us can relax. as impressive as the accomplishments of the internet archive are to date, this approach to web preservation is only part of the solution to a much larger problem. the internet archive and similar efforts to preserve the web by copying suffer from common weaknesses that they readily acknowledge:28 snapshots may or may not capture important changes in content and structure.29 technology development, including robot exclusions, password protection, javascript, and server-side image maps, inhibits full capture. a web page may serve as the front end to a database, image repository, or a library management system, and web crawlers capture none of the material contained in these so-called "deep" web resources.30 the sheer volume of material on the web is staggering. the high-speed crawlers used by the internet archive take months to traverse the entire web; even more time would be needed to treat anomalies associated with downloading. 
not all sites merit the same level of attention, especially given limited resources, and means must be devised for honing selection and treating materials according to their needs. automated approaches to collecting web data tend to stop short of incorporating the means to manage the risks of content loss to valuable web documents. file copying by itself fails to meet the criteria rlg and oclc have identified in attributes of a trusted digital repository.31 for example, the internet archive has not overtly committed to continued access through changing file formats, encoding standards, and software technologies. legal constraints limit the ability of crawlers to copy and preserve the web. project prism: preservation risk management for web assets web preservation efforts to date address major areas of concern, but fail to consider the challenge of preserving content that an institution does not control or for which it cannot negotiate formal archiving arrangements or assume direct custody. over time, preserving web content will require substantial resource commitments, as well as flexible and innovative approaches to changes in technologies, organizational missions, and user expectations. cornell university's project prism is a joint research effort by the computer science department and the university library to support libraries and archives as they extend their role from custodians of physical artifacts to managers of selected digital objects distributed over the network. digital curatorial responsibilities will need to be reconsidered and undertaken in light of cost, level of participation by cooperative or uncooperative partners, and technical feasibility. at the same time, we aim to design archiving tools and services that will enable non-librarians to raise the information integrity of research collections that are now managed haphazardly, if at all. ultimately, we seek an approach to archiving distributed web content that takes custody of digital files as a last resort, though the methodology could also be used for pre-ingest management. we are exploring a noncustodial, distributed model for archiving, in which resources are managed along a spectrum, from, at the highest level, a formal repository to, at the lowest level, the unmanaged web. one of our goals is to show how the integrity of unmanaged resources can be raised at minimal cost, using automated routines for monitoring and validating files according to policies established by organizations that value the longevity of those resources. our overall goal is to create archiving tools that will enable libraries, archives, commercial database providers, scholarly organizations, and individual authors to manage different sets of risks affecting the same resources remotely. a risk-based preservation management program begins with two key questions: what assets may be at risk and should be included in the program, and what constitutes risks to those assets? risk is a relative term—an event or threat may be risky in one environment but not in another.32 therefore, risk management programs should be developed and implemented within an organizational context: each institution will need to define its own "worry radius"—the context that provides definitions of perceived risk and acceptable loss.33 effective risk management also requires determining the scope and value of assets. 
the cost of implementing the program should be appropriate to the estimated value of the assets and the impact of their loss on operations and services.34 overview of risk management risk management is becoming a business in itself. that was true before september 11, but in its wake demand for risk management policies, organizations, consortia, and consultants has escalated.35 the internet is crammed with tools, guides, and services for every size and type of organization or industry.36 insurance providers, healthcare providers, nonprofit organizations, environmental monitors, and financial investors are all particularly active areas in these developments.37 on the business side, the literature reflects an evolution from disaster planning to business continuity to risk management. the rapidly growing areas of capital assets and digital assets management are closely aligned with these risk management developments. in the academic arena, institutions such as the wharton risk management and decision processes center are developing multidisciplinary approaches that combine methods and techniques from a wide range of domains, including decision analysis, public policy, economics, political science, and psychology.38 the boom in risk management has not applied to digital preservation. the nla's archiving web resources guidelines contains a chapter on assessing risk, but it largely addresses the need to track changes to web sites for record keeping, primarily to reduce liability and accountability risks to the organization, not the risks to its web-based content. risk management of digital information: a file format investigation develops a risk management methodology for migration.39 the report dissects the migration process and identifies risk categories and specific risks. project prism will adapt this kind of risk management to web-based materials. much of the risk management literature presents practical, commonsense approaches to generic problems or domain-specific requirements. even traditional risk management is addressing technology-related issues. regarding web resources, the literature acknowledges that a key challenge is to balance flexible access to web sites against the security needed to protect them.40 risk management models many of the proposed models cited in the literature share a common progression for establishing a risk management program.41 project prism has four main phases that map well to the typical states of risk management programs. table 1: risk management stages risk identification is the process of detecting potential risks or hazards through data collection. a range of data collection and manipulation tools and techniques exists.42 in phase 1 of prism, the team is using both automated and manual techniques to collect data and begin to characterize potential risks to web resources. web crawling is one effective way to collect information about the state of web pages and sites. the prism team employs the mercator web crawler to collect and analyze data to test hypotheses about the relationship between observable characteristics of web resources and threats to longevity.43 the modular and extensible nature of mercator makes it a powerful tool for customized analyses. risk classification44 is the process of developing a structured model to categorize risk and fitting observable risk attributes and events into the model. 
the oecd's chemical accident risk assessment thesaurus (carat™) is a good example of such a risk classification.45 the prism team combines quantitative and qualitative methods to characterize and classify the risks to web pages, web sites, and the hosting servers.46 risk assessment is the process of defining relevant risk scenarios or sequences of events that could result in damage or loss and the probability of these events.47 many sources focus on risk assessment. rosenthal describes the characteristics of a generic standard for risk assessment as "transparent, coherent, consistent, complete, comprehensive, impartial, uniform, balanced, defensible, sustainable, flexible, accompanied by suitable and sufficient guidance."48 variables to consider in assessing risk include the value of assets, possible threats, known vulnerabilities, likelihood of loss, and potential safeguards. in project prism, we are defining a data model for storing risk-significant information. this model reflects key attributes about web assets, observed events in the life of these resources, and information about the resources' environment. a key aspect of risk assessment in prism is defining and detecting significant patterns that may exist in this data. risk analysis determines the potential impact of risk patterns or scenarios, the possible extent of loss, and the direct and indirect costs of recovery.49 this step identifies vulnerabilities, considers the willingness of the organization to accept risk given potential consequences, and develops mitigation responses.50 artificial intelligence methods, decision support systems, and profiles of organizations all support risk analysis. the resulting knowledge and exposure databases provide evolving sources of information for analyzing potential risks.51 project prism is developing a knowledge base that could be characterized as a risk analysis engine. risk management implementation defines policies, procedures, and mechanisms to manage and respond to identifiable risks. the implemented program should balance the value of assets and the direct and indirect costs of preventing or recovering from damage or loss. the program should be known and understood both within the organization and by relevant stakeholders.52 an effective program includes comprehensive scope, regular audits, tested responses and strategies, built-in redundancies, and openly available, assigned responsibilities.53 bringing all of the pieces together for a fully implemented risk management program involves establishing holistic policies and compliance monitoring, developing ways to measure program effectiveness, managing the development and deployment of countermeasures, identifying incentives, building the risk management team, and developing or adapting supporting tools for the program.54 automated strategies to support preservation risk management project prism is exploring technologies that will form the basis for a suite of tools to support risk-based preservation monitoring and evaluation of web resources. from a technical perspective, our goal is to design feasible and appropriate mechanisms for off-site monitoring. assuming that over time libraries and other information intermediaries will extend their collecting scope over greatly increasing amounts of distributed content and that the longevity of these resources will be a primary concern, automatic methods will be needed to deal with such volume cost effectively and for consistent results that are less prone to human error. 
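as a rough, concrete illustration of the kind of automated off-site monitoring pass this implies, the sketch below fetches a small list of urls and records, for each, the http status code, the response time, and a hash of the returned content so that a later pass can detect change. it is a minimal sketch only, written against the python standard library; the watch list, field names and log file name are illustrative assumptions, not part of project prism or its tools.

import csv
import hashlib
import os
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

# illustrative list of resources within an organization's "worry radius"
WATCH_LIST = [
    "http://example.org/collection/item1.html",
    "http://example.org/collection/item2.html",
]

def snapshot(url, timeout=30):
    """fetch one url and return risk-significant observations for this pass."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
            status = response.status
            digest = hashlib.sha256(body).hexdigest()
    except urllib.error.HTTPError as err:
        status, digest = err.code, ""
    except (urllib.error.URLError, OSError):
        status, digest = None, ""          # unreachable: no http response at all
    elapsed = time.monotonic() - started
    return {
        "url": url,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "status": status,
        "seconds": round(elapsed, 3),
        "content_sha256": digest,          # a changed hash between passes means the page changed
    }

def run_pass(log_path="monitoring_log.csv"):
    """append one observation per watched url to a cumulative log."""
    rows = [snapshot(url) for url in WATCH_LIST]
    write_header = not os.path.exists(log_path) or os.path.getsize(log_path) == 0
    with open(log_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=rows[0].keys())
        if write_header:
            writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    run_pass()

run repeatedly, for example from a scheduled job, the accumulated log is the sort of time series from which the page-level indicators discussed below could be computed.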
the methods will need to accommodate both content providers who cooperate in the effort, for example by contributing metadata, and content providers who, while not hostile to the idea of monitoring, are not collaborating. the methods will also need to be flexible enough to suit the variety of management requirements of diverse institutions. these monitoring mechanisms should be deployable in a range of systems contexts. for a university research library, that context might be a management system used to collect lists of urls that faculty and librarians have deemed important through some rating scale. the library might then employ the monitoring schemes outlined in the rest of this section as it assumes a role of "managing agent" for those external resources. at the other end of the spectrum, a preservation service might be a program that users could install on their own workstations to monitor web resources of their own choice. this tool could be launched like other utility tools such as a disk defragmenter or an anti-virus scanner. the web resources within an organization's worry radius might be a web site, a subset of resources in a web site, or a single web page or document. furthermore, a web resource might live in an individual's informally managed web page or in an organization's highly controlled web site. defining the boundaries of a web resource for preservation monitoring is not easy. mechanisms for preservation risk management must address four levels of context: a web page as a stand-alone object, ignoring its hyperlinks. a web page in local context, considering the links into it and out from it. a web site as a semantically coherent set of linked web pages. a web site as an entity in a broader technical and organizational context. time is part of each of these contexts. for risk analysis, some threats can be detected from the examination of a single static snapshot of a resource, while other threats become visible through analysis of how the resource changes over time. project prism is concerned with both the snapshot view and the time-elapsed view. for each of the four contexts, we hypothesize appropriate technical approaches for risk detection. by testing these hypotheses we can transform our results into the suite of tools we need. monitoring a web page as a standalone object as a stand-alone object, a web page must be considered without regard to its hyperlinked context. what risk attributes are visible by looking at a single web resource minus its link structure? given a one-time snapshot of a single web page, automated tools can observe these significant features: tidiness of html formatting: just as sloppy work habits reflect badly on an employee, untidy html is a reason for some unease about the management of a web resource. while early versions of html had poorly defined structure, the recent redefinition of html in the context of xml (xhtml) has now formally defined html structure.55 the tidy tool makes it possible to determine how well an html document conforms to this structure, revealing the sophistication and care of the page's manager.56 standards conformance: data format standards, such as the popular jpeg image standard, change over time, sometimes making previous versions unreadable.57 a monitoring mechanism could automatically determine whether a web resource conformed to current standards. conformance to open standards could also be considered.
arguably, web resources formatted according to a nonpublic standard—for example microsoft word documents—may be a greater longevity risk than those formatted to public standards. on the other hand, industry dominance can privilege some proprietary formats over formats that are standard but not widely adopted, e.g., png.58 document structure: like html formatting, a document that manifests good structure, in the manner of a good research paper, may be more dependable than one that consists of text with no apparent order. automated digital libraries such as researchindex have had success with heuristics for deriving structure from pdf, ps, and html documents.59 these techniques could be used to measure the level of structure in a web resource. metadata: the presence or absence of metadata tags conforming to standards such as dublin core may indicate the level of management.60 automatic mechanisms could track the following characteristics over time: http response code: the http protocol defines response codes that indicate transfer error or success.61 an off-site monitor could record the incidence of http response codes over time and certain patterns of codes, such as a high frequency of 404 (page-not-available) codes, could be used to measure risk. response time: a server with widely fluctuating response times or consistently slow response time indicates a higher level of risk than one that is responsive. page changes: for certain types of pages, no changes at all might indicate complete lack of management or maintenance. on the other hand, unpredictable and large changes might indicate chaotic management. pages that change on some predictable schedule with some predictable delta might indicate high-integrity management. monitoring mechanisms that employ copy detection methods or page-similarity metrics would be useful for developing a measurement for page changes over time.62 page relocation: the lack of persistence of urls is a well-known problem. certainly, the disappearance of a selected resource, evidenced by consistent "page-not-found" errors, should be a cause for alarm. techniques such as "robust hyperlinks" might make it possible to track the movement of a resource across the web and use that movement and/or replication to determine risk.63 monitoring a web page in a hyperlinked context the hyperlinked structure of a web page, its in-links and out-links, has been successfully exploited in the development of better web search engines.64 similarly, such "link context," the links out from a page and the links from other pages to that page, may prove useful in deducing longevity risks. using a page snapshot, risks can be detected by analyzing: out-link structure: consider a page that links to a number of pages on the same server, in contrast to another page that either has no out-links or only links to pages on other servers. intuitively, the "intralinked page" may be more integrated into a site and at lower risk. pages with no links at all might be considered highly suspicious, having the appearance of "one-offs" rather than long-term web resources. in-link structure: an equal if not greater indicator of longevity risk is the number of links from other pages to a page and the nature of those links. isolated pages, ones with no in-links, should be highly suspect. ascertaining the absence of in-links in the web context is hard, since it requires crawling the entire web. 
two more tractable and meaningful in-link measurements are: intra-site links—as noted, a page that is integrated into a web site structure seems more trustworthy than one not pointed to by any pages on its site. it is possible to crawl that web site—defined by stripping the page url down to its root dns component—to determine if any page on that site links to the page in question. hub links—kleinberg's hits algorithm describes the method for finding authoritative web resources relative to a specific query.65 the presence or absence of links to a page from one or more of these authoritative web resources might be an indicator of risk. in related work, we are developing methods for classifying web pages automatically in collection categories, each of which is characterized by a set of authoritative pages on the web. we could then initiate a web crawl from these authorities and find direct or "transitive" links to a given page. page provenance: the url of a web page can itself provide metadata about the page's provenance and management structure. the host name often provides useful information on the identity (the "address") of the web server hosting a page, and, less reliably, the name of the institution responsible for publishing the page. a top-level domain name can help classify a publishing organization by type (.edu, .gov, .com). project prism will investigate the correlation between top-level domain name and preservation risks.66 also, the path name may provide clues about organizational subunits that may be responsible for managing a web page or site. in illustration 1, "preservation" in the path name may indicate a department or subunit of cornell university library, although it could also refer to the topic of preservation—either way, it may help establish responsibility for the page. link volatility: once the nature of the links to and from a page is determined, it is useful to compare changes in those links over time. if out-links are added or updated, a page is evidently being maintained and is at reduced risk. a decrease in in-links may indicate approaching isolation and should cause concern. illustration 1: parsing the url monitoring a web site there will be many cases where the unit of preservation is a web site—a coherent collection of interlinked pages rather than a single page. the notion of a web site lacks good formal definition, with just a few ideas on how to define a metadata structure for such an entity.67 mcclure and sprehe define a web site as "a set of uniform resource locators (urls) that fall under a single administrative control."68 for prism, a web site is a set of urls that are syntactically appended to a root url. for example, a root url: http://my.org/document/root.html has linked pages with urls like: http://my.org/document/a.html http://my.org/document/aub/b.html but not like: http://your.org/z.html assessing the longevity risk of a web site will require algorithms for aggregating the risk metrics of its individual pages. additionally, the structure of the site might serve as an indicator of risk. to analyze this structure we can exploit the wealth of work and algorithms on graphs and the characterization of the web as a directed graph.69 in this characterization, resources (documents) at urls are nodes and the hyperlinks from documents at urls to documents at other urls are directed edges in the graph. the organization of a site's internal structure might be appropriate for risk analysis, just as for an individual page. 
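to make the link-context idea concrete, the sketch below extracts the out-links of a single fetched page with python's standard html parser and separates intra-site links (same host as the page) from external ones. the example url is an illustrative assumption, and the sketch is not a tool described by project prism.

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import urllib.request

class LinkCollector(HTMLParser):
    """collect href values from <a> tags while the page is parsed."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def out_links(page_url, html_text):
    """return (intra_site, external) absolute out-links for one page."""
    parser = LinkCollector()
    parser.feed(html_text)
    page_host = urlparse(page_url).netloc
    intra, external = [], []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)       # resolve relative links against the page url
        if urlparse(absolute).scheme not in ("http", "https"):
            continue                             # skip mailto:, javascript:, etc.
        if urlparse(absolute).netloc == page_host:
            intra.append(absolute)
        else:
            external.append(absolute)
    return intra, external

if __name__ == "__main__":
    url = "http://example.org/"                  # illustrative url only
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8", errors="replace")
    intra, external = out_links(url, text)
    print(f"{len(intra)} intra-site out-links, {len(external)} external out-links")

the intra-site out-links gathered this way for every page on a site are exactly the directed edges from which a graph representation of the site's internal structure could be assembled.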
using graph analysis methods to derive cliques or strongly connected components from graph representations of site structure may make it possible to develop a set of patterns that reflect good site management. based on the static analysis of a site's structure, it would then be possible to analyze changes to it over time. how the web site evolves should be considered another indicator of risk. a site where links are added or modified regularly and which conforms to a discernable structure exemplifies good management practices, and thus lower risk. site evolution patterns could be measured through one of the graph similarity algorithms such as editing distance or maximal common subgraphs.70 monitoring a web site in a technical and organizational context a web site is a collection of web pages, but it also resides on a server within an administrative context, all of which may be affected by the external technical, economic, legal, organizational, and cultural environment. identifying, monitoring, and managing the ecology of a web site involves the individual and collective analysis of a number of factors at these different levels—more than just checking for http codes that indicate a page is unavailable or has moved. problems can be caused by server software misconfiguration, bad cables and router failure, denial-of-service attacks, and many other factors. it is entirely possible that the biggest threat to the continued health of a web site has nothing to do with how well the site is maintained or even how often it is backed up, but rather the fact that the backup tapes are stored in the same room as the server and a single catastrophic event (fire, flood, earthquake) could destroy them both. illustration 2: ecology of a web site comprehensive care of a web site has to include: hardware and software environment, including any upgrades to the operating system and web server, the installation of security patches, the removal of insecure services, use of firewalls, etc. administrative procedures, such as contracting with reputable service providers, renewing domain name registration, etc. network configuration and maintenance, including load balancing, traffic management, and usage monitoring. backup and archiving policies and procedures, including the choice of backup media, media replacement interval, number of backups made and storage location. physical location of the server and its vulnerability to fire, flood, earthquake, electric power anomalies, power interruption, temperature fluctuations, theft, and vandalism. some of these environmental factors can be monitored remotely, in tandem with direct monitoring of the web site itself. slowness or unresponsiveness could indicate hardware failure or power interruption, excessive load on the server from legitimate use, web crawling, hacker attack, or a network problem. network utilities such as ping and traceroute can help determine whether the problem is confined to web services, the particular machine, or the larger network. just as dataloggers monitor environmental states in physical libraries and send alerts when an undesirable condition arises, and just as more traditional alarm systems can signal breaches in physical security, specialized software for the web can reveal internal security hazards, such as viruses, trojan horses, outdated software, missing patches, and incorrect configurations. adapting these tools and utilities will add to project prism's preservation risk management toolkit. 
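as a small illustration of how remotely observable symptoms such as slowness or unresponsiveness might be flagged automatically, the sketch below rereads the kind of snapshot log produced by the earlier monitoring sketch and reports urls whose latest pass failed or was markedly slower than their own history. the log format, threshold and file name are illustrative assumptions rather than project prism components.

import csv
from collections import defaultdict
from statistics import median

def flag_anomalies(log_path="monitoring_log.csv", slow_factor=3.0):
    """report urls whose newest observation looks risky relative to their history."""
    history = defaultdict(list)                  # url -> list of observations, oldest first
    with open(log_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            history[row["url"]].append(row)

    flags = []
    for url, rows in history.items():
        latest = rows[-1]
        earlier = [float(r["seconds"]) for r in rows[:-1] if r["status"] == "200"]
        if latest["status"] != "200":
            flags.append((url, f"latest status {latest['status'] or 'no response'}"))
        elif earlier and float(latest["seconds"]) > slow_factor * median(earlier):
            flags.append((url, f"response time {latest['seconds']}s vs median {median(earlier):.3f}s"))
    return flags

if __name__ == "__main__":
    for url, reason in flag_anomalies():
        print(f"attention: {url} ({reason})")

thresholds such as the factor of three used here would in practice be set within each institution's own worry radius.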
assessing the impact of technological watersheds on web site integrity just as some print publications failed to make a successful transition from cold type to hot type, or hot type to completely electronic production, the continuing success of an internet venture depends in part on its ability to adapt to new technologies. technological change always puts an enterprise under some stress, because it interrupts routines, necessitates staff changes or retraining, incurs expense for equipment and supplies, and in some cases may require a complete reconceptualization of the business plan or method of operation. since growth of the internet really took off in the early 90s, the continued robustness of any net-based enterprise has required a significant level of technological flexibility and adaptability. what kinds of technological change place the continued existence of content at greatest risk? to answer, we must first understand how technological change induces risk. several mechanisms can be postulated; their applicability to any particular site depends on the content and its audience. a few examples:

mechanism: failure to maintain up-to-date software operating environment
nature of threat: vulnerability to malicious code, such as viruses, worms, and other hacks
means to detect: examine current status of operating environment
tools to detect: web crawler (partially) and specialized software tools
external discernability: partial or full, depending on consent of site operator

mechanism: failure to upgrade file formats, encoding schemes, etc.
nature of threat: incompatibility with modern software; unreadable content due to obsolescence
means to detect: examine current status of mime types and other attributes
tools to detect: standard web crawler
external discernability: partial

mechanism: failure to use modern tools
nature of threat: competitive disadvantage: less visual appeal, harder to navigate, or less functionality
means to detect: examine current status of mime types and other attributes
tools to detect: standard web crawler
external discernability: partial

table 2: examples of how technological changes can induce risk

predicting what kinds of technological change most seriously threaten content will require retrospective analysis. through the longevity study (http://www.library.cornell.edu/preservation/prism.html) and future crawls of the internet archive, project prism is identifying significant technology watersheds that may put web sites at risk. determining whether a sea change, such as the shift from html to xml, will put much content at risk may be at least partially revealed by examining past shifts of similar magnitude, such as the critical mass shift from the gopher to http protocol. the web crawler and other tools can be used to analyze the use of markup languages, mime types, and other attributes of web pages that reflect evolving standards and practice. certain periods may merit closer scrutiny than others. times of intense and rapid growth generally coincide with greater competition and the need to be more agile and flexible to survive. periods when many new standards and features are introduced would also be expected to involve greater risk to content. the web sites that have been captured in the internet archive provide an ideal set of materials for testing these hypotheses by allowing characterization of the introduction and domination of markup languages and formats, the introduction of various types of dynamic behavior, and changes in the use of header fields and tags. combining the pieces into a program project prism is using the web crawler to study risk factors for web pages and web sites.
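a hedged sketch of one way the retrospective analysis of markup languages and mime types described above could be tallied is given below. it assumes a local directory of page snapshots organised as snapshots/<year>/<file>.html, which is an illustrative layout and not how the internet archive or the prism crawls actually store material.

import os
import re
from collections import Counter, defaultdict

DOCTYPE_RE = re.compile(r"<!doctype\s+([^>]+)>", re.IGNORECASE)

def tally_doctypes(root="snapshots"):
    """count declared doctypes per capture year, assuming snapshots/<year>/<file>.html."""
    per_year = defaultdict(Counter)
    if not os.path.isdir(root):
        return per_year
    for year in sorted(os.listdir(root)):
        year_dir = os.path.join(root, year)
        if not os.path.isdir(year_dir):
            continue
        for name in os.listdir(year_dir):
            if not name.lower().endswith((".html", ".htm")):
                continue
            with open(os.path.join(year_dir, name), encoding="utf-8", errors="replace") as handle:
                head = handle.read(2048)         # a doctype declaration, if any, appears early
            match = DOCTYPE_RE.search(head)
            label = match.group(1).strip().lower() if match else "no doctype"
            per_year[year][label] += 1
    return per_year

if __name__ == "__main__":
    for year, counts in tally_doctypes().items():
        total = sum(counts.values())
        print(year)
        for label, count in counts.most_common():
            print(f"  {label}: {count} ({100 * count / total:.0f}%)")

a sharp year-over-year shift in such a tally, for example pages moving from an html 3.2 doctype to xhtml, is the kind of watershed signal the retrospective analysis described above is looking for.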
at the server level, we are reviewing the kinds of tools that can be developed or adapted to analyze and mitigate potential risks. while an organization may take on the preservation management of its own web sites, project prism is interested in scenarios that must consider two kinds of organizational players, the entities that control the web sites and the entities that are interested in the longevity of those web sites. in the first round, significant factors in the administrative context and external environment are being identified, but in-depth work in these areas will be part of follow-on research. while project prism is currently exploring the passive monitoring of web sites that are not within an organization's control, the team expects to develop a methodology that also allows for other mandates an organization might have to: monitor changes to a web site, which may require negotiated access. recommend modifications to web sites and web pages to enhance longevity, in addition to monitoring. actively enforce policy for a web site and ensure compliance to specified standards, which will require cooperation and collaboration. just as an actuarial assessment changes with the times, with better understanding of life styles, and with medical breakthroughs, so too will the ability to detect potential risks to digital and web-based resources. by developing a flexible and adaptive risk management strategy, project prism adds to our knowledge base and offers a methodology for conceptualizing the problem while research moves forward. acknowledgements we thank liz chapman for her keen editorial skills. the work described in this paper is supported by the digital libraries initiative, phase 2 (grant no. iis-9905955, the prism project). notes and references for this article are provided in a separate list accompanying the original publication.
copyright 2002 anne r. kenney, nancy y. mcgovern, peter botticelli, richard entlich, carl lagoze, and sandra payette (on january 21, 2002 the article was corrected to add acknowledgements.) doi: 10.1045/january2002-kenney

d-lib magazine november/december 2008 volume 14 number 11/12 issn 1082-9873 social annotations in digital library collections   rich gazan department of information and computer sciences, university of hawaii 1680 east-west road, honolulu, hi 96822 abstract in order to incorporate web 2.0 functionality effectively, digital libraries must fundamentally recast users not just as content consumers, but as content creators. this article analyzes the integration of social annotations – uncontrolled user-generated content – into digital collection items. the literature review briefly summarizes the value of annotations and finds that there is conceptual room to include user-generated content in digital libraries, that they have been imagined as forums for social interaction since their inception, and that encouraging a collaborative approach to knowledge discovery and creation might make digital libraries serve as boundary objects that increase participation and engagement.
the results of an ongoing case study of a web 2.0 question and answer site that has made a similar transition from factual to social content are analyzed, and eight decision points for digital libraries to consider when integrating social annotations with digital collection items are proposed. 1. introduction at the beginning of a term, many college students are faced with the choice of whether to buy a new or used textbook. while used textbooks are obviously less costly, they often carry another benefit new textbooks don't: highlights, underscores and other annotations by their previous owners. even though the author of, and rationale for, the annotations may be unknown, the fact that somebody found particular sections of the book important enough to emphasize tends to make the eye linger. ideally, annotations can make learning and knowledge discovery feel less like a solitary pursuit and more like a collaborative effort. from the earliest legal and religious texts, marginal annotations and glosses have informed, challenged and often confused subsequent readers. at first glance, it would seem that the trustworthiness of an unknown individual who has interpreted or appended an author's work would be questionable, but several reasonable assumptions can be made that contribute to the perceived authority of an unknown annotator. at the very least, they read the work and took the time to make the annotations, which may question or clarify certain statements in the text, and create links to other works, authors or ideas. the subsequent reader of an annotated work then has one or more additional perspectives from which to evaluate the usefulness of the text and annotations, and more implied permission to add his or her own interpretations than in an unannotated text. published scholarly works are objects for discussion in an ongoing conversation among a community of knowledge seekers, and whether via formal citation in later publications or annotations in existing ones, all are designed to advance the generation and exchange of ideas. social computing, or web 2.0, operates in much the same way. whether via links, tags, social bookmarks, comments, ratings or other means, providing users the means to create, share and interact around content typifies the web 2.0 approach. most instances of web 2.0 operate from a model of aggregate peer authority. for example, no single expert tags (essentially categorizes) photographs on a site like flickr.com, but tags from an aggregation of non-experts can make a photograph 'findable enough.' similarly, hotel ratings or movie reviews from a large-enough number of non-experts can provide a general sense of quality or trustworthiness. most critically, knowledge discovery and transfer is no longer restricted to a model of one expert creator to many consumers. in web 2.0, consumers are creators, who can add their voices to both expert and non-expert claims. users get the benefit of multiple perspectives and can evaluate claims in the best tradition of participative, critical inquiry. though designed as systems for knowledge discovery, the majority of digital libraries operate from the traditional expert model. subject experts create content, digital library experts provide access to it, and individual users consume it. very few systems have been built with an architecture that encourages users to create content, associate it with collection items, or share their impressions with other users. 
providing digital library users read-access to collections is the traditional finish line. providing them write-access – the ability to append content to that in a digital collection – is something else entirely. usually, we would no sooner invite user alteration of digital collection items than we would distribute crayons with illustrated children's books, but this is the way of the web. it is rare to find an online article, blog or product review that does not now have a space for user ratings, comments or both. in this way, conversations spring up and ideas are exchanged, resulting in an added dimension of engagement with both the text and fellow readers. i use the term 'social annotation' to refer to uncontrolled user comments and interactions around a digital resource, to distinguish it from more formal senses of content annotation. this article argues that integrating web 2.0-type social annotations into digital libraries can serve the larger goals of supporting users' information seeking needs and practices, and encourage increased exploration and engagement. this article addresses what social annotations could look like in digital libraries, with examples drawn from answerbag, a web 2.0 social question and answer site that has confronted some of the same challenges of both encouraging and harnessing uncontrolled content. eight decision points for digital libraries considering social annotations are proposed. 2. background digital libraries are complex sociotechnical artifacts that are much more than searchable electronic collections. even initial definitions in the literature were fairly broad; borgman (1999) bisects the conceptions of digital libraries into those of researchers (content collected on behalf of user communities) and librarians (digital libraries as institutions or services). at that time, digital library literature was understandably concerned with mapping the boundaries of the field, and lesk (1999) identifies an inadequate focus on user needs in digital library research. in the answer garden project, ackerman (1994) called for elements of social interaction to be included in digital libraries, with user-user communication and exploration of collections identified as important components of digital library architecture. ackerman specifically mentioned leveraging the collected wisdom of others, rather than launching a search cold, but it is difficult to find a modern system that embraces this idea. however, the most recent delos digital reference model (delos 2007) adopts a much broader view of digital libraries, one with room for users as both content creators and interactors: "the delos network of excellence on digital libraries now envisions a digital library as a tool at the centre of intellectual activity having no logical, conceptual, physical, temporal, or personal borders or barriers on information. it has moved from a content-centric system that simply organizes and provides access to particular collections of data and information, to a person-centric system that aims to provide interesting, novel, personalized experiences to users. its main role has moved from static storage and retrieval of information to facilitation of communication, collaboration, and other forms of interaction among scientists, researchers, or the general public on themes that are pertinent to the information stored in the digital library." (delos 2007, p. 
14)

the idea of allowing users to annotate digital library collection items is not new, but most previous efforts have been concerned primarily with annotations as supplemental avenues for retrieval (see, for example, frommholz 2006; golovchinsky, price and schilit 1999). however, the digital library for earth system education (dlese) collaborators claim that digital library annotations can "engage the community" by allowing users write-access to the collection, and thus "capture diffuse and ephemeral information" (arko et al. 2006). dlese's focus is on the educational uses of digital libraries, and they used annotations of digital library collection items to capture content in the realm of effective pedagogical strategies surrounding collection items, as well as feedback about the system for iterative evaluation. also, in both dlese and the opendlib system (agosti and ferro 2003), annotations adhere to a formal metadata structure. this contrasts with a web 2.0 approach, where the content and process of user content contributions are much less restricted, but the core issues of increasing user engagement and capturing user-generated content for the benefit of future users are the same.

previous studies of digital collection item annotations have tended to focus on task-based environments such as academic collaborations, where social convention tends to keep annotations formal and content-focused. however, as with annotations in paper books, sometimes the value of an annotation goes beyond its content. marshall (1998) suggests that the very act of evaluating a handwritten annotation's relevance creates a level of critical engagement that would not happen while reading a clean copy of a book. marshall studied university students' annotations in textbooks, and found that students preferred books that had been marked by previous readers, as long as the marks were intelligible. she also found that annotations serve many functions beyond formal analysis of content and concluded that digital library annotation functions should support:

naturally placed annotations, distinguishable from the source item
non-interpretative markings
fluidity of form
informal codings
smooth transitions between public and private annotations
integration with reading

similarly, sherman (2008) studied marginalia in english renaissance texts and found that students of the time were routinely taught that simply reading a book was insufficient. in order to have a "fruitful interaction" (p. 4) with a text, marking it up with one's thoughts and reactions was considered essential. marginalia and other signs of engagement and use – even such apparently content-neutral additions as food stains – sherman sees as valuable evidence of reader reaction, and the place of the physical information object in people's lives. providing users the ability to annotate digital content also creates new streams of naturalistic evaluation data, evidence of engagement stronger than a page view or a link to the collection item from another page. in a study of flickr.com, ames and naaman (2007) created a taxonomy of motivations for annotation along two dimensions: sociality and function. the latter dimension echoes people's motivation to annotate printed textbooks: the function of making important or interesting passages more easily findable for later review.
the sociality dimension is a component of the web infrastructure – making photographs findable for others, and creating shared tagsets for people with similar interests, so they might collaborate more easily. in this sense, photographs are boundary objects (star and griesemer 1989), around which diverse individuals can interact and communities can build (gal, yoo and boland 2006). digital collection items can also be boundary objects, even if those conversations take place asynchronously. can social annotations fit into current digital library architecture? two concept maps in the delos 2007 reference model, in the resource (figure 1) and user (figure 2) domains, suggest that they can. giving users write-access to collections essentially means they would be creating a new resource type, one that need not append content directly to the item record, but may populate a separate table with an associative link. figure 1 shows that according to the delos conceptual model, a resource can have a "hasannotation" relationship with an information object that is conceptually equivalent to other metadata. figure 1: delos digital library resource domain concept map (delos 2007) similarly, the delos 2007 user domain concept map (figure 2) shows that an end-user can have roles of librarian, content consumer or content creator. starting from the more general actor node, there is a direct conceptual relationship with the resource node; the actor "isa" resource, in the same sense as a collection item. figure 2: delos digital library user domain concept map (delos 2007) one promising application of web 2.0-type collaborative annotation of digital collection items can be found in the penntags project (day 2006). designed as both a personal resource and as a toolkit for collaboration around digital objects, penntags allows social bookmarking and annotation of opac items as well as web urls and journal article database items (though the latter are available only within the university of pennsylvania community). this brief review suggests that there is conceptual room for users as both interactors and content creators in digital libraries, and that annotations have been a historically valid form of user-generated content. web 2.0 has provided an infrastructure within which users can participate, and when given the chance, they have done so enthusiastically. this leads to the research question driving this article: can social computing functionality in the form of social annotation translate well to a digital library? to address this question, the results of a long-term participant observation of a web 2.0 social question and answer site are analyzed, resulting in eight decision points that should be considered when deciding how or whether to incorporate social annotation in a digital library environment. 3. research setting and method answerbag (http://www.answerbag.com) is a social question and answer site designed around a "one question – multiple answers" architecture. launched in 2003, answerbag became the author's research testbed in 2004 and is now a thriving web site with over 7 million unique visitors per month. administrator-level access to all site data is available, and research is conducted as a participant observation. users submit and rate questions and answers, and the highest-rated answers are listed first, serving a collaborative filtering function that still allows people to view the full range of answers to any question. 
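the rating-driven ordering described above can be pictured with a small sketch; everything in it is invented for illustration and does not reflect answerbag's actual implementation.

```python
# toy illustration of rating-based ordering: every answer stays visible, but
# aggregate ratings push community-preferred answers to the top of the list.
answers = [
    {"text": "answer a", "ratings": [1, 1, -1]},
    {"text": "answer b", "ratings": [1, 1, 1, 1]},
    {"text": "answer c", "ratings": []},        # unrated answers are still shown, just last
]

def net_rating(answer):
    return sum(answer["ratings"])

for a in sorted(answers, key=net_rating, reverse=True):
    print(net_rating(a), a["text"])
```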
most relevant to this discussion is the answer comment function, which allows users to append uncontrolled content to any answer, essentially a social annotation. as with most digital libraries, answerbag was not initially designed to be a social site. at first, only factual q&a were permitted, and there was no user comment function, only answer ratings to fuel collaborative filtering. however, as so often happens, people began to use the site in ways beyond those the designers had intended. they used answer fields to conduct discussions about the finer points of questions and answers, many of which took the form of opinion. once they had discovered other users with similar interests (or those with divergent views), they also used answer fields to communicate with one another on a purely social level. social content was thought to dilute the value of the site and was removed by site moderators. but soon, moderators were removing more content than they were allowing. as site traffic grew, human review of every piece of content submitted to the site could no longer be maintained, and social questions and answers were allowed by default. when answer comments were introduced soon after, in mid-2005, page views and site traffic nearly doubled in the following four months. the primary goal of answerbag – allowing users to integrate diverse perspectives through multiple, collaboratively rated answers to a single question – was now taking place at the level of individual answers as well. answer comments allowed users to interact freely, build on one another's ideas and link to other content. more importantly, they created an environment where users could see the engagement of others with the question at hand and were encouraged to join the conversation. figure 3 shows an excerpt from a typical answerbag answer comment page. both the high answer rating and the combined effect of other users' annotations add to the perceived authority of the original answer, creating a collaborative, participative response instead of a single person's opinion.

figure 3: excerpt from an answerbag answer comment page

while figure 3 is a rather whimsical example of social annotation, it provides clear evidence of engagement with the content beyond mere page views. registered users now submit as many answer comments – social annotations – as they do answers. applying this web 2.0 model to digital libraries generates a number of design questions discussed in the following section.

4. analysis and discussion

comparing the results of a long-term participant observation of the development and use of answerbag's answer comment function with some previous attempts by digital libraries to include annotations summarized in the literature review reveals eight major decision points:

display
ease of annotation
anonymity
control of content
harvesting annotation content
ease of retrieval
traffic and network effects
notification and sharing

while this list is neither exhaustive nor applicable to every situation, these decision points reflect some of the tensions that can result from applying an information sharing model from one environment to another. it should also be noted that these decision points are interdependent – choices in one area will expand or constrain options in another.

4.1 display

most digital library interfaces have been professionally crafted for optimum usefulness and engagement for their target audience.
the display represents and enhances the collection, the institution(s) providing access, and the individuals responsible for developing and maintaining the resource. recalling marshall (1998), how social annotations are placed and navigated in relation to the associated collection items is critical to identifying and balancing controlled and uncontrolled content. at a minimum, three options should be visible: one to "view x previous annotations," where x is the number of annotations all previous users have associated with the item, another to "add annotation," and a third to "view my annotations." in this way, users must opt in to view annotations, providing a filter for users and designers who may feel uncomfortable with uncontrolled content. however, when answerbag introduced answer comments, it was a new feature, unfamiliar to many longtime users. simply providing a "view comments" link did not immediately result in clickthroughs. defaulting to displaying all comments inline beneath the answer (see figure 3) made users more aware of the comment feature and increased participation, but long comment threads made answer pages extremely long and difficult to scroll through. the compromise solution has been to show only the most recent comments inline, with a "view all comments" link available to access the entire thread (in figure 3, the "view all comments" link has already been clicked). 4.2 ease of annotation annotating digital collection items should be as easy as making marginal comments in a physical book. forcing users to populate multiple fields or to use pull-down menus for controlled data entry risks the reaction that annotating is more effort than it's worth. answerbag invites users to "add a comment" beneath any answer or existing comment thread via a simple button. the button opens a text entry box, positioned such that the answer and previous comments are viewable for context. social annotation in a digital library environment should be equally low-effort. 4.3 anonymity how users are given the option to identify themselves directly influences the quality of content they will contribute. if the use of real names is enforced, many people will be more hesitant to post content. for example, an early version of answerbag was imagined for use in corporate intranets, using an organization's policy manual as a question and answer framework, with staff comments around each answer as a way to capture and share tacit knowledge relevant to each procedure. unsurprisingly, few employees were willing to submit critiques of current policy with their names attached. conversely, a purely anonymous environment removes checks on behavior and can result in everything from unprofessional content to virtual vandalism. in a digital library environment, annotations may or may not need to be associated with a real name, or even a pseudonym, to convey authority. whether names or pseudonyms are made viewable to other users or not, some identifying information must be collected if a "view my annotations" function is desired. from the example in figure 3, answerbag users are identified by pseudonyms of their choosing, and some small measure of accountability is enforced by making their user profiles public. the dlese annotation metadata (arko et al. 2005) includes fields for an annotator's name and/or organizational affiliation ("contributor") and whether or not their contact information was to be displayed in the user interface ("share"). 
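as a purely illustrative sketch, the following shows one way social annotations could be kept in their own store, linked to collection items by an associative key and carrying the kind of identity and display flags discussed above; the table and field names are hypothetical and are not drawn from dlese, penntags, or any particular system.

```python
# hypothetical schema: annotations live in their own table and point to collection
# items by identifier, so the item records themselves are never altered.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table item (item_id text primary key, title text);
create table annotation (
    annotation_id integer primary key,
    item_id text references item(item_id),  -- associative link instead of embedded content
    pseudonym text,                         -- optional identity, per the anonymity decision point
    share_contact integer default 0,        -- display flag, analogous to the "share" field above
    body text,
    flagged integer default 0               -- hook for user flagging / moderation
);
""")
conn.execute("insert into item values ('item-001', 'sample collection item')")
conn.execute(
    "insert into annotation (item_id, pseudonym, body) values (?, ?, ?)",
    ("item-001", "reader42", "this plate closely resembles the one in the 1868 edition."),
)
# a "view previous annotations" query only shows unflagged annotations for the item
for pseudonym, body in conn.execute(
    "select pseudonym, body from annotation where item_id = ? and flagged = 0", ("item-001",)
):
    print(pseudonym, body)
```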
4.4 control of content while allowing users to annotate without restriction can lead to meaningful knowledge discovery, it also invites profanity, chat conversations, spamming, and similar digital graffiti. while a personal or institutional philosophy of radical acceptance may seek to value and preserve all contributions, in most cases some oversight will be preferred. allowing users themselves to perform this function typifies the web 2.0 approach, either by allowing them to rate content on a scale of usefulness (however the rater defines it), or to flag inappropriate content, akin to the "flag this answer" link in figure 3. as arko et al. (2006) report in dlese, any feedback function necessitates regular human review, but the job of overseeing flag submissions allows librarians to see how the collection is being used, which can be a valuable source of naturalistic data. 4.5 harvesting annotation content the study by golovchinsky, price and schilit (1999) is an example of an attempt to use annotations as search queries. annotation content can be appended to, or associated with, a digital collection item record in order to generate additional keyword hits at search time. in some cases, especially where the collection content or audience is highly specialized, annotations may be a source of supplemental content-bearing key terms and thus worthwhile to include. since answerbag has no restrictions on content or audience, and a significant percentage of comments are purely social (e.g., "thanks for your answer"), comments have not been made searchable. 4.6 ease of retrieval people spend time annotating now to save time reviewing later. a digital annotation system should allow users to quickly access and review their previous annotations. the solution used by answerbag is to provide each user a profile page from which their activity can be viewed at any time, by anyone. in a digital library environment this has implications for both privacy and activity tracking that may be at odds with institutional policies, but which can be addressed by adding a permissions layer. 4.7 traffic and network effects a critical mass of traffic and participation is required to make any social computing function useful. for example, in the very early days of the answerbag rating system, most answers on the site were unrated, rendering the collaborative filtering function essentially useless. if a digital library's traffic consists of a few users per year accessing collection items, the social part of the social annotation function will essentially fall away and would resemble annotation for personal use only. interaction will be one-way, from past annotators to future readers, following the used textbook model, with no timely discussion around collection items. this is particularly true in an academic environment, where library usage tends to be linked to coursework and research projects with definite end dates. however, in an academic setting, interactions could be encouraged or orchestrated surrounding an assignment, and possibly integrated with courseware. social annotations could also be a forum for outreach following the book club model; non-traditional audiences could interact around digital collection items in a guided or semi-guided activity. a formal or informal cost-benefit analysis would need to reflect the goals of the institution, and the extent to which traffic reflects success. 
4.8 notification and sharing two of the primary engines of web 2.0 are the ability to create update notifications and to share content across different sites. users can set up rss feeds and receive alerts when certain conditions are met, such as when content they have created draws a response. providing tangible evidence that the effort they took to post was not in vain encourages people to return and continue the conversation. similarly, articles on many sites include a link and icon inviting users to, for example, "digg this article," which registers it on digg.com, a social article popularity ranking site. if digital libraries invited users to share collection item pages via email and other social bookmarking sites, usage would predictably increase – as would instances of people harvesting and repurposing content. in a web 2.0 environment, users expect the freedom to share content across sites and to use it for their own purposes. digital library content would very likely find its way onto myspace pages or youtube videos, in original or customized form. the core issue that needs to be confronted is one of prioritization: maximize usage, or maximize control? though institutional policy and intellectual property rights will dictate that decision for some collection items, a user-centered focus requires that truly open access be considered. 5. conclusion imagining students, researchers and the public interacting around a digital library collection item via social annotations is an attractive idea. an unobtrusive list of social annotations associated with digital library collection items would allow alternative views of digital content, and create a sense of collaborative endeavor. taking advantage of the web 2.0 infrastructure and embracing a philosophy of releasing control of collection items would open the conversation to an even wider audience. however, the technological barriers are not as considerable as the institutional barriers. digital libraries, as institutions, services and as a research field, have created professional standards that have resulted in innovative systems, high expectations of service quality, and codified best practices. it is natural for digital library professionals to believe that we already know what users value. these assumptions need to be continually questioned, especially in light of other information models, like web 2.0, whose sheer popularity necessitates consideration in more traditionally formal information environments. in a review and synthesis of digital library evaluation concepts, saracevic (2000) concludes that digital libraries provide for interaction among people, and that the ultimate question for evaluation of digital libraries is "how are digital libraries transforming research, education, learning and living?" (p. 368). encouraging users to engage as freely as possible with both collection items and one another seems a logical strategy to address this question. just like a dog-eared textbook or a toy-strewn living room, social annotations may make a digital library look messy, but there is value and life in a physical or virtual space that has a lived-in, well-used and well-loved appearance. references ackerman, mark s. (1994). providing social interaction in the digital library. proceedings of digital libraries '94: first annual conference on the theory and practice of digital libraries (college station, tx), 198-200. agosti, maristella and nicola ferro (2003). annotations: enriching a digital library. 
in panos constantopoulos and ingeborg t. sølvberg, editors. research and advanced technology for digital libraries. proceedings of the european conference on digital libraries (ecdl 2003), lecture notes in computer science, heidelberg et al., 2003. springer, 88-100.
ames, morgan and mor naaman (2007). why we tag: motivations for annotation in mobile and online media. proceedings of the sigchi conference on human factors in computing systems, san jose, ca, 971-980.
arko, robert a., kathryn m. ginger, kim a. kastens and john weatherley (2006). using annotations to add value to a digital library for education. d-lib magazine, 12(5).
borgman, christine l. (1999). what are digital libraries? competing visions. information processing and management, 35(3), 227-243.
day, annette (2006). using social bookmarks in an academic setting – penntags.
delos (2007). the delos digital library reference model: foundations for digital libraries, version 0.96, november 2007.
frommholz, ingo (2006). what did the others say? probabilistic indexing and retrieval models in annotation-based discussions. tcdl bulletin, 2(2).
gal, uri, youngjin yoo and richard j. boland, jr. (2004). the dynamics of boundary objects, social infrastructures and social identities. sprouts: working papers on information environments, systems and organizations, 4(4), 193-206.
golovchinsky, gene, morgan n. price and bill n. schilit (1999). from reading to retrieval: freeform ink annotations as queries. proceedings of the 22nd annual international acm sigir conference on research and development in information retrieval, berkeley, ca, 19-25.
lesk, michael (1999). expanding digital library research: media, genre, place and subjects. proceedings of the international symposium on digital libraries 1999: isdl'99, tsukuba, ibaraki, japan, september 1999, 51-57.
marshall, catherine c. (1998). the future of annotation in a digital (paper) world. paper presented at the 35th annual gslis clinic: successes and failures of digital libraries, university of illinois at urbana-champaign, march 24, 1998.
saracevic, tefko (2000). digital library evaluation: toward an evolution of concepts. library trends, 49(3), 350-369.
sherman, william h. (2008). used books: marking readers in renaissance england. philadelphia: university of pennsylvania press.
star, susan leigh and james r. griesemer (1989). institutional ecology, 'translations' and boundary objects: amateurs and professionals in berkeley's museum of vertebrate zoology. social studies of science, 19, 387-420.

copyright © 2008 rich gazan
doi: 10.1045/november2008-gazan

mapaffil: a bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide

d-lib magazine november/december 2015 volume 21, number 11/12

vetle i. torvik
university of illinois at urbana-champaign
vtorvik@illinois.edu
doi: 10.1045/november2015-torvik

abstract

bibliographic records often contain author affiliations as free-form text strings.
ideally one would be able to automatically identify all affiliations referring to any particular country or city such as saint petersburg, russia. that introduces several major linguistic challenges. for example, saint petersburg is ambiguous (it refers to multiple cities worldwide and can be part of a street address) and it has spelling variants (e.g., st. petersburg, sankt-peterburg, and leningrad, ussr). we have designed an algorithm that attempts to solve these types of problems. key components of the algorithm include a set of 24,000 extracted city, state, and country names (and their variants plus geocodes) for candidate look-up, and a set of 1.1 million extracted word n-grams, each pointing to a unique country (or a us state) for disambiguation. when applied to a collection of 12.7 million affiliation strings listed in pubmed, ambiguity remained unresolved for only 0.1%. for the 4.2 million mappings to the usa, 97.7% were complete (included a city), 1.8% included a state but not a city, and 0.4% did not include a state. a random sample of 300 manually inspected cases yielded six incompletes, none incorrect, and one unresolved ambiguity. the remaining 293 (97.7%) cases were unambiguously mapped to the correct cities, better than all of the existing tools tested: gopubmed got 279 (93.0%) and geomaker got 274 (91.3%) while mediameter cliff and google maps did worse. in summary, we find that incorrect assignments and unresolved ambiguities are rare (< 1%). the incompleteness rate is about 2%, mostly due to a lack of information, e.g. the affiliation simply says "university of illinois" which can refer to one of five different campuses. a search interface called mapaffil has been developed at the university of illinois in which the longitude and latitude of the geographical city-center is displayed when a city is identified. this not only helps improve geographic information retrieval but also enables global bibliometric studies of proximity, mobility, and other geo-linked data. keywords: pubmed, medline, digital libraries, bibliographic databases, author affiliations,geographic indexing, place name ambiguity, geoparsing, geocoding, toponym extraction, toponym resolution   introduction while information retrieval systems have become increasingly sophisticated in topic-based searching, other aspects of the bibliographic record have received much less attention. the author affiliation is one such aspect. for example, in medline, the us national library of medicine (nlm)'s premier bibliographic database covering biomedical-related papers published since ~1950, every paper is manually indexed with mesh, their controlled vocabulary, and entrez-pubmed maps user queries into this vocabulary. beginning in 1988, the nlm started systematically indexing author affiliations, and only for the first-listed authors. as a result, it is easy to find papers on a topic like cancer with high precision and recall but it is nearly impossible to come up with a query to capture papers from, say, the united kingdom — out of all the affiliations our algorithm mapped to the united kingdom only 14% explicitly mention "united kingdom" (another 10% mention england, northern ireland, scotland, or wales). 
our motivation for geocoding affiliations in pubmed goes beyond basic information retrieval — it stems from efforts to disambiguate author names (torvik and smalheiser, 2009) and plans to carry out author-centered, bibliometric studies that include dimensions of geographic proximity and movement, and other data that can be linked to geographical locations. the problem addressed in this paper is as follows: given a free-form text string representing an author affiliation, output the name of the corresponding city (or similar locality) and its physical location (the longitude and latitude of its center). if the city cannot be inferred, then output the country, and state (or equivalent subdivisions) when possible. for example, given "mcgill university clinic, royal victoria hospital, montreal", then output "montreal, qc, canada" and its city-center coordinates. it should be noted that affiliation strings have been tagged as such in the xml distribution of medline/pubmed so extracting the affiliation string from a larger body of text is not an issue addressed here. why focus on the city and not on a more precise location such as the street address? our goal is to assign geocodes at a uniform level across a broad spectrum of bibliographic records from across the world, some very old and with limited information. we have estimated that street addresses are present in only ~10% of pubmed records. the city (or a similar locality), we hypothesize, can be inferred from an affiliation string in the great majority of cases. geoparsing refers to the process of extracting toponyms (names of places or geographical entities) from text which are then fed into a geocoder to identify the corresponding physical location on the globe. geoparsing and geocoding are active research areas, and a variety of related tools are available online. gopubmed® (doms & schroeder, 2005) provides faceted searching of pubmed with a focus on topics but also has cities assigned to records, although it is not clear whether their data is made available in bulk. nemo (jonnalagadda et al., 2010) performs clustering in order to disambiguate institution names in pubmed affiliations, an effort that is complementary to ours. geomaker (heilmann, 2009) is open-source and leverages yahoo! placemaker's extensive resources on places, organizations, and zip codes. other tools are open-source but designed for different genres: carmen (dredze et al., 2013) is designed to geocode twitter messages based on content and information about the users, while cliff (bhargava and d'ignazio, 2014) is designed to extract and geocode all mentions of people, places, and organizations from english natural language text. cliff uses a named entity extractor coupled with geonames, a large database of millions of place names but we found that this can introduce unnecessary ambiguities and produce strange results: "abteilung fur allergie und klinische immunologie, kinderklinik, universitat la sapienza, roma" incorrectly mapped to "baden-wurttemberg, germany", while "victoria hospital, london, ont" incorrectly mapped to "london, uk". to be fair, gopubmed got the same result in the latter case, and for the first case, geomaker returned nothing while google maps incorrectly returned a map of "erlangen, germany". these cases suggest that state-of-the-art tools are susceptible to systematic errors, rates of which we will estimate here, and compare to our own approach.   
2 data and methods pubmed, which is the subject of this investigation, is a superset of medline — it covers older papers and out-of-scope journals and has records without mesh but otherwise has metadata similar to medline, including affiliations. as mentioned, the nlm started systematically indexing affiliations of the first-listed authors in 1988. however, not all publishers provide affiliations in the records submitted to the nlm, and their indexing policy has changed over time (for a summary see the medline/pubmed data element descriptions page). as examples: starting in 1995, usa was added to the end of affiliations when deemed appropriate; starting in 1996, email addresses were appended, and in 1999, nlm stopped editing affiliations to "delete street information or redundant data" (nlm tech bull, 1999). in 2013, they stopped efforts to edit and quality control affiliations (nlm tech bull, 2013), and in 2014, moved the affiliation xml node from being linked to a paper to being linked to an author on a paper (nlm tech bull, 2014). at the outset, we find that there is no typical affiliation string in pubmed: the majority are semi-structured (76% contain 3 or more commas, often used to separate department, institution, city, and state/country, in that order); many are non-english (~12% of university mentions are non-english like universitat, universite, universidad, uniwersytet); many are very short (4% have 40 or fewer characters, including punctuation); most are recent but some date all the way back to 1867; many common place names are ambiguous (paris, london, washington, new york, la, cambridge, and boston all are), some more than others (e.g., johnson, union, and university are names of places); all affiliation strings are subject to errors due to the authors, copy-editing, character encoding, transliteration, and the indexing practices at the nlm. our approach is to take the affiliation at face value. that is, we do not use any external information attached to (or inferred from) the bibliographic record like the journal's country of publication, or other papers by the same author. however, this information could be used as a further step to help resolve remaining ambiguities, or infer a city when none is found. although the final product is an entirely computational approach to mapping affiliation strings to a city, the design process necessitated significant manual effort. several aspects of the algorithm, including the following two tasks, were refined after processing the entire collection of pubmed affiliations multiple times. task 1. constructing a dictionary of city names, including known variant names, historical names, and misspelled variants, and their geocodes. first a list of country names (and variants) and us states was constructed by studying the ending of all affiliations in the collection. google maps was used as a first pass on chunks extracted from affiliations that followed a certain structure that included the name of a country after the final comma, where the preceding two chunks, separated by commas, were submitted together with the country name as input to the google maps api. the two preceding chunks were used because many countries have a hierarchical structure much like the us: city, state, country. as a result of this process, city names that never appeared in affiliations with this structure were not recorded during the first pass. 
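the comma-based structure exploited in this first pass might be sketched roughly as follows; the tiny country set stands in for the full dictionary of country names and variants, and the returned string is what would be handed to a geocoder such as the google maps api.

```python
# illustrative sketch of the first-pass heuristic: if the final comma-separated
# chunk is a known country name, treat the two preceding chunks plus the country
# as the candidate "city, state, country" query to hand to a geocoder.
COUNTRIES = {"usa", "canada", "japan", "germany"}  # placeholder for the real country/variant list

def first_pass_candidate(affiliation):
    chunks = [c.strip().rstrip(".").lower() for c in affiliation.split(",") if c.strip()]
    if len(chunks) >= 3 and chunks[-1] in COUNTRIES:
        city, state, country = chunks[-3], chunks[-2], chunks[-1]
        return f"{city}, {state}, {country}"  # string that would be sent to the geocoding api
    return None  # affiliation does not follow the expected structure; skip in this pass

# hypothetical example affiliation, not taken from pubmed
print(first_pass_candidate("department of pathology, mcgill university, montreal, quebec, canada"))
```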
as the algorithm and dictionary were iteratively refined, n-grams separated by commas in affiliations that were not assigned a city were collected and ranked by frequency, and then manually inspected in order to identify names of the most common cities missing from the dictionary. when google maps was unable to find the city, other resources were used on a case-by-case basis. importing all the records of a large-scale global resource of place names, like geonames, was considered but excluded in order to limit the overall ambiguity. task 2. constructing a dictionary of word n-grams that (almost) uniquely point to a country (or us state). all affiliation strings that were assigned to exactly one country were lowercased and all punctuation except space was removed. all 1-, 2-, 3-, and 4-grams that appeared on at least 3 different records were collected, and further filtered by restricting to n-grams that were 99% correlated with one specific country. for the usa, this process was repeated for its states and territories. this produced a total of 1.1m n-grams that almost exclusively point to a country, and when the country is the usa, can point to a us state or territory. for example, the 2-gram "iii friedrich" points to germany. this list helps not only remove ambiguity in city names but also permits assigning an affiliation to a country when no place name is mentioned. keep in mind that it is possible that a particular affiliation contains n-grams that point to multiple countries, particularly long unusual affiliations, but, as we shall see, it is rather rare that this phenomenon co-occurs with an otherwise unresolved ambiguity. also, shorter affiliations are less likely to contain an n-gram from the dictionary, and as such are harder to disambiguate. it should also be noted that the n-gram dictionary is not the only manner in which the list of candidate places is refined, and ambiguity in place names is not the only phenomenon that creates multiple candidate places. assuming that two preceding dictionaries are in place, we can now describe the mapping algorithm. what follows is a brief outline because of space limitations but further details are available upon request. the first step involves pre-processing, chunking, and filtering the affiliation string, hoping that one or more of the chunks contain exact place names. a few of the highlights include converting all utf-8 and html to ascii, converting affiliations with all capital letters to first cap words, expanding some pairs of parentheses, introducing commas in strategic places into affiliations with no punctuation, collapsing chunks across commas when the resulting chunk leads to a valid place name, removing text that looks like a long narrative, extracting hand-coded patterns of country-specific zip codes, email addresses, urls, phone numbers, and street addresses. once the pre-processing is finished, chunks of words that appear between commas are scanned for exact place names and placed on a high priority candidate list. a separate candidate list of lower priority is made up of place names that are a partial match within the chunks. these two candidate lists are then aligned with the countries and us states inferred from the word n-gram dictionary, zip code pattern, and email address in order to resolve part-of relations and prioritize the candidates. 
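the n-gram filtering of task 2 can be approximated by a sketch along these lines; the 3-record minimum and the 99% threshold come from the description above, while the function names, the input format, and the punctuation handling are assumptions made for illustration.

```python
# sketch of task 2: keep word n-grams that occur on at least 3 records and
# point to a single country at least 99% of the time, yielding entries of the
# kind described in the text (e.g., "iii friedrich" -> germany).
from collections import Counter, defaultdict
from itertools import islice

def ngrams(words, n):
    # consecutive n-word windows over a list of words
    return zip(*(islice(words, i, None) for i in range(n)))

def build_pointer_dictionary(records, min_records=3, purity=0.99):
    """records: iterable of (affiliation_text, country) pairs, one country per record."""
    by_gram = defaultdict(Counter)
    for text, country in records:
        # lowercase and drop punctuation, keeping only letters, digits, and spaces
        words = "".join(ch if ch.isalnum() or ch == " " else " " for ch in text.lower()).split()
        grams = set()
        for n in (1, 2, 3, 4):
            grams.update(" ".join(g) for g in ngrams(words, n))
        for gram in grams:            # count records, not occurrences within a record
            by_gram[gram][country] += 1
    pointers = {}
    for gram, counts in by_gram.items():
        total = sum(counts.values())
        top_country, top_count = counts.most_common(1)[0]
        if total >= min_records and top_count / total >= purity:
            pointers[gram] = top_country
    return pointers
```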
candidates that appear further to the right in the affiliation are given higher weight, unless they are country names, as are the candidates on the exact match list compared to the partial match list. the final component of the overall algorithm is a short list of manually hard-coded rules that override some of the assignments made by this automatic process. these include cases of extreme ambiguity and ambiguities that are hard to resolve otherwise, such as "university, ms, usa" and "ibaraki prefecture, japan" vs. "ibaraki, osaka, japan", and rules that avoid mapping "harvard university" or "harvard medical school" to "harvard, ma, usa" unless the affiliation explicitly says so. figure 1 provides a short list of non-trivial examples and their final successful assignments. figure 2 shows the web-interface in use. note the information sparsity in earlier records compared to more recent ones.

figure 1: a list of non-trivial affiliation strings with mapaffil output shown in red.

figure 2: screenshots of the mapaffil web-interface to pubmed records using publication year as input (top figure shows 1942; bottom figure shows 2010). all fields are searchable — the affiliation field has been text-indexed using sphinx for mysql. records include links to pubmed (via pmid), google maps (via geocodes for cities), and a summary of the 2010 us census data (via fips code of the county that includes the geocode). columns are included for institution type and note whether ambiguity was unresolved or not.

3 results

the algorithm was implemented using perl because of extensive use of regular expressions. the implementation has not been optimized for speed but was fast enough to process 12.7 million affiliations in less than a week using a 32-core server. table 1 shows a summary of the countries found in the collection of pubmed papers processed. note that the bulk of the records start in 1988 (when the nlm started indexing affiliations in medline) but go back as far as 1867, partly because pubmed central is included in pubmed. the usa is by far the most frequent overall but is not as dominant in recent years.

table 1: worldwide distribution of 12.7m pubmed papers.

table 2 shows the results of head-to-head comparisons between mapaffil and four other tools: gopubmed, geomaker, google maps, and cliff. these experiments were carried out using the respective web-based interfaces during a period of several days in may 2015. a link to github suggested that cliff version 2.1.1 was running on the back-end. a strict definition of correct, unambiguous city was used. for example, inferring london, uk from "department of agricultural sciences, imperial college london, wye tn25 5ah, uk" was judged incorrect even though the correct location "wye, ashford, kent, uk" is near "london, uk". however, inferring an alternative name for the correct city was judged correct, as was inferring a more precise location, such as a district or suburb within the correct city. failure to resolve trivial part-of relations, as was often the case for cliff and geomaker, was judged correct instead of ambiguous. for example, it was judged correct when geomaker mapped "division of cell biology, netherlands cancer institute, 1066 cx amsterdam, the netherlands" to both "amsterdam, north holland, nl" and "netherlands".

table 2: estimated performance rates based on a random sample of 300 affiliations. a smaller random subset of cases was deemed sufficient for estimating performance of google maps and cliff because their errors were not rare.
outcome                   mapaffil     gopubmed     geomaker     google maps   cliff
correct unambiguous city  293 (97.7%)  279 (93.0%)  274 (91.3%)  86 (65.2%)    77 (58.3%)
incorrect                 0            6            19           12            4
ambiguous                 1            0            0*           0*            5
none                      1            2            5            33            10
state                     4            12           2            0             9
country                   1            1            0            0             26
total                     300          300          300          132           132

*note that geomaker and google maps had no ambiguous mappings by our design — the top-ranked result was taken for each query; otherwise the majority of their results would be judged ambiguous.

gopubmed represents an approach tailored specifically to pubmed affiliations — each pubmed identifier (pmid) was entered in their faceted interface and the mapped city looked up in their "locations" category. this does not explicitly give a longitude-latitude pair but rather a point on a small map and the name of the location, which was used for these comparisons. after mapaffil, gopubmed had the strongest performance: 93.0% of our test cases were correctly and unambiguously mapped to a city, compared to 97.7% for mapaffil. the other tools had worse performances, which reflect generic efforts that have not been tailored to the specific genre analyzed here — the author affiliations listed in pubmed. most of mapaffil's incomplete mappings were due to incomplete information available in the affiliation: "department of emergency medicine" produced no output in all tools except google maps, which mapped it to honolulu, hi, usa because of the present author's prior search history. here are some other incomplete examples: "department of laboratory medicine, mcmaster medical unit, ontario, canada.", "department of pediatrics, university of kentucky, usa." some of the cases that gopubmed got wrong or incomplete include "school of pharmacy, wingate university, wingate, nc, usa.", which it mapped to nc, usa. furthermore, "halsooch sjukvaardsnamndens forvaltning, stockholms lans landsting." refers to stockholm, sweden but was mapped to lens, france; "japan science and technology agency, ishikawa, 923-1211, japan." refers to nomi city, ishikawa prefecture, japan but was mapped to ishikawa city, okinawa prefecture, japan. google maps got both of these right, while mapaffil got the first one right and the second ambiguous (it identified both ishikawa, japan and ishikawa, okinawa, japan), while cliff returned nothing for the first one and just japan for the second one. all geocoders were fed unedited affiliation strings. google maps and cliff could have performed better with some tweaking. for example, google maps tends to get overwhelmed and return "we could not find..." when given too much highly specific information, such as an email address and the name of a department within an institution. however, setting aside the 33 cases that returned "we could not find", google maps still produces a high rate of incorrect mappings (12/(132-33) = 12.1%) because it appears to put more weight on names of institutions than names of places. cliff often removed names of organizations and people from the list of candidate places (e.g., ann arbor mapped to a person so was excluded as a city). a little tweaking and pre-processing of the input given to both tools could improve their performance dramatically. geomaker uses information that is similar to that of google maps (names of institutions, places, and zip codes), except from a different source (yahoo! placemaker), and it refines the input/output. however, there was one case that cliff got complete and correct (mapped to lake worth, fl, usa) while few of the others did:
"kathleen d. schaum, ms, is president and founder of kathleen d. schaum&associates, inc, lake worth, florida. ms schaum can be reached for questions and consultations by calling 561-964-2470 or through her e-mail address: kathleendschaum@bellsouth.net. submit your questions for payment strategies by mail to kathleen d. schaum, ms, 6491 rock creek dr, lake worth, fl 33467. information regarding payment is provided as a courtesy to our readers, but does not guarantee that payment will be received. providers are responsible for case-by-case documentation and justification of medical necessity." google maps timed out, gopubmed returned as sanamayn, daraa, syria, while mapaffil said usa because it filters out chunks of text that appear to be regular sentences.

when applied to a collection of 12.7 million affiliation strings listed in pubmed, ambiguity remained unresolved for only 0.1%. for the 4.2 million mappings to the usa, 97.7% were complete (included a city), 1.8% included a state but not a city, and 0.4% did not include a state. figure 3 shows the rates of unresolved ambiguity and incompleteness over time. ambiguity has been very low since ~1980 but we see significant ambiguity in earlier papers. this is a reflection of how affiliations were written in earlier days. figure 2 shows that affiliations from the 1940s are very short, sometimes even just listing the name of a city, compared to the longer ones of today that include departments, institutions, street addresses, cities, states, countries, zip codes, emails, and so on. we also observe that the incompleteness rate has been slightly but steadily increasing over time since 1980. this probably reflects an increasingly diverse set of affiliations. we also found about 40k affiliations that only listed an email address, and email addresses in affiliations have generally been on the rise.

figure 3: unresolved ambiguity and incompleteness over time.

affiliation types were captured using simple regular expressions and placed into 8 different categories: educational, hospital, educational-hospital, organization, commercial, government, military, unknown. first the affiliation was matched against edu or hos, or both. if neither matched, then one other category was matched if possible. org represents a generic research organization, and includes national institutes/laboratories/centers, associations, etc. gov includes institutions like local health departments but not national institutes, hospitals, or educational institutions. figure 4 shows the prevalence of the different kinds of institutions over time in the dataset. the two dominant categories are educational institutions and hospitals. we have performed preliminary experiments on large collections of principal investigators and their affiliations listed in nih and nsf grants, as well as inventors' addresses on uspto patents. nih and nsf are also dominated by education (and hospitals for nih). the patent genre is quite different. inventors often do not have an institutional affiliation, and their home addresses are listed, and the assignees are most often commercial entities. this makes the set of locations much more diverse. even so, mapaffil presently covers greater than 90% of these records. we expect some of the more generic tools tested in our experiments to have higher coverage for uspto inventor addresses but have not tested this yet.

figure 4: affiliation types over time.

4 conclusion

as mentioned earlier, the current algorithm is the result of several iterations of refinement.
at this point the accuracy of the algorithm has plateaued, in the sense that major new components are necessary to significantly improve performance. adding a thousand new (rare) cities to the locations dictionary would have little effect on overall performance. we find that incorrect assignments and unresolved ambiguities are rare (< 1%). the incompleteness rate is about 2%, mostly due to a lack of information. in order to improve completeness in these cases, one could include information external to the affiliation field, such as other papers by the same author, or construct a list of institutions that can be unambiguously mapped to one location. this information can be used as a further step to help remove ambiguity or to infer a city when only a country is given. nevertheless, the current performance is much higher than that of the other tools and should enable new types of global bibliometric studies on geographical proximity and geo-linked data. as examples, we are presently studying the impact of local demographics on the diversity of co-authorships and topics in biomedical science, and building models of collaborative behavior where geographical proximity is one of several important explanatory variables. mapaffil is available from http://abel.lis.illinois.edu/cgi-bin/mapaffil/search.pl; the full pubmed affiliation dataset and batch processing is available upon request.

acknowledgements

research reported in this publication was supported in part by the national institute on aging of the us nih grant p01ag039347 and the directorate for education & human resources of the us nsf award 1348742. the content is solely the responsibility of the authors and does not necessarily represent the official views of the nih or the nsf.

references

[1] jonnalagadda sr, topham p. 2010. nemo: extraction and normalization of organization names from pubmed affiliations. j biomed discov collab. 2010 oct 4;5:50-75.
[2] french jc, powell al, schulman e. 2000. using clustering strategies for creating authority files. j. am. soc. inf. sci., 51: 774-786. http://doi.org/10.1002/(sici)1097-4571(2000)51:8<774::aid-asi90>3.0.co;2-p
[3] torvik vi, smalheiser nr. 2009. author name disambiguation in medline. acm tkdd 3(3): 11. http://doi.org/10.1145/1552303.1552304
[4] dredze m, paul mj, bergsma s, tran h. 2013. carmen: a twitter geolocation system with applications to public health. aaai workshop on expanding the boundaries of health informatics using ai, bellevue, wa.
[5] doms a, schroeder m. 2005. gopubmed: exploring pubmed with the gene ontology. nucleic acids research 33 (web server issue): w783-w786.
[6] gopubmed by transinsight gmbh.
[7] zhang w, gelernter j. 2014. geocoding location expressions in twitter messages: a preference learning method. journal of spatial information science. 9: 37-70.
[8] leidner jl. 2007. toponym resolution in text. phd thesis, university of edinburgh, uk.
[9] heilmann c. 2009. geomaker.
[10] bhargava r, d'ignazio c. 2014. cliff mediameter. mit center for civic media.
[11] geonames.
[12] nlm tech bull. 1999 nov-dec;(311).
[13] nlm tech bull. 2013 sep-oct;(394):b4.
[14] nlm tech bull. 2014 nov-dec;(401):e5.
[15] medline/pubmed data element (field) descriptions.

about the author

vetle i. torvik is an assistant professor in the graduate school of library and information science at the university of illinois at urbana-champaign, where he teaches courses on text and data mining, statistical modeling, informetrics, and information processing.
his current research addresses problems related to scientific discovery and collaboration using complex models and large-scale bibliographic databases. he has built a suite of bibliographic data mining tools and datasets that are available from abel.   copyright © 2015 vetle i. torvik ipres 2006 conference report: digital preservation takes off in the e-environment search   |   back issues   |   author index   |   title index   |   contents d-lib magazine december 2006 volume 12 number 12 issn 1082-9873 ipres 2006 conference report digital preservation takes off in the e-environment   cindy boeke the american society for cell biology as governments, state archives and libraries struggle to preserve and provide access to digitized government records and cultural heritage-related resources, the need for long-term digital storage is finally beginning to receive the attention it deserves. in today's internet-based environment, a variety of digital storage projects are moving from the theoretical realm into reality. as digital archives become more ubiquitous, the digital preservation profession is developing methods to certify and standardize their quality. at the same time, the archival community continues to struggle with finding acceptable formats for long-term preservation and obtaining ongoing funding for the exploding number of existing and proposed projects. these themes were discussed at ipres '06 (http://ipres.library.cornell.edu/program.html), which brought together 220 digital preservation specialists, records managers, institutional repository developers, archivists, and librarians from around the world. the event, which took place at cornell university on october 9-11, 2006, featured high-level speakers from prominent digital archives, governments, publishers, and libraries. ipres '06 was sponsored by the cornell university library, a long-time leader in the field of digital preservation and digital library development (http://library.cornell.edu/about/digital.html). cornell also is co-developer with the university of virginia of the fedora open-source digital object repository system, and home to the renowned cornell lab of ornithology's macaulay library of digital wildlife recordings (see appendix). the conference received support from jstor, microsoft, oclc and sun microsystems. merging the role of library and archives: the canadian model there still is a divide between the library and preservation fields, with libraries leaning towards collection development and providing access, and archivists focusing on long-term preservation. in many respects, however, canada is setting a new model for libraries and archives to work together, rather than in their traditionally separated roles. two years ago, canada's national library and archives became one organization, a fusion that ian wilson, librarian and archivist of canada, believes meets the needs of canada's government and public. "the public does not care about silos and professional differences between librarians and archivists," he said, adding that information professionals must tap into popular interest and not just "talk among each other." wilson also discussed library and archives canada's (http://www.collectionscanada.ca/index-e.html) twin goals to preserve canada's heritage and public record. he cited the 1990s as being the "worst documented decade in [canadian] history," due to the transition between a paperand a digital-based records system. 
now, he said, digital is no longer viewed as an ancillary means of preservation: digital technologies form the basis of how library and archives canada builds, acquires, preserves and provides access to collections. on the downside, he highlighted the huge future expenditures that will be required to maintain and keep online large-scale digital projects. preservation technologies archivists continue to consider which digital formats to adopt for preservation. on the one hand, dvds or web repositories with compressed files are not considered robust enough for preservation purposes. at the same time, they are useful formats and platforms to provide broad access to materials and collections to users. however, with more and more items being born digital, this issue will become less and less important, in the sense that the digital format of objects will be embedded in their creation. as the world continues its rapid transition from analog to digital, archivists in memory institutions still grapple with the dilemma of reformatting analog source material. in the specialized area of audio-visual collections, there is a growing acceptance of digital files as preservation copies, said carl fleischhauer, project coordinator, office of strategic initiatives, library of congress. this practice is reasonably well established for sound recordings but still emergent for moving image content. for example, many preservation projects use wave or broadcast wave as a target format for audio, but there is not yet a consensus for video. in the face of this uncertainty, many video reformatting projects adopt a "hybrid" approach. the preservation master is recorded onto conventional videotape, e.g., digibeta, while lossy compressed computer files, e.g., mpeg-2 or -4, are made for day-to-day viewing. meanwhile, fleischhauer noted, record companies, television broadcasters, and theatrical film producers have embraced digital technologies for production, although archivists in the commercial sector are troubled about how best to manage all of this new digital content. fleischhauer also highlighted the file-storage dilemma faced by archives that cannot afford large-scale server systems. these organizations often use cds and dvds for content storage, in spite of their awareness of short media life expectancies. this practice, fleischhauer said, highlights the need for consortial storage arrangements or the provision of trustworthy and affordable third-party storage services, one of the topics being explored by the national digital information infrastructure and preservation program at the library of congress. quantifying the quality of an archive there is a growing movement underway to develop metrics that measure the quality or "trustworthiness" of an archive. proponents of certification believe the process will create more standardized, reliable and credible archives that better meet the long-term needs of libraries, governments, and user groups. susan dobratz, head of the joint electronic publishing group of the computer and media services at the university library of humboldt-university berlin, discussed germany's nestor (network of expertise in long-term storage of digital resources) project (http://www.langzeitarchivierung.de/index.php?newlang=eng) that aims to set up "criteria for trusted digital repositories [and] recommendations for certification procedures of digital repositories." as more archives come online, she said that "users want guidelines and coaching on how to set up a trustworthy archive." 
robin dale, program officer, oclc office of programs and research, discussed the work of the rlg-nara digital repository certification task force, as well as the center for research libraries auditing and certification of digital archives project (http://www.crl.edu/content.asp?l1=13&l2=58&l3=142) funded by the andrew w. mellon foundation. dale said the audit process in many ways is more important than the actual certification, since it allows archives developers to analyze and respond to their archives' strengths and weaknesses in a systematic fashion. leveraging the audit checklist developed by rlg and nara, the crl project conducted several pilot audits, informing further checklist development. many preservationists in the audience were familiar with the rlg-nara audit metrics, having used them as a "checklist" during the creation of their archives (http://www.rlg.org/en/pdfs/rlgnara-repositorieschecklist.pdf). growing number of projects many impressive projects, including e-journal and international repositories, were profiled at ipres '06. their scope, technological development, and business models demonstrate a growing sophistication. libraries continue to rely on e-journals provided by subscription from publishers. as they discard paper holdings, long-term access to back-issues in electronic format has become crucial. as publishers change their holdings, subscription terms, and business models, libraries are often left in the dark as to whether complete archives of journals will be available in the future. to meet this need, the number of e-journal preservation archives is growing. such repositories are generally dark (or not accessible), provide back-up to publishers in the event they cannot or do not publish e-journals, and are supported by grants, governments and/or institutional membership. several e-journal archives were profiled at ipres, including: portico (http://www.portico.org/), funded by the andrew w. mellon foundation, ithaka, the library of congress, and jstor; the national library of the netherlands' (koninklijke bibliotheek, kb) safe places network and e-depot (http://www.kb.nl/dnp/e-depot/e-depot-en.html) preservation system; and lockss ("lots of copies keep stuff safe"), based at stanford university (http://www.lockss.org/lockss/home). the large number of international projects presented at ipres makes it clear that many countries are developing impressive digital archives, including switzerland, the netherlands, the united kingdom, and china. often repositories are developed by national libraries or large universities, although each country is using a different model for financial support and technical format, and much can be learned by examining these projects. details on these and many other projects and digital preservation developments can be found in the ipres speaker presentations (see http://ipres.library.cornell.edu/program.html). setting the stage for ipres '07 in many respects, the digital preservation profession continues to face the age-old archival dilemma. although many projects are becoming more high-profile, developers continue to struggle to obtain ongoing funding and find it difficult to create a commercially successful business model. like digital libraries in general, most digital archives are supported by grants, direct contributions from parent organizations, or government subsidies. 
the financial picture is further complicated when the projects receive one-time funding to establish themselves, and then must find additional support on an ad-hoc basis. at ipres '06, however, there was a buzz of excitement, as the digital side of the preservation field is becoming more necessary in today's e-environment. governments, national libraries, universities, and commercial enterprises not only recognize, but support, endeavors to preserve documents and resources for future generations. concurrently, many archives have achieved or are building greater levels of sophistication, standardization and technical complexity into their projects. "having 200 plus folks show up in centrally isolated upstate new york for a conference solely devoted to digital preservation is a tribute to how far we've come in the past several years," said anne r. kenney, senior associate university librarian for public services and assessment at cornell. "i think we can look to more conferences on the topic that present cutting edge initiatives and more opportunities for birds of a feather to get together to discuss particular issues in digital preservation." no doubt ipres '07 will provide the setting to discuss digital preservationists' challenges and successes in the e-era. appendix: cornell lab of ornithology's macaulay library one of the world's most unusual digital archives is housed at cornell: the cornell lab of ornithology's macaulay library (http://www.birds.cornell.edu/macaulaylibrary/about/index.html). the macaulay library is an impressive facility, with technical resources and capabilities rarely seen in the digital library and archival environment. the macaulay library contains 160,000 recordings and 3,000 videos of birds and animals from around the world, including the sounds of many species that are now extinct. forty people work at the macaulay library, including eight archivists, who manage, annotate, and preserve recordings in myriad formats dating back to the 1930s. the library depends on the work of scientists, many of them graduate students, who record animals in the field. to standardize metadata, the library developed free software that requires necessary fields to be filled out, such as time, type of equipment, etc., to better ensure the scientific quality of the data pertaining to each recording. the macaulay library, however, is much more than a digital repository. it is home to a vast array of production and post-production studios and equipment, climate controlled archives, and technical staff who are constantly improving the quality and distribution of its holdings. for example, explains media integration developer guillaume iacino, the web repository will soon be reconverted from a real-based platform to mpeg-4 to allow more options for users to playback the recordings. the number of hd videos is growing, and soon the library will unveil an image collection. with sustainability in mind, the archive has made a conscientious effort to diversify its income stream – selling sounds in a business-like fashion while remaining dedicated to providing free or low-cost access for education and research purposes. the library has many innovative ideas for revenue generation, from selling stuffed animals with embedded sounds of the actual animals to promoting the sales of licenses to commercial companies. the library, explains mary guthrie, production manager, operates on a cost-recovery basis. 
research and education uses of the recordings are subsidized, and the library helps recoup its costs through institutional and commercial licensing, along with a variety of commercial ventures and partnerships, from toys to endowments. the library is committed to expanding the reach of the archive, says greg budney, interim director of macaulay and curator of the audio collection. "we want to become readily available to everyone," he says, highlighting the need to reach developing regions in particular. "for example, we want conservation forest managers in ghana to access the sounds and recordings of wildlife from ghana." as the organization continues to grow, he emphasizes, the archives will continue to serve in their primary role as a research collection for education and conservation. copyright © 2006 cindy boeke top | contents search | author index | title index | back issues previous conference report | in brief home | e-mail the editor d-lib magazine access terms and conditions doi:10.1045/december2006-boeke   report on the 2nd international workshop on historical document imaging and processing (hip'13) search d-lib:   home | about d-lib | current issue | archive | indexes | calendar | author guidelines | subscribe | contact d-lib   d-lib magazine march/april 2014 volume 20, number 3/4 table of contents   report on the 2nd international workshop on historical document imaging and processing (hip'13) george v. landon eastern kentucky university george.landon@eku.edu doi:10.1045/march2014-landon   printer-friendly version   abstract the second international workshop on historical document imaging and processing (hip'13) was held august 24, 2013 in washington dc, usa, in conjunction with the international conference on document analysis and recognition (icdar 2013). the workshop, which brought together an international group of researchers who work with historical documents, was complementary and synergistic to the work in analysis and recognition featured in the main icdar sessions. technical areas covered in the workshop included information extraction and retrieval; reconstruction and degradation; text and image recognition and segmentation; and layout analysis and databases. the researchers, many with computer and engineering backgrounds, shared their ongoing work in building tools and methods to handle the digitization of historical documents.   introduction great strides in digitizing and indexing the world's physical documents have been made in recent years. in parallel, preservation and access of digital-born materials are getting the academic focus necessary to handle coming centuries of collections. however, as digital repositories become the primary source for future scholars, we are in danger of limiting access to pieces that can be digitized and indexed using currently available technology. this has motivated groups of scholars around the world to find new methods to digitize and index documents that are currently inaccessible. starting in 2011, researchers that had been working and presenting papers on novel methods to scan, index, and provide access to historical documents, held the first workshop on the topic. the first international workshop on historical document imaging and processing (hip'11) was held in conjunction with the international conference on document analysis and recognition (icdar 2011). this past year, the second international workshop on historical document imaging and processing (hip'13) was held in conjunction with icdar 2013. 
at hip'13, researchers from around the globe met in washington dc in august 2013 to discuss their work toward building tools and methods to handle digitizing historical documents. the majority of the participants came from computer science and electrical engineering. as those in the digital libraries field already understand, the open problems in digital libraries are also open problems in these other domains. for researchers coming from computer science and engineering, work with historical documents has the added benefit of making culturally significant information accessible.   opportunities abound the more we digitize documents, the more we realize that our current software and methods are unable to cope with the vast array of historical documentation stored in the world's memory institutions. archivists and libraries at institutions previously and/or currently involved in digitization projects can readily point to cases where current tools fail. this was highlighted by a pre-workshop tour. before the workshop began, participants were invited to tour the national archives in washington, dc. the tour highlighted numerous successful projects to digitize historical documents while also giving guests additional examples of where digitization is not currently possible. workshop attendees toured the national archives and were given an up-close demonstration of specialized digitization equipment. the workshop began the next day, and the session titles themselves provide a glimpse of the breadth of the issues affecting digitization attempts for historical documents. "information extraction and retrieval", "reconstruction and degradation", "text and image recognition", and "segmentation, layout analysis and databases" are all critical areas that need improvements when handling historical documents.   workshop technical program   information extraction and retrieval extracting information from historical documents for follow-on retrieval remains an area of active research. in the first session, new methods for word spotting, populating ontologies, and feature detection were presented. this session included the paper "contextual word spotting in historical manuscripts using markov logic networks" by david fernández, simone marinai, josep llados, and alicia fornés, which won the international association for pattern recognition (iapr) best paper award for the workshop. (the full proceedings are available from the acm digital library.) authors of "contextual word spotting in historical manuscripts using markov logic networks" receive the iapr best paper award.   reconstruction and degradation we are currently seeing bias in many digitization projects that exclude fragile or damaged documents. this is certainly not due to a lack of importance of these documents, but to the limitations of currently available digitization technologies. researchers recognize these gaps in digitization technology and are working to develop new methods designed for specific classes of fragile historical documents. there are still many document types that are difficult or impossible to digitize. novel methods to address digitizing deteriorated negatives, excessively curved pages from bound documents, warped pages, and broken wooden documents were all presented.   text and image recognition as more and more documents are being digitized, performing recognition across very large collections is becoming a necessity.
papers in this session presented methods that improved recognition of handwritten text in historical documents, improved ocr, and even modeled and compared art styles across renaissance face portraits to identify unknown artists.   segmentation, layout analysis and databases large collections need additional metadata to aid in indexing and retrieval; however, this information is difficult to extract manually for most documents, and this is especially true for historical documents. in the last session, two unique techniques were presented for automatic segmentation of text and drawings within digitized images of historical documents. another technique addressed the segmentation of individual japanese characters. all of these solutions are made more difficult by intrinsic characteristics of historical documents such as handwritten text or wood block printing. the variations in all of these documents were highlighted by the advanced metadata introduced and implemented in the european union's impact repository.   global participation all aspects of historical document processing are certainly global issues. memory institutions across the world are working to find ways to digitize historical documents or at least preserve them until digitization is possible. the global nature of this research is highlighted by the diverse attendance at hip'13. there were 70 attendees, up from 58 attendees at hip'11. thirty-one papers were submitted covering all areas of historical document processing. each paper had three reviewers, and 18 of the 31 papers were accepted, for a 58% acceptance rate (an improvement from the 71% acceptance rate at hip'11). attendees by country: canada 3, china 2, france 7, germany 5, greece 2, ireland 1, israel 4, italy 1, japan 6, qatar 1, russian federation 2, spain 3, sweden 1, switzerland 7, united kingdom 5, united states 15.   hip'13 competition familysearch international (fsi) hosted a workshop-affiliated competition to extract information from a large number of handwritten mexican marriage records. participants were asked to group a scrambled collection of these records by the contents of certain sub-regions of the document. these sub-regions contained geographic and chronological information. competition participants were evaluated based on correct classification of these images relative to ground truth.   the next hip workshop new unsolved problems seem to arise every day when dealing with digitizing historical documents. we expect to see many novel methods to handle these problems in future hip sessions. there is also a strong interest in increasing participation by librarians and archivists. currently, hip'15 is being planned to occur with icdar 2015 in tunis, tunisia, september 26-30, 2015. please consider submitting your research for presentation, or just attending to interact with other researchers working at the intersection of historic documents and digital libraries.   references 2nd international workshop on historical document imaging and processing (hip'13), washington, dc, usa, august 24, 2013 (website). acm digital library. (2013). "proceedings of the 2nd international workshop on historical document imaging and processing". acm, new york.   about the author george v. landon is an associate professor in computer science at eastern kentucky university. he received a ph.d. in computer science from the university of kentucky.
his research focus is in computer vision and image processing with particular applications in the digital humanities. he is particularly interested in developing new methods to virtually restore documents and photographs.   copyright © 2014 george v. landon digital libraries: challenges and influential work search   |   back issues   |   author index   |   title index   |   contents d-lib magazine july/august 2005 volume 11 number 7/8 issn 1082-9873 digital libraries challenges and influential work   william h. mischo university of illinois at urbana-champaign introduction as information professionals, we live in very interesting times. effective search and discovery over open and hidden digital resources on the internet remains a problematic and challenging task. the difficulties are exacerbated by today's greatly distributed scholarly information landscape. this distributed information environment is populated by silos of: full-text repositories maintained by commercial and professional society publishers; preprint servers and open archive initiative (oai) provider sites; specialized abstracting and indexing (a & i) services; publisher and vendor vertical portals; local, regional, and national online catalogs; web search and metasearch engines; local e-resource registries and digital content databases; campus institutional repository systems; and learning management systems. figure 1. federated search diagram. for years, information providers have focused on developing mechanisms to transform the myriad distributed digital collections into true "digital libraries" with the essential services that are required to make these digital libraries useful to and productive for users. as lynch and others have pointed out, there is a huge difference between providing access to discrete sets of digital collections and providing digital library services (lynch, 2002). to address these concerns, information providers have designed enhanced gateway and navigation services on the interface side and also introduced federation mechanisms to assist users through the distributed, heterogeneous information environment. the mantra has been: aggregate, virtually collocate, and federate. the goal of seamless federation across distributed, heterogeneous resources remains the holy grail of digital library work. background work as we look back over the evolution of digital library technologies and reflect on how we got where we are, it is important to consider the contributions of federally supported projects and also to note the provenance of other emerging information technologies. federal programmatic support for digital library research was formulated in a series of community-based planning workshops sponsored by the national science foundation (nsf) in 1993-1994. the first significant federal investment in digital library research came in 1994 with the funding of six projects under the auspices of the digital libraries initiative (now called dli-1) program. these dli-1 research and development projects were jointly funded by a federation comprised of the national science foundation (nsf), the national aeronautics and space administration (nasa), and the defense advanced research projects agency (darpa). (griffin, 2000). 
in 1998, at the cessation of the dli-1 program, federal funding for the dli-2 program was instituted with support from nsf, nasa, darpa, the national library of medicine (nlm), the library of congress (lc), the federal bureau of investigation (fbi), and the national endowment for the humanities (neh). also, in 1998, the corporation for national research initiatives (cnri), under darpa support, funded the three-year d-lib test suite program, which provided continuing funding for several of the digital library testbeds created under dli-1. in aggregate, between 1994 and 1999, a total of $68 million in federal research grants was awarded under dli-1 and dli-2 (fox, 1999). following on two prototype projects awarded under dli-2, the national science, technology, engineering, and mathematics digital library (nsdl) program began its first formal funding cycle during fiscal year 2000 and awarded 119 grants between fy 2002 and fy 2004. dli-1 funded six university-led projects to develop and implement computing and networking technologies that could make large-scale electronic test collections accessible and interoperable. the projects employed multi-departmental teams in tandem with commercial vendors or software companies to push the envelope of digital library research. the six funded institutions, with their primary areas of focus, were: the university of michigan for research on agent technology and mechanisms for improving secondary education; stanford university for the investigation of interoperability among heterogeneous digital libraries and the exploration of distributed object technology; the university of california-berkeley for imaging technologies, government environmental information resources, and database technologies; the university of california-santa barbara for the alexandria project to develop gis (geographical information systems) and earth modeling distributed libraries; carnegie mellon university for the study of integrated speech, image, video, and language understanding software under its informedia system; and the university of illinois at urbana-champaign for the development of document representation, processing, indexing, search and discovery, and delivery and rendering protocols for full-text physics, computer science, and engineering journals. in the illinois project, the overarching focus of the testbed component was on the deployment and evaluation of journals in a marked-up format within an internet environment. probably the most significant contribution of the illinois project was the transfer of technology to our publishing partners and other publishers (mischo and cole, 2000). it has now become commonplace for both major and small-scale publishers to provide web-based access to their full-text journal issues and articles. to illustrate this, the tdnet e-resource manager presently lists over 80k unique online full-text journal titles. it is astounding how far online journals have come in a very short time. many of these publishers support feature sets that closely follow those that were originally developed within the illinois testbed project. these include: full-text display using html and cascading style sheets (css), internal linking between citations and footnotes, forward and backward links to cited articles using dois and openurls, and, less commonly, the display of complex mathematics and special unicode characters directly in the html or xml full text. when dli-1 began, the world wide web was in a very nascent stage.
at that time, the university of illinois' national center for supercomputing applications (ncsa) mosaic 2.0 beta was the web browser of choice, the html 2.0 standard was still under development, netscape had yet to release its first web browser, and microsoft windows 3.1 was the standard personal computer operating system. in the area of focus of our illinois grant, the few full-text journals at that time were primarily comprised of static, proprietary files in the form of bit-mapped images. all of these 1994 emergent digital library technologies remain relevant and vital today. the dli projects took an "over-the-horizon" view that contributed greatly to advancing the state-of-the-art in their selected areas. just as earlier federal grant work led to the development of the internet and the web browser – albeit as offshoots from the primary work that was funded – the dli programs contributed to the development of best practices and standards for digital library work. some of the work led to significant technology transfer and spinoffs (e.g., google grew out of research performed under the stanford dli-1 project). an international collaboration by cornell and the uk eprint project, under dli-2, contributed to the development and adaptation of the open archives initiative for metadata harvesting (oai-pmh) specifications and protocols. overall, the dli-1 projects served to identify and define important document and data metadata standards, protocols for web-based access, and the issues surrounding federated and broadcast search protocols. another interesting effect of the increased federal funding for digital library research was that senior scholars from other computing disciplines were brought into the field (lesk, 1999). whether digital library work will continue to be interesting to the computer science community at large is an open question. computer and networking technology has changed dramatically over the last 10 years, and nowhere is this more evident than in the rapidly evolving world of digital library implementation and practice. a large number of significant digital library standards and technologies have been developed by entities outside of the federally funded projects. a sampling of these includes: publishers (professional societies such as aip and commercial publishers such as elsevier); publisher consortiums (crossref); bibliographic utilities (oclc's work with the dublin core and resource description framework (rdf) standards); w3c (xml, xslt); iso (sgml); academic consortia such as the digital library federation and the association for research libraries (oai, the arl portal); niso (openurl and the metasearch initiative); the library of congress (zing, sru/srw, mets); library integrated system vendors (ex libris with sfx, endeavor link finder plus); metasearch vendors (webfeat, muse global, ex libris metalib, endeavor encompass); web search engines (google scholar and google print, yahoo); computer companies (hp and dspace, oracle and microsoft sql utilizing xml); and the open source community (dspace, greenstone, fedora, sakai). also, government-funded projects outside the u.s. have made major contributions, including jisc (joint information systems committee) in the uk, the enrichuk project, and the minerva eeurope knowledge base. federation solutions several of the dli-1 and dli-2 projects examined issues connected with federation.
in retrospect, the dli projects did not have the requisite technology to federate and integrate what was then an expanding universe of discrete, distributed information resources. the development of mechanisms for distributed search is an area of focus within the nsdl core integration project. the illinois, stanford, and michigan dli-1 projects all addressed issues connected with search interoperability and federated searching. it is interesting to contrast these interoperability approaches. the stanford model employed a shared information bus that utilized metadata schemes and search threading operating across heterogeneous information resources in a stateless web environment. the michigan view was built around a collection of collaborating software agents that tied together a set of servers spread throughout the internet. the illinois approach utilized broadcast, asynchronous searching of distributed, heterogeneous repositories. these approaches presaged the development of web services architectures. of course, these technologies remain relevant and help to delineate the issues surrounding federation. there has been a surge of interest in metasearch or federated search technologies by vendors, information content providers, and portal developers. these metasearch systems employ aggregated search (collocating content within one search engine) or broadcast searching against remote resources as mechanisms for distributed resource retrieval. google, google scholar and oai search services typify the aggregated or harvested approach. vendor systems such as ex libris metalib, endeavor encompass, and webfeat utilize broadcast search approaches. another one-stop shopping aggregated approach is represented by elsevier's comprehensive (in the sciences) scopus a & i service. it is interesting that google scholar is being held up as the competition for both campus institutional repository systems (at least in terms of search and discovery) and academic library federated searching. a related or connected issue is the debate surrounding the efficacy of metadata searching vs. full-text searching. this issue pits google-type web search against oai harvested search and discovery. one promising supplement to oai item-level metadata is collection-level metadata search. in fact, the aggregated and broadcast approaches can be complementary and synergistic. however, to facilitate this, broadcast searching needs to become standardized. the niso metasearch initiative (http://www.niso.org/committees/metasearch-info.html) seeks to develop industry standards for one-search access to multiple resources that will allow libraries to offer portal environments for library users offering the same easy searching found in web-based services like google. the niso metasearch initiative is proposing a standard built around a streamlined sru/srw protocol with rest (representational state transfer) queries that return xml. so, we return to the fact that we have, in our armamentarium, some powerful tools for resource access. in the next several years, we will extend the current standards and practices to offer more effective and efficient access to the distributed information environment. references edward a. fox. 1999. "digital libraries initiative: update and discussion" bulletin of the american society for information science, 26(1):7-11. october/november. stephen m. griffin. 1998. "nsf/darpa/nasa digital libraries initiative: a program manager's perspective." d-lib magazine july/august. doi:10.1045/july98-griffin. 
accessed april 15, 2003. michael lesk. 1999. "perspectives on dli-2: growing the field." bulletin of the american society for information science, 26(1):12-13. october/november. clifford lynch. 2002. "digital collections, digital libraries and the digitization of cultural heritage information" first monday, 7(5). may 2002. william h. mischo and timothy w. cole. 2000. "processing and access issues for full-text journals." successes and failures of digital libraries: papers presented at the 35th annual clinic on library applications of data processing. eds. susan harum and michael twidale. march 22-24. copyright © 2005 william h. mischo doi:10.1045/july2005-mischo   the dataverse network: an open-source application for sharing, discovering and preserving data   d-lib magazine january/february 2011 volume 17, number 1/2 table of contents   the dataverse network®: an open-source application for sharing, discovering and preserving data mercè crosas institute for quantitative social science, harvard university mcrosas@hmdc.harvard.edu doi:10.1045/january2011-crosas   abstract the dataverse network is an open-source application for publishing, referencing, extracting and analyzing research data. the main goal of the dataverse network is to solve the problems of data sharing through building technologies that enable institutions to reduce the burden for researchers and data publishers, and incentivize them to share their data. by installing dataverse network software, an institution is able to host multiple individual virtual archives, called "dataverses", for scholars, research groups, or journals, providing a data publication framework that supports author recognition, persistent citation, data discovery and preservation. dataverses require no hardware or software costs, nor maintenance or backups by the data owner, but still enable all web visibility and credit to devolve to the data owner.   background traditional approaches to storing and sharing data sets in social science have been either inadequate or unattractive to researchers, resulting in only a few scientists sharing their research data. most professional archives, although often considered the most reliable solution, do not usually facilitate control and ownership of the data by the author. once the author submits the data, the archive becomes fully responsible for the data management, cataloging and future updates. while this can be advantageous for some researchers, many prefer to maintain control of their data and to receive increased recognition. consequently, a researcher will often choose either to offer her data only through her own website, or, more commonly, to simply not share her data at all. neither choice provides adequate provisions for future preservation or a persistent identifier and access mechanism. journals and grant funding agencies are starting to require a data management and sharing plan for data authors. (see, for example, http://www.nsf.gov/news/news_summ.jsp?cntn_id=116928.) it is therefore becoming more important than ever to have a solution that satisfies these requirements while proving beneficial to data owners.
the dataverse network (dvn) project provides an open source solution to this data sharing problem through technology (king, 2007, http://thedata.org). it offers a central repository infrastructure with support for professional archival services, including backups, recovery, and standards-based persistent identifiers, data fixity, metadata, conversion and preservation. at the same time, it offers distributed ownership for data authors through virtual web archives named "dataverses." each dataverse provides scholarly citation, custom branding, data discovery, control over updates, and terms of access and use, all through a user-friendly interface. this combination of open source, centralized standards-based archiving and distributed control and recognition makes the dvn unique across data sharing solutions. figure 1. the dataverse network hosts multiple dataverses, each one being a virtual web archive for storing and sharing research data. the data are organized in studies, which contain cataloging information (metadata), data files and complementary files. the studies can in turn be grouped into collections. the dvn software development started in 2006 at the institute for quantitative social science (iqss) at harvard university. it benefited considerably from its predecessor, the virtual data center (altman et al., 2001). since its release in 2007, the dvn hosted at iqss has hosted hundreds of virtual archives, and, in partnership with the data-pass alliance (altman, et. al, 2009), provides access to more than 37,000 studies (each one a separate research unit with its own description, documentation and data sets) and all together constituting more than 600,000 files (on the order of 1 to 2 tb in total) distributed across hundreds of dataverses. in addition, a growing number of institutions around the world host full dvn's federated with each other. in the next sections, the solutions that dvn offers for a successful and comprehensive data sharing and preservation system are presented. from incentives for data sharing to interoperability with other systems, the features described below are important for supporting data authors, journals, archivists, researchers and students with the increasingly needed, highly desirable, research data sharing initiative.   incentives for data sharing: recognition, visibility, ownership the success of data sharing can be enhanced by incentivizing authors with increased scholarly recognition and visibility. scholarly recognition is traditionally achieved through citations. however, referencing the original publication is insufficient in this case. the data themselves need a formal citation to allow a persistent reference from the original publication or other publications. data citation facilitates replication of the original research findings while providing deserved credit to data collectors and authors1 . a dataverse fills this need by automatically generating a persistent data citation upon creation of a study. details on the data citation standards are discussed in the next section. a dataverse can also be branded like, or embedded within, the author's website. as a result, the data appear listed in the author's site, together with all the functionality offered by the dataverse. authors instantly gain visibility for their work, without having to be concerned about the issues of preservation or archiving, both of which are quietly handled by the centralized dvn repository. 
in addition to recognition and visibility, authors want to maintain a level of ownership of the data they have produced. with the dataverse, even though the data reside in a central repository, authors have wide control over updates, descriptive information and restrictions for their data sets.   persistent citation for data the data citation generated by a dataverse includes: author(s), distribution date, title, global persistent identifier and url, universal numerical fingerprint (unf), and optional fields such as the distributor, data subsets, and versions. it follows the data citation standard as defined by altman and king (2007): verba, sidney; lehman schlozman, kay; brady, henry e.; nie, norman, 1996, "american citizen participation study, 1990", http://hdl.handle.net/1902.2/6635 unf:3:agyty1ubirxftnpzbexcda== inter-university consortium for political and social research [distributor] v1 [version] in this data citation, the main additions to traditional citations are the persistent identifier and the unf. the persistent identifier is critical for citing a digital object. this is what makes the reference permanent and independent; even if the digital object moves from one web site to another, the reference will link to an active web site. the dvn uses the handle.net® services supported by cnri, and its global handle registry™, to register the identifiers under a given naming authority handle (prefix). in combination with the local handle service, the handle.net services ensure that the prefix and identifier assigned to each study always resolve to a working url for the study. the pricing structure for issuing handles is based on the prefix. since all studies in a dvn are often assigned to one single prefix, an unlimited number of handles can be registered for a fixed yearly fee associated with that given prefix. since digital object identifiers (doi® names) are based on handles, it is feasible to extend the dvn software to also support dois as the persistent identifiers in the data citation. once the pricing structure for issuing dois at a variety of scales has evolved, one can easily imagine supporting some collections using handles and some collections using dois within a dvn. a dataverse assigns persistent identifiers to each study. since a study might include one or more data sets, each persistent identifier does not necessarily correspond to a single data set. however, a unique unf — the other part of the citation particular to a digital object — is generated for each data set in the study. the unf in the study citation is an aggregation of all the unfs for all the data sets in the given study. a data citation can also be provided for a subset of the data, and in that case, a new unf is generated which represents only that part of the data. the unf is generated based on the content of the data sets and not the format, and therefore does not change even if the format of the data changes. thus, a unf verifies that the data have not changed independent of the file format, guaranteeing that researchers are using the same data originally created by the data owner (altman, 2008). unfs are described further in the next section. the main objectives of the data citation are to give credit to the data authors, as well as to the data distributors when applicable; to strengthen the link between published results and the evidence base supporting them; and to provide a way to reference the data in perpetuity.
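to make the structure of such a citation concrete, here is a minimal sketch that assembles the elements listed above into a single string; the function name and formatting details are assumptions for illustration, not part of the dvn software, and the field values are taken from the example citation above:

```python
def format_data_citation(authors, year, title, persistent_url, unf,
                         distributor=None, version=None):
    """assemble an altman-king style data citation from its parts (illustrative sketch only)."""
    parts = ["; ".join(authors) + ",", f"{year},", f'"{title}",', persistent_url, unf]
    if distributor:
        parts.append(f"{distributor} [distributor]")
    if version:
        parts.append(f"{version} [version]")
    return " ".join(parts)

print(format_data_citation(
    authors=["verba, sidney", "lehman schlozman, kay", "brady, henry e.", "nie, norman"],
    year=1996,
    title="american citizen participation study, 1990",
    persistent_url="http://hdl.handle.net/1902.2/6635",
    unf="unf:3:agyty1ubirxftnpzbexcda==",
    distributor="inter-university consortium for political and social research",
    version="v1",
))
```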
the cited data can correspond to a published article, to a pending publication, or to a published data set on its own, without an associated article. by generating the data citation automatically when a study is created, even before it is released to the public, the author can already provide a persistent reference to the data in an article for publication, and then release or "publish" the data once the article is in circulation. this offers a convenient solution to journals for linking research findings in an article to the underlying data when the data is not yet made public or released. finally, if the research study is updated in the future, the version in the citation is incremented automatically. a citation without any versioning will resolve to the most recent study version. however, it is always possible to track back to older versions within a study, and even to verify whether the data set has changed by comparing the unf of a previous version to the latest unf.   format conversion and fixity data sets uploaded to a dataverse go through the following steps: (1) the variable metadata (names, labels, data types) in the data set are separated from the primary data; (2) summary statistics are generated for each variable; (3) the data file is reformatted to a preservation format independent of any specific software package; and (4) a cryptographic algorithm is applied to the canonical format to get an alphanumeric string (the unf) based on the semantic content of the data set. this data processing is supported for spss, stata and csv (with control cards for the variable metadata) file formats. the original file is also saved unaltered, and the steps are repeatable if needed in the future. the benefits of processing and reformatting the data files are obvious and critical to any data repository that guarantees permanence and replication. statistical software packages used to create data sets today might not be available in the future. data needs to be stored in a preservation format, and to be easily convertible to upgraded formats used by researchers now and in the future. the unf helps verify permanently that the data are fixed and unchanged from the data originally used by the author.   restricted access as a data sharing system, the dvn encourages authors and distributors to make their data public. the software, however, supports click-through terms of use and restrictions for those cases in which the author wants to limit the use of or access to the data. there are three levels of access control: a public study with terms of use. in this case the descriptive information for the study can be viewed without any conditions, but the data files can only be accessed after the user agrees to the terms of use assigned to that study. a study with restricted file(s). in this case the descriptive information remains public, but one or more data files are restricted. only password-authorized users are allowed to view and download the restricted files. a restricted study. when the entire study is restricted, the metadata are still searchable, making part of the description discoverable, but access is not allowed to either the full cataloging information or the data files. users may request access to a restricted file or a full study, which can subsequently be granted by the dataverse owner, who is usually the author or distributor of the data. authorization to a set of studies can also be given based on a range of ip addresses associated with a university or other institution.
ultimately, however, data authors have control over who can access their data. this is again a way to provide distributed ownership to the author, while the data are secured, preserved and accessible when needed through the dvn. for restricted data files, there is an added advantage of using the unf in the data citation. users can validate that the study held by the archive contains the same data originally cited by the author, even in those cases when the data cannot be accessed.   protection for confidential data even though authors can choose to restrict a data set, access control is not sufficient to protect some levels of high-risk confidential information. the dvn approach to this issue is to provide additional protection through click-through terms of use. upon creation of a study in a dataverse, authors can be required to agree not to deposit data collected without institutional review board (irb) approval or not to deposit data containing confidential information. on the data user side, upon accessing a data set, a user can be required to agree not to link multiple data sets or additional information about the data that would risk identification of individuals or organizations. the language in the terms of use can be modified for the entire dvn or for each dataverse based on the requirements of the institution, data owners or publishers. for extremely sensitive studies, one can include the metadata of the study in the dvn with the full citation, including the unf, while storing the data itself securely offline. a more desirable future solution is to more explicitly support the storage and sharing of data sets with sensitive information since research data increasingly contain some form of confidential information, but would nonetheless be extremely useful for research and the advance of science. this can be achieved with tools that allow systematic analysis of disclosure on deposit; that allow selected files to be stored in separate repositories that provide security controls; and that support statistical disclosure limitation methods for data release and for online analysis. this problem is challenging and the subject of broad ongoing research and development.   data discovery a data-sharing framework would not be complete without useful data discovery functionality. a dataverse provides the capability to browse and search studies within that dataverse or across the entire dataverse network. all metadata fields describing the study and the data are indexed allowing advanced field-based searches. the metadata include roughly one hundred fields including cataloging information, such as title, author(s), abstract, date, geospatial information, and metadata fields describing the variables in a data set, such as name and labels. an application like the dvn not only allows researchers to find and access large data sets available in known archives, but more interestingly, it helps them find and easily access small data sets from other researchers that would otherwise sit in local computers with the risk of being lost.   subsetting and analysis data subsets and analysis might not be critical to a data-sharing framework but they do likely facilitate data usage and give incentives to data owners by enriching their offerings when using a dataverse. by processing data sets, standardizing formats and extracting variable information, the dvn software makes it possible to offer additional data services to users. 
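as a rough illustration of the kind of variable extraction, summary statistics and subsetting that this processing enables, here is a minimal sketch over a plain csv file; the file name and functions are assumptions for illustration only, and the dvn itself processes spss, stata and csv uploads and runs its statistical models through zelig in r, as described below:

```python
import csv
import statistics

def load_table(path):
    """read a csv file and return (variable names, column values keyed by name)."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    names, data = rows[0], rows[1:]
    columns = {name: [row[i] for row in data] for i, name in enumerate(names)}
    return names, columns

def summarize(column):
    """summary statistics for one variable: numeric if possible, otherwise value frequencies."""
    numeric = []
    for value in column:
        if value == "":
            continue
        try:
            numeric.append(float(value))
        except ValueError:
            numeric = []
            break
    if numeric:
        return {"n": len(numeric), "mean": statistics.mean(numeric),
                "min": min(numeric), "max": max(numeric)}
    freq = {}
    for value in column:
        freq[value] = freq.get(value, 0) + 1
    return {"n": len(column), "top values": sorted(freq, key=freq.get, reverse=True)[:5]}

def subset(columns, keep):
    """extract only the requested variables, as in a dataverse-style subset."""
    return {name: columns[name] for name in keep}

names, columns = load_table("study.csv")   # hypothetical input file
for name in names:
    print(name, summarize(columns[name]))
print(subset(columns, keep=names[:2]))     # keep only the first two variables
```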
processed data sets can be subset and analyzed on the fly through a dataverse interface. currently there are two supported data types that offer these additional services: tabular data sets and social network data (fung et al., 2010). 1. tabular data sets: data files with rows and columns (uploaded in spss, stata or csv file formats, as mentioned above) can be subset so that a user can extract only some of the variables. also, users can perform recodes, get summary statistics in a numerical or graphical form, and run statistical models for selected variables. the statistical models are powered by zelig (imai et al., 2008), an r-based framework that standardizes the development and use of almost any statistical method contributed to the popular r statistical language. the dataverse makes use of these methods and runs them in an r server to provide real time analysis of the data without having to download the entire data set. 2. social network data: social network data (data that describes a network of entities and relationships) can be uploaded in graphml (brandes et al., 2002, 2010) format to provide additional services available to data users. for example, one can download a subset of the graph and run network measures such as pagerank and degree. subsets can be based on either a manual query or a pre-defined query — such as extracting the neighborhood for a subset of data. the network data files are stored in a neo4j database, which facilitates querying complex and large networks. in the case of networks, it becomes more critical to provide this type of functionality given that large graph data sets are complex in nature and hard to understand without first applying minimal manipulation and extraction of a subset. extraction, analysis and (soon-to-be-provided) visualization tools enhance the use of shared data, making them more understandable and manageable even when data users are not familiar with the line of research of the original data.   easy to use and maintain all the features and benefits discussed above would be for naught without a user-friendly, low-maintenance interface. with a dataverse, data owners can administer all the settings and manage studies through a web interface. creating a personal web data archive does not require any software installation or special request, since a dataverse can be created through a web form from a dataverse network. it is then hosted, maintained and upgraded by the network. this model is becoming increasingly common as applications are switching to services on the web. (such applications are often referred to as one form of cloud applications or "software as a service".)   interoperability and standards finally, although data owners and distributors should not be concerned about the system's architecture, a good data-sharing framework must interoperate with other systems and support standards used by other repositories. the dvn imports and exports its study metadata in various xml formats — data documentation initiative (ddi, version 2), dublin core, fgdc, and marc. it supports the open archives initiative protocol for metadata harvesting (oai-pmh) to harvest metadata from one system to another, and interfaces with the lockss (lots of copies keep stuff safe) system (reich and rosenthal, 2001), which is used by many digital libraries, to preserve and back up the data by copying them in multiple locations. an individual dataverse can be lockss-enabled so all the owned records are also copied to additional specified locations. figure 2.
a dvn is federated with other dvns or other repositories through oai-pmh. metadata are exchanged using metadata schema standards such as ddi, dublin core and fgdc. data in a dvn are copied to other locations using lockss preservation software and the oai protocol. support of standards and broadly-used protocols allows a dvn not only to federate with other dvn installations, but also to share data with repositories that use a different technology. this project is funded by a number of sources. preservation of data deposited to the iqss dvn is funded by the murray research archive endowment. continued development and support of the software are funded by the generosity of various universities, including harvard university. additional funding comes through grants to develop new features and to expand the software to fully support data beyond the social sciences. the dvn software is open source, and the project therefore encourages contributions from the software development community and any interested organizations.   summary the momentum behind data sharing is evident and growing. with journals and grant funding agencies starting to require easy access to the data used in research studies, and with more data available for research every day, technology solutions for data sharing are critical. yet authors need incentives to participate if we want to make data sharing successful. through data citation, web visibility and ease of use, the dataverse network enables data authors to gain recognition and maintain ownership of their data while addressing their data archival concerns through management by a centralized, standards-based repository. the features presented here make the dataverse an attractive solution for publishing data, and satisfy the requirements set by funding agencies and journals for managing and sharing data.   note 1 studies show that data sharing can provide additional scholarly recognition by not only increasing citations to the data, but also increasing citations to the original publication for those research studies that make their data easily accessible (piwowar et al. 2007).   references [1] altman, micah. 2008. "a fingerprint method for verification of scientific data", in advances in systems, computing sciences and software engineering (proceedings of the international conference on systems, computing sciences and software engineering 2007), springer verlag. [2] altman, micah; andreev, leonid; diggory, mark; krot, michael; king, gary; kiskis, daniel; sone, akio; and verba, sidney. 2001. "a digital library for the dissemination and replication of quantitative social science research", social science computer review 19(4): 458-71. [3] altman, micah and king, gary. 2007. "a proposed standard for the scholarly citation of quantitative data", d-lib magazine 13(3/4). doi:10.1045/march2007-altman. [4] altman, m., adams, m., crabtree, j., donakowski, d., maynard, m., pienta, a., & young, c. 2009. "digital preservation through archival collaboration: the data preservation alliance for the social sciences", the american archivist 72(1): 169-182. [5] brandes, u., eiglsperger, m., herman, i., himsolt, m., and marshall, m.s. 2002. "graphml progress report: structural layer proposal", proc. 9th intl. symp. graph drawing (gd '01), lncs 2265, pp. 501-512. © springer-verlag. [6] brandes, u., eiglsperger, m., and lerner, j. 2010. "graphml primer". http://graphml.graphdrawing.org/primer/graphml-primer.html. [7] fung, b., wang, k., chen, r., yu, p.
2010. "privacy-preserving data publishing: a survey of recent developments", acm computing surveys, volume 42, issue 4. doi:10.1145/1749603.1749605. [8] king, gary. 2007. "an introduction to the dataverse network as an infrastructure for data sharing", sociological methods & research, 36(2): 173-199. [9] imai, kosuke, king, gary and lau, olivia. 2008. "toward a common framework for statistical analysis and development", journal of computational and graphical statistics, vol. 17, no. 4 (december), pp. 892-913. [10] piwowar, h.a., day, r.s., fridsma, d.b. 2007. "sharing detailed research data is associated with increased citation rate", plos one 2(3): e308. doi:10.1371/journal.pone.0000308. [11] reich, vicky and rosenthal, david s.h. 2001. "lockss: a permanent web publishing and access system", d-lib magazine, june 2001, volume 7, number 6. doi:10.1045/june2001-reich.   about the author mercè crosas is the director of product development at the institute for quantitative social science (iqss) at harvard university. dr. crosas first joined iqss in 2004 (then referred to as the harvard-mit data center) as manager of the dataverse network project. the product development team at iqss now includes the dataverse network project, the murray research archive, and the statistical and web development projects (openscholar and zelig). before joining iqss, she worked for about six years in the educational software and biotech industries, initially as a software developer, and later as manager and director of it and software development. prior to that, she was at the harvard-smithsonian center for astrophysics, where she completed her doctoral thesis as a student fellow with the atomic and molecular physics institute, and afterwards she was a post-doctoral fellow, a researcher and a software engineer with the radioastronomy division. dr. crosas holds a ph.d. in astrophysics from rice university and graduated with a b.s. in physics from the universitat de barcelona, spain.   copyright © 2011 mercè crosas   collection level description: the museum perspective d-lib magazine september 2000 volume 6 number 9 issn 1082-9873   heather dunn, professional programs, canadian heritage information network   introduction many organizations (libraries, archives, museums, government agencies, and others) have content that they need to make accessible on the internet. museums have long had computerized collections management databases, and many of them are offering item-level meta-data to the public on the web. but the content of museums' collections databases is invisible to search engines, so what is the means for resource discovery on the web? the creation of web pages containing collection-level description (meta-data describing the collection) is a solution that would facilitate resource discovery. however, there are problems associated with the creation of these descriptions, whether they are manually written or automatically generated from the item-level data. what is collection-level description in the museum context? how can these descriptions be created? what terminology should be used? we need to take a closer look at these issues if we wish to use collection-level description to assist in allowing access between organizations, across disciplines, among resources with different types of content, and for audiences with varying levels of expertise. 1.
defining "collection-level" in the museum view what is collection-level description in the museum context? even within a single museum, the concept of a "collection" may have many different interpretations. consider the following definition of "collection" from the artlex lexicon of visual art terminology: "collection - an accumulation of objects…. collections can be formed around any of a variety of parameters. they may be centered upon a medium or technique, a certain period or group of artists, or a subject, for instance; or they may be encyclopedic, as can be the entire collection of a large museum. museums typically have both permanent collections and traveling collections. also see accession, deaccession, donation, gallery, and patron". [1] in the museum view, a collection may consist of the entire holdings of a particular museum. or, a collection may be a discrete part of a museum's collection; it may be centred on some type of similarity between the items in the collection. for example, a museum collection may consist of the works of a particular artist (e.g., the monet collection) or group of artists, a particular medium or technique (the print collection), or a certain discipline (the ethnology collection). museums also define collections in administrative terms: for example, the collection of a particular donor (the barnes collection), or a collection suited for a particular purpose (the education collection). the concept of "collection" is different again in the context of a collaborative digital resource built from the content of many museums. in a collaborative digital resource, the concept of "collection" becomes very fluid; it can extend beyond the physical walls of individual museums and allow users to combine and re-combine objects at will. for example, a user of on-line resources might wish to bring together all the works of art created by one artist, regardless of where they are physically housed. a "virtual exhibit" can also be considered a collection, bringing together individual objects housed in many physical locations. researchers wishing to access information about these collections on-line may be interested in any of the above interpretations of "collection". museums cannot predict what their users will consider to be a collection, in what language or terminology they will request the data, or what level of information they need (is it for an elementary school project, or a ph.d. thesis?). ideally, collection-level description for resource discovery should provide access to many interpretations of "collections" that are dynamically created by the user. 2. why is collection-level description important? many museums wish to provide public access to their collections databases over the internet. one of the main reasons for creating collection-level descriptions is resource discovery of object-level information held within databases. database contents (information and images about individual museum objects, for example) are invisible to web searches, and it is often impractical to create a web page for each item in the database. unless the contents of databases are described on the internet at the collection level, users will not be able to find the data in a web search. individual museums can use collection-level description to help web users discover the item-level information held in their database resources. some museums have created web pages for individual items in their collection.
for example, the web site of the metropolitan museum of art provides access to a large collection. collection-level description could be used in such cases to facilitate resource discovery of museum web pages containing item-level descriptions or images. it can be used to help a searcher find a general class of items, even though the museum web site contains only references to specific instances of that class. for example, an art museum may have a web site containing text and images about its collection of works by monet. although a web search for "monet" may find the museum's web pages, a search for "impressionist" or "french artists" may not. if the museum's web site contains only images, it will not be found in a web search at all. appropriate terms can be added to the collection-level description to ensure that the museum web site is found, whether the researcher uses broad or specific terminology in the search. it is also possible to use collection-level descriptions for resource discovery within distributed resources. collection-level descriptions of resources created by individual museums (for example, descriptions of a museum collection, virtual exhibit, professional resource, etc.) can be contributed to a centralized location (e.g., a subject gateway or search engine) to be searched by users. when the users find a description of a resource that meets their needs, they can link to it and explore it in detail. more and more importance is being placed on data sharing among organizations; museums, archives, libraries, and government agencies, as well as the corporate world, are striving to be "interoperable" at the local, national, and even global level. collection-level description is also important here. for example, a museum might have a collection of impressionist paintings, while a library holds a collection of books on the impressionist movement and individual impressionist artists. a museum might hold objects relating to the early settlement of a community, while governmental agencies or archives might hold statistical studies that are related (e.g., a census for the period). a researcher would not be able to see the relationship between the collections of the museum, archives, library, and government agency unless there were collection-level descriptions for each. so, collection-level descriptions facilitate cross-disciplinary, multi-level access to web and database resources for a diverse audience. but how would these collection-level descriptions ideally be created to fulfill these objectives? 3. what are the ideals for collection-level description? are they attainable? ideally, collection-level descriptions would be created following a well-designed standard which had been adopted globally and across disciplines, and which was suitable for resource description at the object or collection level. they would be automatically and dynamically created according to user requirements. they would be multi-lingual, and would provide semantic links between object and class, and between professional and public terminology. this is obviously not the reality, but how far are we from this ideal? what steps are being taken to make it happen? the consortium for the interchange of museum information (cimi) [2] has made some important advancements in the field of standards for museum resource description.
phase 1 of the cimi dublin core testbed project [3] was undertaken in 1998 with the goal of testing "assumptions related to the flexibility and simplicity of the dublin core element set, and its suitability and readiness for deployment". seventeen cimi member organizations worked to create object-level descriptions using the dublin core standard, and identified issues surrounding the functionality of dublin core for resource discovery on the internet. one of the problematic topics raised as a result of this project was "characterizing resources as either item-level or collection-level - i.e., determining the unit of analysis for description such as with an exhibition, a collage of photographs, or other aggregated objects" (section 5.2, issues). phase 2 of the dublin core testbed project [4], which began in 1999, includes the publication of a "guide to best practice" for museums using dublin core, and an "examination of resource description framework (rdf) as an effective method for enabling interoperability between applications that exchange meta-data". rdf, an emerging standard of the world wide web consortium (w3c) [5], is "a foundation for processing meta-data; it provides interoperability between applications that exchange machine-understandable information on the web" [6]. among other applications, rdf can be used in resource discovery, cataloguing, and collection-level description. the w3c has developed a model for representing rdf meta-data, and has recommended the use of extensible markup language (xml) as a syntax for encoding this meta-data. the development of these standards for the creation, processing and encoding of meta-data is a vital step toward the goal of achieving cross-domain interoperability. although they have not yet been widely utilized by museums, and there are still many issues that need to be resolved, their validity for museum data has been demonstrated through the cimi testbed project. despite the great advancement in standards for resource description over the past few years, the terminological issues seem more daunting than ever. ideally, the collection-level description should provide access to both general and specific requests, regardless of the knowledge level, discipline, data requirements, and language of the user. knowledge representation tools such as thesauri can accomplish some of these goals. in order to bridge the semantic gaps between the language used in the collection-level description and the language used by internet users (who come with varying levels of subject knowledge), thesauri can be employed with the search engine. taking advantage of the associative, hierarchical, and equivalency relationships of a thesaurus allows easier access, in that the terminology used by the searcher does not have to match that of the resource description. part of the appeal of using rdf as the foundation for meta-data processing is that it allows use of the xml namespace facility [7]. the namespace facility allows the documentation of restrictions and definitions for an organization's meta-data. for example, a museum that is using the art & architecture thesaurus [8] as the controlled vocabulary for object names can declare this in its namespace. this machine-readable namespace declaration will eventually ensure that the meta-data is "understood" and processed as intended by its creators.
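as a rough, hypothetical illustration of such a declaration, the sketch below (python, using the rdflib library) builds a tiny collection-level description with dublin core elements and binds a namespace for the controlled vocabulary; the uris, the concept identifier and the collection itself are invented placeholders, and the code is not drawn from the cimi testbed or from any chin system.

```python
# illustrative sketch only; uris and identifiers below are placeholders.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import DC

AAT = Namespace("http://vocab.getty.edu/aat/")     # assumed uri pattern for the aat
EX = Namespace("http://example.org/collections/")  # hypothetical museum namespace

g = Graph()
g.bind("dc", DC)    # dublin core element set
g.bind("aat", AAT)  # declare the controlled vocabulary used for subjects

coll = EX["impressionist-paintings"]               # hypothetical collection resource
g.add((coll, DC.title, Literal("impressionist paintings collection")))
g.add((coll, DC.description, Literal("paintings by monet and other impressionists.")))
g.add((coll, DC.subject, AAT["300033618"]))        # placeholder aat concept id for "paintings"

print(g.serialize(format="xml"))                   # rdf/xml encoding of the meta-data
```

a resource discovery tool that recognizes the bound vocabulary namespace could, in principle, resolve the subject term against the thesaurus rather than treating it as an opaque string.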
in theory, by reading the namespace of an organization, a resource discovery tool will be able to determine the precise meaning of each of the organization’s meta-data elements, have access to each controlled vocabulary used in the organization’s meta-data, etc. 4. the reality: collection-level description in chin resources with these ideals in mind, museums are progressing toward the goal of using collection-level description to achieve interoperability. however, there are many problematic issues that need to be resolved before this can be accomplished. what terminology should be used in collection-level description to ensure access by both the generalist and the specialist? how can we provide access in multiple languages? how can we create linkages between the terminology used in the collection-level descriptions and the object-level descriptions? how can we accomplish interoperability when the emerging standards are still moving targets? some of the initiatives of the canadian heritage information network (chin) [9] can be used as an illustration of the use (and potential problems) of collection-level description for resource discovery, and of temporary solutions that can be employed until the meta-data and terminology standards catch up with user requirements. some of the most problematic issues have to do with the terminology used in resource description. museums often use highly specialized terminology to describe their collections, whereas internet users with no subject experience may use very general terms. conversely, museums may create collection-level descriptions using very general terms, to the frustration of the user who is searching for a very specific item. it is important that the terminology used in the collection-level description is specific enough to allow users to decide whether they have found an appropriate resource, but general and descriptive enough so that people from a wide range of disciplines and knowledge levels can discover the resources. this is easier said than done, as the discussion below will illustrate. chin has had considerable involvement with resource description in the museum context; for over 25 years, museums across canada have been contributing object-level meta-data (sub-sets of their collections management information) to a collective resource managed by chin. this growing, collective resource, now called artefacts canada, contains data on over 2 million objects housed in canadian museums, and is accessible to the public on the chin web site. chin has created links between the object-level meta-data in artefacts canada and the collection-level descriptions found in another chin product, the great canadian guide. museums contributing object-level data to artefacts canada do not use a common terminological standard; many museums do not use a standard at all. indeed, no vocabulary standard exists that would meet the needs of all museums. a wide range of vocabulary is used by the museums, from very specialized to very general terms, in both english and french. some museums use a classification system, such as the revised nomenclature for museum cataloging [10], in addition to object names; some do not use any classification system. to mitigate this problem, the getty’s art & architecture thesaurus (with chin’s addition of the 2600 most commonly-used french terms) has been integrated with the artefacts canada search engine. 
this enables the user to enter a search term such as "painting", for example, and obtain results that include objects catalogued as "watercolour" (a narrower term of painting) and "peinture" (the french term for painting). although the artefacts canada search engine works quite well in itself, the individual artefacts canada records are invisible to web searches. one solution to this problem is to provide web pages with collection-level descriptions for each of the museums, and to link from the collection-level description to individual instances of that museum's collection. chin has accomplished this through the collection-level descriptions in the great canadian guide. the great canadian guide is another chin resource that has been produced in collaboration with the canadian museum community. the guide is an on-line gateway to over 2400 canadian cultural institutions and attractions; museums use an on-line form to provide and update basic information on their exhibits, hours, location, etc. each museum provides a short collection-level (scope of collections) description using free text, in either english or french (or both). they also select terms that represent their collection (e.g., clocks or time-keeping devices) from a controlled vocabulary list. because the museums are required to use a controlled vocabulary in describing their collections, the terms they use are standardized between institutions. another reason for using controlled vocabularies for collection-level description in the guide is that the terms can be automatically mapped to language equivalents, so that the guide information can be searched and displayed in either french or english. the free-text portion of the collection-level description also serves an important purpose. in describing a collection, certain items may be more significant than others and should be highlighted for the user. for example, a museum may use very general terms to describe a collection of impressionist art, but it is likely important to the user that the collection contains a work by monet. the free-text portion of the description allows the museum to highlight individual items that cannot be described using the broad controlled vocabulary. one of the features of the guide is that it allows users to link from the collection-level description of a museum's collection to the corresponding object records in artefacts canada. for example, if a user does a search on the internet for "furniture", he may find the guide page for the provincial museum of alberta, because the museum has used the term "furniture or furnishings" as one of its collection-level descriptors. the user can click on the term and be presented with all the museum's chairs, tables, etc. as they are found in artefacts canada. in this case, the collection-level description has been an effective means of resource discovery: the user was able to find the resource through the collection-level description, and then investigate the individual objects within the collection. however, the linkages between the guide's collection-level description and the individual instances in artefacts canada are not always automatic. museums seldom use standardized classification terms such as "furniture or furnishings" in their object records, and the art & architecture thesaurus does not always make the link between the classification-level term and all the possible members of that class.
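the kind of query expansion described above, thesaurus relationships combined with manually maintained supplements, can be sketched roughly as follows; the python below uses invented thesaurus entries rather than the art & architecture thesaurus itself, and it is not chin's implementation.

```python
# illustrative sketch only; the thesaurus entries below are invented.
NARROWER = {"painting": ["watercolour", "oil painting"]}             # hierarchical relations
FRENCH = {"painting": ["peinture"], "watercolour": ["aquarelle"]}    # language equivalents
MANUAL = {"clocks and time-keeping devices": ["watch"]}              # hand-maintained supplements

def expand_query(term):
    """expand a search term with narrower terms, language equivalents and
    manual associations before it is sent to the search engine."""
    expanded = {term}
    expanded.update(NARROWER.get(term, []))
    expanded.update(MANUAL.get(term, []))
    for t in list(expanded):
        expanded.update(FRENCH.get(t, []))
    return sorted(expanded)

print(expand_query("painting"))
# ['aquarelle', 'oil painting', 'painting', 'peinture', 'watercolour']
```

where neither the thesaurus nor a manual table supplies a relationship, the expansion cannot bridge the gap between the two levels of description, which is the situation described next.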
therefore, chin has had to create associations from the broad classes describing collections in the guide to the individual members of those classes in artefacts canada. for example, when a user clicks on a collection-level descriptor in the guide (e.g., "clocks and time-keeping devices"), the search that is launched in artefacts canada has been manually supplemented with terms such as "watch", as this relationship is not included in the art & architecture thesaurus. although the terms used in the guide's collection-level descriptions are from a controlled list, and the artefacts canada search engine is assisted by the associative, hierarchical, and equivalency relationships of the art & architecture thesaurus, there is still a semantic gap between the two levels of description. another problem with this model is that an internet searcher will not be able to use specific terminology in a web search. for example, if the user searched for "watch" on the internet, the whyte museum entry in the guide would not appear, as the specific term "watch" is not in the collection-level description; the user would have had to know to search for "clocks and time-keeping devices". another problem is that the linkages between the resources are not dynamic. for example, if the museum adds records to artefacts canada that have not been included in the list of expanded search terms defined by chin, the linkages from the class to the member will not be there. if a museum removes all of its "clocks and time-keeping devices" from artefacts canada without changing its collection-level description in the guide, there will be a dead link in the guide. in attempting to link between object-level descriptions and collection-level descriptions, chin has chosen to have museums write collection-level descriptions for their collections, and is trying to develop methods of bridging the gaps between the search terms used by internet users and the terms used by museums in their collection-level descriptions. as we have seen, this is difficult because the knowledge tools must be able to link from the class to the object (and vice versa), and must also create the connections between specialist and generalist terminology in multiple languages. museums are using a wide variety of controlled vocabularies, such as classification tools and thesauri (and some are not using any); we need to find a way to bring these tools together. multilingual knowledge tools are the goal, but at the very least there is a need for language equivalents at the classification level. no tool exists at present that will accomplish all this. it may also be possible to work backward from the object-level data to dynamically create collection-level description. for example, if a museum has catalogued its collection using specific terminology, we may be able to run these specific terms through a knowledge tool that would determine the general class to which those objects belong. this seems harder to accomplish, but could be done dynamically to reflect changes in the content of the resource. more study needs to be done to determine whether this is feasible. chin has just begun to use collection-level description to enable access to distributed resources. learning with museums is a resource that makes on-line educational museum content more accessible on the internet. it has a distributed architecture, with the resources existing on the museums' web sites and only a meta-data record contributed to the central repository at chin.
to participate in the learning with museums project, museums create meta-data for each of their on-line resources (virtual exhibits, educational games, etc.) by using an on-line cataloguing tool created by chin and provided to members on the chin web site. chin sends the meta-data back to the museum, and the museum embeds it within the resource. chin can then harvest this meta-data periodically in order to update its meta-data repository. users (teachers, students, etc.) can search or browse the learning with museums meta-data, and when they find a resource they are interested in, they can link to it, wherever on the web it might be physically stored. learning with museums provides a description of the resource; in some ways, this can be considered a collection-level description, as it is a description of the "collection" of items that has been brought together to form the virtual exhibition. terminology used in these descriptions is drawn from a thesaurus of subject areas based on canadian school curricula (e.g., broader term: sciences; narrower term: chemistry). chin's newest initiative, the virtual museum of canada, is currently under construction. it will use existing resources such as artefacts canada, the guide, and learning with museums as building blocks, and will also enable canadian museums to create new rich content to be included as distributed resources in the virtual museum of canada. current thinking is that museums will use a cataloguing tool (similar to that used in learning with museums) to catalogue the content they are linking to the virtual museum of canada. the cataloguing tool would allow them to provide chin with collection-level meta-data on the virtual exhibit, photo gallery, etc. that they have created. in addition to the collection-level descriptions that museums will provide to the virtual museum of canada, museums will be required to submit object-level meta-data about the individual objects featured in their virtual exhibits to artefacts canada. again, there is a potential problem in linking the collection-level meta-data to the object-level meta-data. the virtual museum of canada will include diverse types of content: collection-level meta-data pointing to virtual exhibits, web pages with collection-level meta-data describing museums' scope of collections, and database records describing or illustrating individual museum objects. the challenge is to find a way to enable users to find individual items in a collection or virtual exhibit, as well as the entire exhibit or collection. we are currently exploring the options for solving this problem. library classification systems built into the search engines might offer part of the solution, if they are able to fill the semantic gap between our collection- and object-level descriptions. as well, we might be able to build linkages into the meta-data. for example, when a museum submits meta-data about a new virtual exhibit to be added to the virtual museum of canada, the museum could be required to identify each object featured in the exhibit with a unique identifier that already exists in the object's record in artefacts canada. the virtual museum of canada will eventually allow cross-domain searching of the holdings of the national library of canada and the national archives of canada. collection-level descriptions and rdf will doubtless play a large part in achieving seamless interoperability between these three disciplines.
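as a rough illustration of how such identifier-based links could be checked, the sketch below verifies that every object identifier cited in a hypothetical virtual exhibit's collection-level record resolves to an object-level record; the identifiers and data structures are invented for illustration and do not describe chin's actual systems.

```python
# illustrative sketch only; identifiers and records below are invented.
ARTEFACTS_CANADA = {"AC-001234", "AC-005678"}    # object ids known to the object-level store

exhibit_metadata = {
    "title": "time and timekeeping",             # hypothetical virtual exhibit
    "objects": ["AC-001234", "AC-009999"],       # ids supplied for the featured objects
}

def check_object_links(metadata, known_ids):
    """report which object-level links behind a collection-level record resolve,
    so that dead links can be caught when the description is submitted or re-checked."""
    return {obj_id: obj_id in known_ids for obj_id in metadata["objects"]}

print(check_object_links(exhibit_metadata, ARTEFACTS_CANADA))
# {'AC-001234': True, 'AC-009999': False}
```

run periodically, a check of this kind would also catch the dead-link problem noted earlier, where object records are removed without the corresponding collection-level description being updated.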
conclusion although there are still many challenges to be faced in using collection-level description to facilitate museum resource discovery on the internet, advancements are being made through the development of standards such as dublin core and rdf, and through the increasing use of knowledge tools such as the art & architecture thesaurus. it is important that the museum, library, and archival communities work together to ensure that these developments lead to true interoperability and resource sharing on a global level. references [1] "collection." artlex visual arts dictionary. july 26, 2000. [2] consortium for the computer interchange of museum information (cimi). july 26, 2000. [3] cimi dublin core metadata project phase 1 final report. consortium for the computer interchange of museum information. july 26, 2000. [4] cimi dublin core metadata testbed phase ii. consortium for the computer interchange of museum information. july 26, 2000. [5] world wide web consortium (w3c). july 26, 2000. [6] resource description framework (rdf) model and syntax specification. february 22, 1999. world wide web consortium. july 26, 2000. [7] namespaces in xml. january 14, 1999. world wide web consortium. july 26, 2000. [8] the art & architecture thesaurus browser. version 3.0. j. paul getty trust. july 26, 2000. [9] canadian heritage information network. july 26, 2000. [10] blackaby, james r.; greeno, patricia; and the nomenclature committee. 1988. the revised nomenclature for museum cataloging: a revised and expanded edition of robert g. chenhall's system for classifying man-made objects. american association for state and local history, nashville, tn. copyright © 2000 heather dunn doi: 10.1045/september2000-dunn   storage is a strategic issue: digital preservation in the cloud d-lib magazine march/april 2015 volume 21, number 3/4   gillian oliver, victoria university of wellington, new zealand, gillian.oliver@vuw.ac.nz; steve knight, national library of new zealand, steve.knight@dia.govt.nz doi: 10.1045/march2015-oliver   abstract worldwide, many governments are mandating a 'cloud first' policy for information technology infrastructures. in 2013, the national library of new zealand's national digital heritage archive (ndha) outsourced storage of its digital collections. a case study of the decision to outsource and its consequences was conducted, involving interviews with representatives of three key stakeholders: it, the ndha, and the vendor. clear benefits were identified by interviewees, together with two main challenges. the challenges related to occupational culture tensions and to a shift in funding models. interviewees also considered whether the cultural heritage sector had any unique requirements. a key learning was that information managers were at risk of being excluded from the detail of outsourcing, and so needed to be prepared to assert their need to know based on their stewardship mandate.   1 introduction internationally, government 'cloud first' mandates are forcing serious consideration by many public service organisations of outsourcing information technology (it) requirements to external providers.
cultural heritage institutions are no exception to this, but there is a paucity of advice and experience to draw on to inform decision making. the purpose of this paper is to document the decision taken to outsource the storage of the national library of new zealand's national digital heritage archive (ndha), and in so doing provide some empirical evidence to assist other institutions worldwide facing similar decisions. the tendency to equate digital preservation with cold storage could lead to incorrect assumptions about the outsourcing solution required. it is important to articulate the symbiotic relationship between access and preservation. those providing digital preservation services need to provide access and preservation management to materials in active storage (i.e. active retrieval and active management of collections over time). the paper begins by providing the background of the international and national governmental context with regard to cloud computing, and then reports on the literature relating to use of cloud computing by cultural heritage institutions. this is followed by the case study of the ndha context and the decision to outsource, drawing on interviews conducted with the key individuals involved. the paper concludes with discussion of findings and draws out some key recommendations for others considering outsourcing the storage of digital collections.   2 background — the global context the united states national institute of standards and technology (nist) has developed a concise definition of cloud computing to serve as a basis for comparison of services and deployment strategies: "cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. " (mell and grance, 2011, p.2) the nist definition further identifies service and deployment models. of relevance to this paper is the infrastructure as a service (iaas) model, where "the capability provided to the consumer is to provision processing, storage, networks and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. the consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications; and possibly limited control of select networking components (e.g. host firewalls)" (mell and grance, 2011, p.3). four deployment models are identified: private, community, public and hybrid clouds. it is the private cloud model that is most relevant to this paper, where "the cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g. business units). it may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises " (mell and grance, 2011, p.3). internationally, governments have seen the potential for cloud computing to enable the delivery of more efficient and effective public services, with compelling cost economies (irion, 2012). it has been argued that cloud computing architectures are fundamental in the necessary transformation of governments to provide citizens with services in the digital age (fishenden and thompson, 2013). 
new zealand is of course not immune from these global trends. in 2011, the new zealand government embarked on a programme of transformation of the public sector to achieve significant cost savings and economies of scale via the use of shared services (guy, 2011).   3 cultural heritage institutions and the cloud despite the ubiquity of cloud computing and its promotion by governments worldwide, concerns about trustworthiness and the mandate for digital archives to preserve unique treasures in perpetuity have meant that cultural heritage institutions have scarcely been early adopters of this innovation. new zealand's ndha is possibly the first, and perhaps the only, national cultural heritage initiative to outsource the storage of its collections. nonetheless, the first published reports of outsourcing are starting to appear, notably from britain. in 2014, the united kingdom's national archives released a set of guidance documents for archives considering taking this step (national archives, 2014). the guidance includes three provisos which should be considered the bottom line — they must underpin any negotiation of an outsourcing contract: "first, data held in archives must be expected to be both preserved and accessible beyond the commercial lifespan of any current technology or service provider. second, an approach to addressing serious risks, such as loss, destruction or corruption of data that is based purely on financial compensation will not be acceptable, as this takes no meaningful account of the preservation and custodial role of archives; and, third, in order to reinforce the criticality of the first two elements, explicit provision must be made for pre-defined exit strategies ... and effective monitoring and audit procedures" (p.10). the guidance document provides a long list of the benefits of outsourcing. perhaps the most significant of these from a digital archiving perspective is the potential for improved capability in digital preservation. because of the feasibility of automated replication in multiple locations and the specialized expertise of vendors in terms of digital storage and integrity checking, it may be possible to achieve improvements at bit preservation level (p.11). this would also need to be reinforced through referential integrity checks across the database and mets files, as well as any other systems supporting the digital preservation programme. the tna guidance is accompanied by four case studies of outsourcing by cultural heritage institutions (case studies can be downloaded here). the settings of these case studies range from a local history centre to the british parliamentary archives, but do not include a national library or archives.   4 new zealand in 1965, the national library of new zealand (nlnz) was established as a standalone government department. in 2003, new zealand's legal deposit legislation was updated to include digital resources (new zealand government, 2003). this was significant as it meant that any new zealand digital content created was required by law to be deposited at the national library, to be kept in perpetuity. therefore it was necessary to build a repository to serve as a digital archive, and in 2008 the ndha was launched (see knight, 2010 for background on development and implementation). in 2011, nlnz (together with archives new zealand) was integrated into the department of internal affairs (dia). the government chief information officer (gcio) is also part of dia. 
the gcio is charged with delivering sustainable business savings of nz $100 million per year by 2017 (new zealand government, 2013). one strategy identified to achieve this is discontinuing in-house owned and operated technology assets, and moving to a service based model instead (new zealand government, 2013). as a first step towards this goal a contract was negotiated with three vendors to act as approved data centres — so effectively a private cloud. in effect, the dia was tasked with leading the implementation of infrastructure as a service (iaas) across the whole of government (incorporating storage as a service and backup as a service amongst other potential services), and the ndha agreed to be one of the initial pilot groups, with a specific focus on outsourcing storage. the size of the ndha collection can be seen in the following figures: figure 1: total intellectual entities and files in permanent repository   figure 2: size of permanent repository, in terabytes the background to the decision to pilot outsourcing including the concerns identified by national library stakeholders and details relating to the migration method have been documented by cynthia wu (wu, 2013).   5 the case study the motivation for this case study was to contribute to a much larger project: interpares trust, led by professor luciana duranti of the university of british columbia. the goal of this international research agenda is to "generate theoretical and methodological frameworks to develop local, national and international policies, procedures, regulations, standards and legislation" (interpares trust, 2015) for trustworthy digital records and data in a global, networked environment. of special concern is the need to ensure a persistent digital memory, hence the decision to outsource the storage of new zealand's digital memory seemed particularly worthy of investigation. the purpose of the ndha case study therefore was to contribute to interpares goals, in particular to inform the development of policy relating to the use of cloud storage providers. data was collected by interviewing seven individuals in may through june 2014. interviewees were representatives of the three parties involved in deciding to outsource, negotiating the contract and implementation. the three parties were government technology services (the branch of dia with responsibility for implementing iaas across government), the ndha and the vendor. snowball sampling identified those individuals who could comment from a strategic perspective, as opposed to a focus on operational detail. this was interpretive research, so the findings cannot be generalized, but they do nevertheless provide a rich picture of the benefits and challenges of outsourcing. quotations used below are attributed to one of these three groups by the use of initials: im (for information managers working with ndha), it (for members of government technology services) and ve (for individuals associated with the vendor).   5.1 benefits there were clear benefits identified by interviewees, and very genuine enthusiasm for the features of an outsourced environment. it was explained that the opportunity to outsource storage to the cloud was presented at a very good time — the existing in-house it infrastructure needed to be upgraded, and its capacity to store increasing amounts of data was of concern. 
one respondent noted that the decision to outsource was an "opportunity to look at the whole topography of how the ndha was laid out" (im2), in other words to revisit original design decisions and to refine where necessary. this is a very significant factor given that the initial design and implementation of the ndha was a pioneer endeavor — there were no pre-existing digital preservation systems that could be used as templates. in outsourcing the storage component it was possible to take advantage of the vendor's experience in managing large sets of data and load balancing. other benefits identified pointed to the fact that the hardware used would be state of the art, and of a consistent standard not likely to be seen in an in-house it facility. a contractor whose business depends on the quality of service provided will have a tailor made modern facility, whereas in-house it services are likely to be characterized by organic and possibly haphazard growth over long periods of time. it was also pointed out that a contractor will be much more attuned to customer service in the sense that if greater capacity was required, it would be made available as soon as possible. in contrast, in-house requests for greater capacity could involve a lengthy negotiation process navigating internal approval channels. another benefit that attracted comment related to greater transparency about the costs involved in digital preservation activities, and consequently being able to make informed decisions about particular courses of action. for instance, whether to manage digitised content in the same way as born-digital content. one respondent made an impassioned plea for those working in the cultural heritage sector to seize the opportunity represented by this new service and funding model, to escape from the victim mentality that is characteristic of this sector when it comes to the storage of collections, both digital and physical: "... we tend to kind of preload, get some capacity and fill it and then panic and try to get some more investment. which i think in the longer term is unhelpful because it masks the true year on year costs of operating the business and makes it easier for government or other investors once every 10 years to say here's a bit of money, go away — and managing within that becomes the challenge for the institution, rather than continually having a genuine conversation ..." (it3). the point being made was that those working in the cultural heritage sector are making decisions reactively rather than proactively, and always being on the back foot. what's worse is that this condition is accepted as a way of life.   5.2 challenges two main challenges emerged from the interviews, relating to occupational cultures (tensions between the information management and information technology perspectives) and to funding. these two areas were touched on by all interviewees. from the information managers' perspective, being taken seriously, being included in decision making, was a major challenge initially: "in the beginning we weren't even invited to meetings, we were just told this stuff was happening" (im1) from the information technology management perspective, the mindset of cultural heritage professionals was perplexing to say the least. there was little understanding why information managers would want to be concerned with the detail of outsourcing, rather than just leaving everything to the experts — i.e. those working in it. "library and archives ... 
have a different view of control than other branches and departments. they want to have a lot more control over where their information assets are, they want to know about them ... to a level of detail i think is unnecessary." (it1) this interviewee compared the attitudes encountered in the cultural heritage sector with other branches of government: "... other people are prepared to give up a bit more control and trust ... provided their criteria are met." (it1) in other words, from the it perspective the information managers were perceived as crossing boundaries into areas which were not of their concern. the vendor also commented on the differences encountered from their perspective: "... the business was very much involved right from the outset to understand the solution we were putting up, the level of discussion and technical diligence that they went through was a little bit more in depth than the typical engagement you would have." (ve1) another it respondent drew an analogy to the infrastructure of the physical world saying that the provision of it services should be viewed in the same way as the services provided by electricians and plumbers: "you are always reliant on third parties, so recognize the professional disciplines there" (it3) this difference in perspective is not necessarily one of trust but may be traced back to the emergence of digital preservation and the need for libraries and archives to understand technology and infrastructure more deeply in order to be able to attest to the authenticity and integrity of their digital collections over time, as they have historically done in the physical world. eventually, the gap in understanding between information management and information technology professionals was bridged by an intermediary. this was an it contractor who had been heavily involved in the initial design and subsequent implementation of the ndha. the contractor was someone therefore who had a deep understanding of the mission and purpose of the ndha and consequently why decisions to do things a certain way had been made. in addition, the contractor was well known and respected by colleagues in government technology services. this 'honest broker' role was felt to be essential to the success of the project. the other main challenge concerned a change in the funding model. this change was a consequence of moving from the purchase of equipment for storage for use in house, to the provision of storage as a service. in the former case, funding would be drawn from an organisation's capital expenditure (capex). in the latter, the costs are accrued to the organisation's operating budget (opex). and it's not simply a case of funding being reallocated from capital to operating — a much more complicated scenario is at play. this, it must be stressed, is not a situation unique to the new zealand government environment, as was made clear by an interviewee: "it's ironic, in the uk, the states and the asia pacific [region] there is this big push to consume things as a service, which everyone knows is moving you down an opex route. and the financial models aren't there to let you do it. and in most places it's the governments that are pushing you down this route, and yet the same people are going — well, we can't support funding it. so — it will be sorted, because it's the way the world is going, it's just a case of when. it will be awkward for a period of time" (it1). this interviewee went on to say "we've created this whole new channel with no new funding. 
that's a real pressure. ... the actual unit prices of storage are cheaper. the security and the service we're getting is better. the offering stands in its own right but we've got these funding issues around it" (it1). it is indeed particularly ironic given that the push for the adoption of cloud computing was motivated by the government's need to realize significant cost savings and economies of scale via shared services (wu, 2013). as another respondent made clear, it was extremely difficult to make the decision to move to an outsource model on the basis of cost savings, as the operating costs for in-house storage were not transparent. so rather than an informed decision, there was a need to take a leap of faith in deciding to make the change.   6 unique requirements of the cultural heritage sector interviewees were asked if they considered that the cultural heritage sector had any different or unique requirements that needed to be taken into account when moving to an outsourced model for storage. the vendor and the information managers were in agreement that a higher level of assurance was needed: "it have a very vanilla view of servers, infrastructure, applications ... [in contrast] the ndha, they are very passionate about the data that sits in there, on the servers, the longevity of it, the reputable data, ensuring that it doesn't change over time, probably more than what it folk typically are." (ve1) the vendor went on to explain that "... the business was very much involved right from the outset to understand the solution we were putting up, the level of discussion and technical diligence that they went through was a little bit more in depth than the typical engagement you would have." (ve1) "if most organisations lose a document, so long as they get the document back they're pretty happy. but because of digital preservation being what it is, you don't want to lose or corrupt any of the bits, they have to be exactly the way they were before " (im2). another unique feature identified by interviewees related to the mission and purpose of the ndha. an all of government contract had been negotiated which included a catalogue of specific services that were to be provided. the contract was understandably written for generic workloads — reflecting the types of information transactions taking place every day in government offices. in the case of the ndha however the amount of data needing to be stored was significantly larger than the norm, and needed to be kept in perpetuity rather than for a limited period of time. the ndha workload was described as having: "very high initial throughput, and very peaky workloads at ingestion and ... very random recalls and the capacity is much larger than what the service catalogue was originally written for" (ve2). the volume of data and longevity requirements, together with the need for greater assurance, impacts on the nature of back up carried out, as well as on protection and retention regimes.   7 discussion findings from this case study indicate that very clear benefits are possible for cultural heritage institutions in moving to the cloud, particularly in terms of gaining access to state of the art equipment and facilities, and expertise in dealing with extremely large datasets. in the ndha case, risks associated with data sovereignty were minimized, as the only option was the service provided by what was in effect a private cloud. 
given those benefits, though, the decision to outsource did surface a number of other issues which are likely to be internationally relevant. the over-arching issue, which was identified by one interviewee as an opportunity, is the paradigm shift in terms of control and financial model. to really understand the profound nature of this change in model, and to respond to it as an opportunity rather than a threat, it will be essential for information managers to understand the strategic dimensions of storage and of broader infrastructure decisions. storage, whether for physical or digital collections, has tended to be regarded as a purely operational concern, but the decisions made about storage have the potential to influence the very core of the institution, and may impact its ongoing viability. hence taking a long term view, and understanding the consequences of outsourcing from a control and financial management perspective, are essential. andrew abbott, in his theory of professions, articulated the idea of a competition for jurisdiction (1988). abbott argues that professions arise as a result of system disturbance, and eventually establish their jurisdiction over a particular problem area — or, put another way, their responsibility for a given set of issues. as society develops new technologies, new problems emerge, and occupations either respond or lose ground to other, newer professionals — a question of survival of the fittest. this idea has been explored in the past in the library and information science domain, with the various professions active in the information environment viewed as engaging in a competition for jurisdiction (van house and sutton, 1996). given the complexity of today's information environment, the competition for ownership of specific domains has become more and more acute. for information managers, establishing their claim to specialist expertise, and being acknowledged as having a particular perspective to contribute to decision making, is often an uphill battle. one clear prerequisite is for information managers to be clear about their responsibilities, and to be ready to explain repeatedly why their concerns matter. in so doing, they open up the potential to shift the thinking of the other professions involved. thinking needs to move beyond a narrow competition, to viewing the information environment as expansive and complex enough to need a network of independent specialists, similar to the healthcare environment for instance. faced with intractable views, however, the best option might be to try to identify an 'honest broker', someone who understands both sides and is respected by everyone involved. a key learning from this case study is that information managers should expect to be excluded from the detail of outsourcing, and so must be prepared to be assertive and to establish their need to know based on their mandate to act as stewards of information as an authoritative resource.   8 conclusion cultural heritage institutions should investigate using storage-as-a-service offerings, and should also look ahead to utilizing other cloud-based services. awareness of the short-term consequences of cost saving (i.e. an increased burden on operating budgets) must be factored into decision making and set against potential long-term benefits.
although it is not possible to generalize from the one instance explored in this paper, the requirements of cultural heritage institutions are likely to differ from those expected in generic, everyday office situations in terms of data quantity, required longevity, and spikes in activity level. being able to articulate this difference, and to explain stewardship responsibilities, will assist in negotiating appropriate service levels. the ideal situation is one where a trusted individual can be identified, who can act as broker between information management and information technology professionals to assist in raising awareness of the different perspectives involved. the nature of this changing environment, where in-house operations can be delivered as a service by third parties, is one where opportunities can become threats if information managers are not equipped to respond appropriately. much of the responsibility for ensuring that information managers can adapt and be effective in this complexity rests with educators. it is imperative that new entrants to the information professions are equipped with the knowledge and skills necessary to approach and understand technology and infrastructure as a strategic issue within their sphere of influence.   references [1] abbott, a. (1988). the system of professions: an essay on the division of expert labor. university of chicago press. [2] fishenden, j., & thompson, m. (2013). digital government, open architecture, and innovation: why public sector it will never be the same again. journal of public administration research and theory, 23(4), 977-1004. http://doi.org/10.1093/jopart/mus022 [3] guy, n. (2011, august 18). speech: the future of government ict. [4] interpares trust (2015). [5] irion, k. (2012). government cloud computing and national data sovereignty. policy & internet, 4(3-4), 40-71. http://doi.org/10.1002/poi3.10 [6] knight, s. (2010). early learning from the national library of new zealand's national digital heritage archive project. program: electronic library and information systems, 44(2), 85-97. [7] mell, p., & grance, t. (2011). the nist definition of cloud computing. special publication 800-145, national institute of standards and technology. [8] national archives (2014). how cloud storage can address the needs of public archives in the uk. [9] new zealand government (2003). national library of new zealand (te puna mātauranga o aotearoa) act 2003. [10] new zealand government (2013). government ict strategy and action plan to 2017. [11] van house, n., & sutton, s. a. (1996). the panda syndrome: an ecology of lis education. journal of education for library and information science, 131-147. [12] wu, c. (2013). adoption of infrastructure-as-a-service at the national library of new zealand. paper presented at archiving 2013, imaging technology and science, washington, dc. published in final program and proceedings, pp. 176-182.   about the authors gillian oliver is the programme director, master of information studies at victoria university of wellington, new zealand. she is an honorary research fellow at the humanities advanced technology and information institute, university of glasgow and at the open polytechnic of new zealand. her professional practice background spans information management in the united kingdom, germany and new zealand. her research interests reflect these experiences, focusing on the information cultures of organisations.
steve knight is the programme director, preservation research and consultancy at the national library of new zealand. prc's primary focus is preservation of and access to new zealand digital content with a particular view to modelling and developing solutions that can be scaled to national level.   copyright © 2015 gillian oliver and steve knight   d-lib magazine november/december 2012 volume 18, number 11/12   a framework for contextual metadata used in the digital preservation of cultural objects joan e. beaudoin wayne state university joan.beaudoin@wayne.edu doi:10.1045/november2012-beaudoin2   abstract the quality and the quantity of contextual information found in the descriptive metadata associated with digital representations of cultural objects are frequently inadequate for assuring that users will understand the nature of both the original object that has been digitally preserved, and the digital representation itself. this paper proposes and defines a framework for ensuring that relevant contextual metadata is easily collected and maintained. after identifying and describing eight important dimensions of context, the paper shows how implementing the framework, through a series of questions and prompts, results in a descriptive metadata record that accommodates the important aspects of an object's context. using two very different cultural objects as examples, an etruscan tomb painting and a 19th century bridge, the framework demonstrates that sufficient contextual information can be recorded in a metadata schema to enable effective future search, retrieval, examination, use, management, and preservation interactions.   introduction understanding as much about the context of an object as possible, such as its history, composition, purpose, authenticity, ownership and utility, is critical to a meaningful understanding of that object. this holds true for appreciating all types of cultural objects — a work of art, a sound recording, a monument of historic significance, or a digital photograph of that historic monument. unfortunately, the metadata captured during digital preservation that is necessary for providing a good understanding of an object is often fragmentary and incomplete. in an effort to remedy this situation, this paper identifies the kind of information that needs to be captured and presents a framework for recording this metadata. digital surrogates of cultural objects were chosen as the focus of this project as they are primarily non-textual in nature, and so the contextual information recorded about these items is particularly important. an analysis of the various kinds of context found in the digital preservation literature was undertaken to develop the framework presented below. this literature review, also published in d-lib magazine [1], served to identify and investigate the various forms of context believed to be useful in digital preservation efforts surrounding cultural objects. the main themes found in the literature were then used to develop the basic dimensions of the framework. each dimension speaks to an aspect of context distinct from the others and suggests the richness of the data that needs to be recorded.
as each dimension of the framework encompasses a specific set of qualities, definitions of the dimensions would be highly useful here.   definitions of the dimensions of context broadly speaking, the technical dimension surrounding context concerns digitization processes and techniques. this includes aspects such as file formats, hardware, software, operating systems, migration, emulation, storage, data loss, encapsulation of technical information, and compatibility issues. it identifies the processes completed, the individual(s) who performed the work, and the dates when the work was completed. furthermore, this dimension includes information about the various versions of files, file names and storage locations, the software used, and the equipment upon which the work is to be displayed. the next dimension, readily recognizable as critical to understanding an item's contextual information, is utilization. this dimension of context speaks to the needs of users. it includes audience needs, task support, tools required for interactions with the item, accessibility, audience characteristics, and the types of analyses to be supported. additional issues described within this dimension concern the use setting of the original and the digital item. this includes the political, social, and organizational environment of its use, and details about the digital content's accessibility and persistence over time. those characteristics of a work that are dependent upon a direct, tangible interaction with items are addressed in the physical dimension of context. this includes features of analog and digital items which are sensory in nature. issues relating to how an object is experienced (e.g., scale, materials, texture, arrangement, sound, brightness, smell, etc.) are recorded within this dimension. as the material nature of an item is changed in the analog to digital transformation, and also in the mediated interactions which result from the software and hardware used, the details recorded here can be critical to understanding the original object. other aspects lost in the analog to digital transition have to do with determining an item's extent, how the parts of a single item, or several items, are related to one another, impermanent relationships between digital content, the various paths that can be taken through an object, and network linkages between items. based on the unclear boundaries and the sometimes mutable relationships which exist with digital items, this dimension of context is named intangible. another dimension to be identified as useful to contextual information concerns the reasons why a digital item is preserved and identifies specific decisions about its preservation, storage and handling. this dimension, curatorial, identifies the individual(s) who requested that an item be preserved, and provides background information concerning the motivation for preservation efforts. also recorded within this dimension is information concerning additional files representing the same object and/or the same digital item. this information is useful in making informed decisions about the uniqueness of digital files and the coverage of a single object. finally, aspects concerning the care and handling of the digital file are addressed within this dimension. the next dimension of context, authentication, is closely related to that of curatorial, as it concerns the verification of the digital content. this dimension includes how the content has been maintained and it addresses security issues. 
for example, this dimension asks who has had access to the digital data and who has been responsible for its protection. it also records information pertaining to any modifications made to the data after deposit and explains any changes that have been made to the data. authorization is the dimension of context which records information concerning the intellectual property rights surrounding the digital content. this dimension identifies the copyright holder and delves into aspects surrounding rights management such as legal agreements regarding the specifics of access, use and exchange. the final dimension, intellectual, is concerned with the significance of the original cultural object and, by proxy, its digital surrogate(s). this dimension contains information that is typically understood to be the scholarly record of an item. information pertaining to an object's meaning, function, technique of manufacture, historical import, cultural narratives and the communication of ideas is addressed in this dimension. each dimension of context highlights a unique aspect important to future interactions with, and knowledge concerning, digital content. some dimensions speak to an item's physical manifestation, several examine the digital expression and yet others deal with aspects that reflect the intellectual history surrounding an object. the variability seen in the dimensions' basic characteristics helps to highlight the complexities involved in recording contextual information about digital content, and points to the need for a framework within which to capture the critical information. cultural materials in a non-textual form are particularly challenging items to bring forward in time via digitization. in the case of these cultural objects, lost knowledge about the original item and its digital surrogate is not easily recreated through a close reading of the object-surrogate itself. this situation means that a critical evaluation must be made of the kinds of data being recorded for cultural materials lest we run the risk of saving content that has no real useful future.   development of the dimensions of context as each dimension of context identified above is a somewhat abstract concept, a series of questions and prompts was developed to aid in the practical application of the framework. the questions and prompts presented here are not meant to be all-inclusive or applicable to every situation. however, in their current form they help to point out the breadth of information that needs to be accounted for within each dimension. for example, the questions and prompts in the technical dimension clearly address many areas: the format of the original and the digitized items, the digitization process, who performed the work, the size of the digital item, the hardware and software used in the creation of the item, what technologies will be needed for future access, naming of the archived file, file locations, and the names of associated files. making each data collection point explicit in this way helps ensure that important contextual information is recorded for future use and understanding of the item. collecting useful contextual metadata for non-textual, cultural objects was the focus of this project. the framework was developed for this specific category of objects. how well this framework would support text-based object metadata collection, or metadata for objects in domains outside of the cultural heritage sector, is not clear.
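to make the idea of explicit data collection points more concrete, the sketch below shows one possible machine-readable rendering of the framework. it is an illustration only, not part of the published framework: the prompts listed are a small subset of those given in the next section, and the helper functions (empty_record, unanswered) are hypothetical conveniences written in python.

    # a minimal sketch of the contextual-metadata framework as a python structure.
    # dimension names follow the paper; the prompts are a small subset of those
    # listed below, and the helpers are hypothetical, not a published api.
    CONTEXT_FRAMEWORK = {
        "technical": ["format of original", "digitization date", "digitized by",
                      "archival file named", "archival file located"],
        "utilization": ["audience needs of digital item", "accessibility of digital item",
                        "persistence of digital item"],
        "physical": ["format of original", "scale of original", "material(s) of original"],
        "intangible": ["object boundaries", "relationships and links"],
        "curatorial": ["preservation requestor and background",
                       "preservation reason for digital item"],
        "authentication": ["provenance", "digital file change tracking"],
        "authorization": ["copyright holder", "rights management"],
        "intellectual": ["meaning", "function", "historical or cultural import"],
    }

    def empty_record():
        """return a blank metadata record with every dimension and prompt present."""
        return {dim: {prompt: None for prompt in prompts}
                for dim, prompts in CONTEXT_FRAMEWORK.items()}

    def unanswered(record):
        """list (dimension, prompt) pairs that still lack contextual information."""
        return [(dim, prompt)
                for dim, prompts in record.items()
                for prompt, value in prompts.items()
                if value in (None, "")]

    if __name__ == "__main__":
        record = empty_record()
        record["technical"]["digitized by"] = "al mare"          # value drawn from example 1 below
        record["authorization"]["copyright holder"] = "al mare"
        print(len(unanswered(record)), "prompts still need answers")

holding the prompts as data in this way would mean that a data-entry form, a completeness report, or a crosswalk to an existing schema could all be generated from a single source, rather than each being maintained by hand.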
technical dimension format of original: what format is the original item? digitization date: when was the item digitized? digitized by: who digitized the item? digitization format (image, audio, video, animation file type): what format is the item? digital file format: what file type was used to create the digitized item? digital dimensions: each question asks about the dimensions of the archival version of the digital file.       resolution: what is the resolution for the archival file?       bit depth: what is the bit depth for the archival file?       file size: what storage space is required to save the archival file?       file dimensions as displayed/ played: what is the extent of the file? this may be recorded as pixels high and wide, length of playback, etc. digitization equipment (scanner, digital recorder, digital camera): what digitization device was used to create the digital file? post-digitization processes (cropping, corrections, data cleaning): what post-digitization processes were completed to ready the item for use or preservation in the digital archive?       software used: what software was used for each process?       date completed: when was each process completed?       completed by: who completed each process? archival file named: what is the name of the file used for preservation purposes? archival file located: where is the file used for preservation purposes located? derivative file(s) named: what is the name for each derivative file created from the archival file? derivative file(s) located: where is each derivative file located? equipment needed for playback: what hardware/device is needed to interact with the digital file? software needed for playback: what software is needed to interact with the digital file? other technical concerns: are there additional technical issues associated with this item? if so, please describe these.   utilization dimension audience needs of original item (educational, leisure, legal, medical, youth, etc.): what is the primary need addressed by this item? what is the original impetus to add or preserve the item? audience needs of digital item (educational, leisure, legal, medical, youth, etc.): what is the primary need addressed by this item? diversity of original audience (homogenous or heterogeneous): what is the basic type of audience for this item? diversity of digital audience (homogenous or heterogeneous): what is the basic type of audience for this item? social, political, organizational setting for original item's use: what is the typical setting for the item? social, political, organizational setting for digital item's use: what is the typical setting for the item? audience task support, analyses and interactions for original item: how would past, current, and future audiences use the item? what task(s) would it support? what analyses would be carried out? what interaction(s) between the user and the item would exist? what tools are necessary to interact with the digital content? audience task support, analyses and interactions for digital item: how would past, current, and future audiences use the item? what task(s) would it support? what analyses would be carried out? what interaction(s) between the user and the item would exist? what tools are necessary to interact with the digital content? accessibility of original item (non-technical restrictions): how accessible is this item? is only the surrogate of the item available? is there a particular aspect of the item that is restricted? 
accessibility of digital item (non-technical restrictions): how accessible is this item? is only the surrogate of the item available? is there a particular aspect of the item that is restricted? persistence of original item (importance of future access): how critical is lasting access to the original item? why should access to the item be maintained? at what point can this content be discarded? persistence of digital item (importance of future access): how critical is lasting access to the digital item? why should access to the item be maintained? at what point can this content be discarded?   physical dimension format of original: what is the physical form of the original item? scale of original: what are the dimensions of the original item? material(s) of original: what materials were used in the creation of the original item? technique(s) used in creation of original: what techniques were used in the creation of the original item? surface/textural/auditory/olfactory characteristics: what sensory characteristics are present in the original item? behavior or functionality of original: how does the physical item behave/function? relationship(s) and/or arrangement of parts or items: how is the original item arranged or related to its parts, or other items? creator's intention: what was the creator's intention in the creation of this item? how was the original item meant to be experienced?   intangible dimension object boundaries: what is the boundary of the digital file? how can the boundary of the digital file be identified and clarified? relationships and links: how does this digital file relate to other digital files of the same item or related items? is the file an exact copy of another file, is it an earlier or later version of the file, are there alternate views of the same item available, does this file show a detail of an item? does the file show a view of one item in a set of items (e.g., chess set, coffee service, etc.) and can these relationships between items and files be made explicit?   curatorial dimension preservation requestor and background: who was responsible for requesting that the material be preserved? what background information about this individual or group is useful to the curatorial context? preservation reason for digital item: why did this digital file become part of the preservation record? decisions of note: were additional decisions of note made concerning the preservation of the materials? special storage needs: do the files require specialized storage? special handling needs: do the files require specialized handling? multiple surrogates: do multiple representations of this work exist?       list multiple surrogates, different view: if multiple surrogates exist, please note files.       multiple surrogates of same view: are there multiple surrogates of the same view?       list multiple surrogates, same view: if multiple surrogates exist, please note files.   authentication dimension provenance: how has the digital content been maintained and who has had responsibility for the item(s)? where has the item been stored? how secure is the storage at each location? how long has the content been stored? digital file change tracking: has the file been modified? how has the file been modified?       date of change: when was the change made?       changes made: what changes were made?       changes made by: who made the change?       additional versioning information: why was the change made?
are there additional aspects of the change needing explanation?   authorization dimension copyright holder: who holds intellectual and legal copyright over the item? copyright holder's contact information: what is the copyright holder's contact information? rights management: is there a written copyright agreement between the copyright holder and the institution holding the digital file? can the item be digitized according to copyright law and/or any agreement(s) made with the copyright holder? how can the digital item be accessed, used and exchanged according to copyright law? copyright expiration: when does the current copyright expire?   intellectual dimension meaning: what was the artist expressing, or alternatively, what has the item come to signify? this is probably the closest contextual facet to traditional cataloging for cultural materials since it is a parallel idea to subject indexing. subject terms applied to cultural objects tend to be what the item is, or what it is a representation of, rather than what it is about, however. function: what was the object's intended purpose? technology: what technologies were used in the creation of the object? how were the materials manipulated to create the object? historical or cultural import: why is the original object important? cultural narratives: what is the story behind the object?   application of the dimensions even across non-textual cultural objects, practical application of the framework is likely to vary according to the needs and availability of institutional resources, as well as the specific items being preserved. for example, contextual information recorded in the framework for a scholarly research collection of items concerning pre-columbian pottery is likely to take a different form than that recorded for a limited number of items on the same topic held in a public library. the following examples, images of cultural objects chosen at random, show a practical application of the framework. note that while the data recorded for many of the questions is factual, some of the entries were simulated to protect the privacy of individuals and institutions.   example 1: tomb of the leopards. tarquinia, italy. c. 480 b.c. (image credit: al mare)     technical dimension format of original: analog digitization date: digital photograph, june 16, 2001 digitized by: al mare digitization format (image, audio, video, animation file type): image digital file format: jpg digital dimensions:       resolution: 300 dpi       bit depth: 24       file size: 91.5 mb       file dimensions as displayed/played: 1544 x 1131 digitization equipment (scanner, digital recorder, digital camera): digital camera canon digital ixus 300 post-digitization processes (cropping, corrections, data cleaning): auto-contrast, image rotation, auto-color balance carried out on derivative files       software used: photoshop 7.0       date completed: june 17, 2001, color balance adjusted on derivative files jan. 12, 2012       completed by: diana romano archival file named: mare_2001_00206.tif archival file located: drive: 2001_2 folder: mare derivative file(s) named: mare_2001_00206l, mare_2001_00206m, mare_2001_00206s derivative file(s) located: drive: 2001_2 folder: mare_derivatives equipment needed for playback: standard computer and monitor software needed for playback: image viewer capable of displaying tif and jpg files other technical concerns: color adjusted on all derivative files on jan. 12, 2012 by diana romano.
utilization dimension audience needs of original item (educational, leisure, legal, medical, youth, etc.): cultural audience needs of digital item (educational, leisure, legal, medical, youth, etc.): educational diversity of original audience (homogenous or heterogeneous): homogeneous familial diversity of digital audience (homogenous or heterogeneous): homogeneous academic social, political, organizational setting for original item's use: funerary social, political, organizational setting for digital item's use: educational audience task support, analyses and interactions for original item: viewing paintings in situ, lighting is required as this is an interior space audience task support, analyses and interactions for digital item: knowledge construction, visual analyses, visual comparisons to other items, ability to zoom and pan image would be useful accessibility of original item (non-technical restrictions): the original painting is located in a tomb in the necropolis of monterozzi, also known as the etruscan necropolis at tarquinia, italy and as of the date of this record this tomb can be visited. accessibility of digital item (non-technical restrictions): no accessibility restrictions, beyond those associated with copyright of digital image by photographer persistence of original item (importance of future access): as this painting is an important example of etruscan funerary art, access to the item and its surrogates should be maintained. the painting has historical, technological and cultural significance. persistence of digital item (importance of future access): the digital image should be maintained in perpetuity. physical dimension format of original: wall painting scale of original: wall height at highest point, approximately 2.5 meters material(s) of original: lime plaster and pigments technique(s) used in creation of original: fresco secco surface/textural/auditory/olfactory characteristics: wall surfaces have applied lime plaster. painted decoration was created through the use of natural pigments. behavior or functionality of original: stationary, painted representation used for funerary purposes relationship(s) and, or arrangement of parts or items: the walls and ceiling of the tomb have been painted. the painting represents a single scene of feasting within a tent in an outdoor setting, as can be seen by the plant life illustrated behind and beneath the couches and figures. a patchwork painted ceiling is found above the scene. two confronting leopards with plants are painted in the pediment area above the main scene opposite the entrance. creator's intention: although the exact intent behind the creation of this painting is unknown, the painting in this tomb represents etruscans feasting within a tent. images of feasting found in funerary contexts are thought to be a reflection of the feast which took place at the time of interment of the deceased. intangible dimension object boundaries: the digital image records the wall of the tomb opposite the entrance. the left corner of the tomb has been cropped slightly and is outside of the frame. the right corner of the tomb is just visible on the right. the top boundary is cropped within the second ring of the third central circle from the back wall. the bottom boundary is unclear and within the poorly preserved section of the lower wall. its boundary lies above where the wall meets the floor. 
relationships and links: an image taken at the same time shows a detail of the couple in the left corner of the wall opposite the entrance mare_2001_00207.tif. curatorial dimension preservation requestor and background: dr. katz, an archaeologist associated with the university, is currently examining the etruscan tomb paintings at tarquinia for a scholarly article. preservation reason for digital item: this image should be preserved since the wall paintings are in danger of deteriorating, and this deterioration will likely increase with the further passage of time. decisions of note: dr. katz noted that the color of the digital image is slightly green when compared to the original painting. although the derivative files have been color corrected, future images derived from the archival image should have their color adjusted when created. special storage needs: none special handling needs: none multiple surrogates: yes       list multiple surrogates, different view: detail of wall opposite entrance: mare_2001_00207.tif, details of musicians and dancers on the right wall when entering tomb: mumford_2007_00034, mumford_2007_00035, mumford_2007_00036, details of servants on the left wall when entering the tomb: mumford_2007_00037, mumford_2007_00038.       multiple surrogates of same view: no       list multiple surrogates, same view: not applicable authentication dimension provenance: the digital file was uploaded to the hard drive of al mare's personal computer on june 16, 2001. it was uploaded to the wikimedia site on december 12, 2011 by the photographer. diana romano copied the digital image received from the photographer on an external drive onto the networked drive named 2001_2 on june 17, 2001. the file can be found on the 2001_2 drive in the folder titled mare. the 2001_2 drive is accessible and modifiable by staff in the digital media department only. the drive and its contents are maintained by the information technology department and they oversee the security and perform routine back-ups and data analysis checks of the data stored on the drive. digital file change tracking: the derivative image files have been color balanced.       date of change: jan. 12, 2012       changes made: color balance adjusted       changes made by: diana romano       additional versioning information: dr. katz noted the image had a greenish cast that was not found in the original painting. the archival file has not been changed from its original state. authorization dimension copyright holder: al mare copyright holder's contact information: [email protected] rights management: the copyright holder of this image has released it into the public domain. anyone has the right to use the image for any purpose with no restrictions, regardless of country. copyright expiration: not applicable, as the copyright holder released the image into the public domain. intellectual dimension meaning: this scene is interpreted as a composite of actual events and underworld activities. the painting represents a funeral complete with banquet, music and games. we know from other etruscan tombs that hades and persephone, both underworld deities, partake in similar activities. one of the male figures shown in the image holds an egg, a common symbol in mediterranean art which may have connections to the theme of renewal. the colorful ceiling of the tomb, with its checkerboard and concentric circle design, indicates that the event takes place under a tent.
that the scene is set outdoors can be noted by the shrubs represented beneath and behind the couches. this event is believed to represent a funerary banquet which would have taken place just over a tomb's entrance at the time of the deceased person's interment. function: comforting the tomb's patron and the family and friends of the deceased. technology: carving of the tufa to form the tomb, choice of and processing of pigments, fresco painting technique. historical or cultural import: this is a fairly complex work for historical and cultural reasons. the lively figures represented are uniquely etruscan, yet the artist of the central banquet scene also shows the influence of greek art. the women's clothing is similar to that worn by their greek counterparts, although there are distinctly etruscan aspects to it. the men's mantles are purely etruscan, however. an additional etruscan aspect to the painting concerns the interaction taking place between the men and women. here women and men are shown banqueting together, and this is in contrast to greek scenes where women were excluded — unless they were hetairai (courtesans). this may suggest that the role of women in etruscan society was different from what was found among the greeks at that same time. the rendering of women as light-skinned, while men are shown as dark-skinned, is supported by a long artistic tradition. it speaks of women who do not spend time outside and all the cultural implications that go along with sheltering, protecting, and/or hiding women. musical instruments represented are also noteworthy and they too have been borrowed from their greek neighbors. cultural narratives: a family gathering with eating, drinking, and music with a funerary purpose. this scene does not contain explicit imagery associated with the underworld. the scene provides us with a sense that the etruscans had a refined and pleasant lifestyle. interestingly, this joie de vivre is not present in all etruscan funerary art. other etruscan tombs depict themes that are decidedly more violent and morbid. although the variation in themes found in etruscan tomb paintings is not well understood, it is believed that the themes may indicate political and familial dealings.   example 2: roebling and roebling. brooklyn bridge. (image credit: steve w. lee © 2005)     technical dimension format of original: analog digitization date: january 1, 2005 digitized by: steve w.
lee digitization format (image, audio, video, animation file type): image digital file format: jpeg digital dimensions:       resolution: 300 dpi       bit depth: 24       file size: 200.7 mb       file dimensions as displayed/ played: 2816 x 2112 digitization equipment (scanner, digital recorder, digital camera): digital camera canon powershot sd600 post-digitization processes (cropping, corrections, data cleaning): no post-digitization processes       software used: not applicable       date completed: not applicable       completed by: not applicable archival file named: lee_2005_02317.tif archival file located: drive: 2005_5 folder: lee derivative file(s) named: lee_2005_02317l, lee_2005_02317m, lee_2005_02317s derivative file(s) located: drive: 2005_5 folder: lee_derivatives equipment needed for playback: standard computer and monitor software needed for playback: image viewer capable of displaying tif and jpg files other technical concerns: none utilization dimension audience needs of original item (educational, leisure, legal, medical, youth, etc.): functional structure used for travel, transportation and commerce audience needs of digital item (educational, leisure, legal, medical, youth, etc.): educational, research diversity of original audience (homogenous or heterogeneous): heterogeneous diversity of digital audience (homogenous or heterogeneous): homogeneous academic social, political, organizational setting for original item's use: civic structure for travel, transportation and commerce social, political, organizational setting for digital item's use: educational audience task support, analyses and interactions for original item: allows travel over east river between the new york city boroughs of manhattan and brooklyn, supports movement of goods and people audience task support, analyses and interactions for digital item: knowledge construction, visual analyses, ability to zoom and pan image needed for close examination of architectural systems accessibility of original item (non-technical restrictions): fully accessible structure, which currently carries non-commercial motor vehicles, pedestrians and bicycles. in its earlier history it was also used by horse-drawn vehicles, streetcars and elevated trains accessibility of digital item (non-technical restrictions): fully accessible image, with no copyright or technical restrictions for usage. persistence of original item (importance of future access): the bridge plays an important role in allowing people to travel over the east river between the two boroughs of manhattan and brooklyn. there are several other bridges that span the east river currently. however, the bridge is a culturally significant landmark for both new york city boroughs. persistence of digital item (importance of future access): this image offers a general view of the bridge from manhattan. this item should be maintained indefinitely, with its destruction allowable only after the accessioning of another image with similar levels of quality and accessibility. 
physical dimension format of original: architecture scale of original: length of span 1595.5 feet (486.3 m); width: 85 feet (26 m); height above mean high water: 135 feet (41.2 m) material(s) of original: limestone, granite, cement, cast iron, steel technique(s) used in creation of original: masonry caissons and towers, suspended steel deck with steel cable stays surface/textural/auditory/olfactory characteristics: crossing the brooklyn bridge is often a communal experience, as many tourists make the passage from manhattan to brooklyn for the famous pizza served up by grimaldi's and the ice cream by the brooklyn ice cream factory. the crossing is often an exciting experience thanks to the cars passing by beneath the upper deck, the ships slowly plying the east river, passing cyclists and pedestrians, and the fluttering of flags atop the towers. behavior or functionality of original: functional structure providing passage over a body of water. relationship(s) and, or arrangement of parts or items: the bridge has a separate deck for pedestrians and bicyclists that is located above the deck for motorized vehicular traffic. the pedestrian/cyclist deck runs along the center of the bridge and allows for close inspection of the manhattan and brooklyn towers. creator's intention: john a. roebling, who designed this, the longest spanning bridge in the world at the time of its opening in may of 1883, created a lasting monument that attests to the ingenuity of humankind. intangible dimension object boundaries: view of bridge from east river bikeway southeast towards brooklyn. both towers are visible in this image. complete view of brooklyn side of bridge. manhattan end of bridge truncated. northwest end of bridge and manhattan outside of the frame to left of image, as viewed. relationships and links: view of bridge taken on the same day, by the same photographer, using the same equipment. view of bridge from brooklyn northwest toward manhattan lee_2005_02318.tif; detail of northwest (manhattan) tower lee_2005_02319.tif; close-up of deck and cables at northwest (manhattan) tower lee_2005_02320.tif; view of traffic crossing bridge as seen from southeast (brooklyn) tower towards manhattan lee_2005_02321.tif curatorial dimension preservation requestor and background: stephanie lapeer, historian who is working on a book which examines how the brooklyn bridge has been illustrated across time. preservation reason for digital item: this high quality image provides a clear view of the bridge in 2005. decisions of note: none special storage needs: the archived file is large and so accommodations may need to be made for its copying and storage for preservation purposes. special handling needs: none multiple surrogates: yes       list multiple surrogates, different view: grossman_2006_00010.tif (aerial view); grossman_2006-000011.tif (entire from the northwest); grossman_2006-000012.tif (entire from the southeast); richards_2003-44567.tif (detail of southeast (brooklyn) tower); lee_2005_02318.tif (view of bridge from brooklyn northwest toward manhattan); lee_2005_02319.tif (detail of northwest (manhattan) tower); lee_2005_02320.tif (close-up of deck and cables at northwest (manhattan) tower); lee_2005_02321.tif (view of traffic crossing bridge as seen from southeast (brooklyn) tower towards manhattan).       multiple surrogates of same view: none       list multiple surrogates, same view: not applicable authentication dimension provenance: the image was stored on steve w. 
lee's (photographer) password-protected personal computer prior to being copied to drive: 2005_5 folder: lee by diana romano on march 2, 2005. the 2005_5 drive is accessible and modifiable by staff in the digital media department only. the drive and its contents are maintained by the information technology department and they oversee the security and perform routine back-ups of the drive. digital file change tracking: no modifications made to the original file       date of change: not applicable       changes made: not applicable       changes made by: not applicable       additional versioning information: not applicable authorization dimension copyright holder: steve w. lee copyright holder's contact information: [email protected] rights management: copyright agreement with photographer allows the free use and exchange of the digital image. no restrictions. copyright expiration: life of the photographer plus 70 years. intellectual dimension meaning: although not overt, the work expresses the power and ingenuity of humankind. it is a powerful symbol of human accomplishment as the majority of the work was completed with basic hand tools and devices. function: bridge allows people and goods to pass over the east river between brooklyn, new york and manhattan, new york. technology: considered an important feat of engineering. john a. roebling is credited with developing a massive twisted steel cable consisting of multiple strands of wire wrapped within an outer layer of wire. this cable became a standard material used in suspension bridges. the design of this bridge made it the longest suspension bridge in the world at the time of its construction. its colossal masonry towers and high, arching suspended deck, rising to 135 feet above the mean high water mark of the east river at its midpoint, were designed so that the many ships navigating the busy salt waters of the tidal strait would not be impeded. the 135-foot clearance of this bridge became the standard for future bridge construction. roebling designed the bridge to withstand structural loads 5 times what it would be required to carry. these careful calculations have meant that the bridge has been able to withstand the heavier loads of modern vehicles and traffic and any potential weaknesses of materials. the bridge was widened from 4 to 6 lanes of traffic, with its deck trusses and approaches redesigned by david steinman, between 1944 and 1954 after trolley cars no longer used the bridge to cross the east river. historical or cultural import: beyond its technological importance to the history of bridge building and engineering in general is the fact that the bridge allows easy access between brooklyn and manhattan. as brooklyn's population was roughly 400,000 at the time the bridge was completed and the borough contained many rural areas, its construction helped ease the overcrowding in manhattan and ushered in the development of brooklyn. cultural narratives: while the bridge has become an iconic symbol for the new york metropolitan area, it has had its share of misfortunes associated with it. its architect, john a. roebling, died as a result of an accident which occurred while he was examining a location for the brooklyn tower of the bridge. washington roebling, who took over the position of chief engineer for the bridge after his father's demise, was paralyzed as a result of working in the caissons of the bridge. emily (warren) roebling, washington roebling's wife, oversaw the completion of the bridge after his paralysis.
interestingly, she is not frequently acknowledged in the construction record for the bridge. in total, twenty individuals died as a result of the bridge's construction. another twelve were killed within weeks of its opening in 1883 after a woman's scream panicked pedestrians who believed the bridge was going to collapse.   future work testing of the framework in a practical setting with collections of digital cultural objects is needed in order to judge its usefulness to the preservation process. testing the dimensions with various collections and in different settings would help determine their clarity and practicality, and so a prototype will be developed and tested by individuals in the cultural heritage sector. an additional analysis will be undertaken of the metadata associated with cultural objects in current practice and how what is found maps to the dimensions presented here.   conclusion the work presented here sought to discover a means of alleviating the contextual gap which exists in the metadata recorded for cultural objects digitized for preservation purposes. contextual information is fundamental to understanding many aspects of digital content. the framework presented here was developed to accommodate recording this critical information. through the development of eight dimensions of context a fuller record of digital content is posited. as keepers of our cultural heritage, it is our responsibility to ensure that our digital preservation efforts include recording contextual information about each object to enable future retrieval, assessment, management, access, and use.   notes [1] for the review of the literature undertaken to reveal the various kinds of context described in the digital preservation literature, see joan e. beaudoin. (2012). context and its role in the digital preservation of cultural objects. d-lib magazine, november 2012, 18(11/12). http://dx.doi.org/10.1045/november2012-beaudoin1   about the author joan beaudoin is an assistant professor in the school of library and information science at wayne state university where she teaches and performs research on metadata, information organization, digital libraries, digital preservation and visual information. prior to her position at wayne state university she was a laura bush 21st century librarian fellow at the school of information science and technology at drexel university. in addition to a doctor of philosophy in information studies at drexel university, she holds a master of science in library and information science degree in the management of digital information from drexel university, a master of arts in art history from temple university, and a bachelor of fine arts in art history from massachusetts college of art.   copyright © 2012 joan e. beaudoin   d-lib magazine november 2006 volume 12 number 11 issn 1082-9873 download statistics what do they tell us? the example of research online, the open access institutional repository at the university of wollongong, australia   michael organ project manager, research online university of wollongong   abstract a study was undertaken of download and usage statistics for the institutional repository at the university of wollongong, australia, over the six-month period january-june 2006.
the degree to which research output was made available, via open access, on internet search engines was quantified. google was identified as the primary access and referral point, generating 95.8% of the measurable full text downloads of repository content. further long-term studies need to be carried out to more precisely identify factors affecting download rates of repository content. this data will assist institutions and faculty in measuring research impact and performance, as an adjunct to traditional bibliometric tools such as citation indexes. introduction consideration of new performance measures as universities and funding bodies worldwide seek to quantify return-on-investment for research output, and more closely monitor individual academic and institutional performance, new forms of bibliometrics are being sought that go beyond journal quality assessment and into the area of research quality and impact [1]. the measurement of research impact is an area with which australian universities are currently grappling, due to the proposed introduction of a research quality framework (rqf) by the federal government in 2008 [2]. though based on similar assessment processes in the united kingdom and new zealand, the precise details of the rqf are not known; it is clear, however, that a variety of new performance measures are being considered to assess research impact [3]. the advent of the internet and an ever-changing information technology landscape have provided new statistical sources to supplement the data available from traditional tools such as citation indexes and journal circulation figures. for example, thomson's web citation index and google scholar are both developing statistical packages to monitor internet usage of research outputs. a new suite of bibliometric data available to research organisations and funding bodies is the download statistics generated by institutional repositories. as research outputs are increasingly exposed to the web via search engines such as google, organisations will be looking at, and making use of, institutional repository statistical packages. studies have already shown that the placement of research papers in open access repositories can increase citation rates by anywhere from 50 to 500% [4]. this is driving vice-chancellors and ceos to sign on to these new pieces of research infrastructure, with the promise of improving and enhancing the reputation of their institution and research staff. it is still early days in regards to the institutional repository movement. software remains under-developed and sustainable economic models are in their infancy. whilst return on investment may be in the order of 30:1, the message is yet to be disseminated amongst senior executives and funding bodies as to the real worth of an institutional repository [5]. in australia, for example, only 18 of the 38 higher education institutions have an open access repository, and only one of those (queensland university of technology) has mandated deposit of research material [6]. as with any new technology it is not clear precisely what effect these repositories will have on the research process, or what conclusions can be drawn from the statistics currently being generated, apart from the broad acceptance of the role they play in improving citation rates, as noted above.
in the current environment of rapid technological change and tightening budgets, research online (http://ro.uow.edu.au) at the university of wollongong provides a working example of a research-focused institutional repository. the download statistics generated by research online over the six-month period january-june 2006 may reflect general trends and provide pointers for future directions in regards to the measurement of research impact and performance. few such analyses have been published to date. though there are hundreds of such repositories worldwide, the majority have been in operation for a relatively short period of time (less than two years), and both the software and method of operation are still evolving. perhaps repository managers have been too busy dealing with software development issues, securing funding and staff, sourcing material, obtaining copyright clearances and implementing an often complicated upload process to engage in open debates on matters of statistics, though the topic is obviously an important one to them. in the case of the university of wollongong, ongoing statistical analysis of the repository will be vital in assisting with forward planning, especially for the university library as primary manager of the project. research online: digital commons at the university of wollongong the university of wollongong was the first australian higher education institution to install proquest's digital commons institutional repository software. as a proprietary solution, housed off-site, it differs from the other leader in the field, the open source dspace, developed by mit and hewlett packard. corporate it support from proquest, a relatively simple upload interface, numerous operating instances in the united states and united kingdom, a relatively quick installation and implementation process (one month), low cost, and a built-in statistics generation package resulted in the october 2005 decision by senior university executives to support a two-year implementation project, commencing in 2006. other packages such as dspace, eprints and harvest road hive were investigated by a specially formed committee during 2005; however, digital commons was considered by the university library and the research and innovation division to offer the best solution, at that time, for wollongong [7]. digital commons was installed locally on 12 december 2005 and the first papers were uploaded on 18 january 2006. the project had a clear goal from day one: make available, via open access, university of wollongong research output from the period 2000-2005, with the aim of improving citation rates and enhancing the reputation of the institution and its staff. the implementation team was also asked to monitor the performance of digital commons over the two-year project period. the built-in statistics package would provide timely data on the number of site hits and full text downloads being generated across the site. it was recognised that direct correlation between download statistics and corresponding improved citation rates, as revealed through the aforementioned citation indexes, would be a relatively slow process and subject to a one- to two-year delay, from point of upload to appearance in a relevant index. precise methods of measuring the impact of research online were yet to be developed, and its success, or otherwise, would not simply rest with what the download statistics told us.
other measures would include acceptance by faculty and senior executive, the number of items uploaded, funding success (i.e., in regards to sustainable funding for the institutional repository service), ease of implementation of the package and goodwill generated on campus between the various parties involved, be they librarians, academics or administrative staff [8]. the impact of the repository would feed into the rqf assessment process and also the metrics used in the compilation of higher educational institution ratings tables, such as the shanghai 100 [9]. the proposed rqf was a significant driver in the university of wollongong's adoption of an institutional repository, though it was acknowledged there would be positive spin offs for the institution even if the rqf did not eventuate, with some commentators suggesting that, in the long term, associated bibliometric data could be integrated into faculty workflow and individual performance assessment programs. a local understanding of the statistics generated by research online therefore needs to be developed. download statistics digital commons provides a relatively simple statistics package, with output presented in excel files. at the repository level local administrators generate statistics on the number of full text downloads of individual documents and the number of hits on the cover page (i.e., item description or abstract page) that links to the document. these statistics can be further broken down by day, month or year. reports can also be generated for individual collections or series, and academics can monitor hit and download rates for their own papers. in regards to where those searches and downloads are coming from, digital commons also provides the facility to identify external referrals down to the domain level. internal, or local, referrals and downloads are not measured. referral data is only available at the site level, and not at the individual author level. during the study period we were not able to correlate hit and download statistics for individual papers with the location of the referral, though other packages such as eprints have this feature. the various statistics capabilities of digital commons can be highlighted by looking at the specific case of research online. during the six-month period january june 2006 the number of papers uploaded was 561, comprising predominantly refereed and non-refereed journal articles and conference papers from the disciplines of informatics, engineering and commerce (86%). following upload, the items were usually discoverable by google within 24 48 hours. as of 30 june 2006 research online was also harvested by roar, doar, arrow, oaister, google scholar, yahoo! and scirus. arising from this high internet visibility, over the length of the study period there were 19,447 hits to the site. these comprised 10,661 full text downloads of documents in pdf form, and 8,740 cover page downloads. the remaining hits (46) were related to searches within the site. during the study period 6.2% of the uploaded documents received greater than 50 full text downloads (ftds), with the vast majority (79.5%) within the 1 50 ftd range (table 1). table 1: spread of downloads (%), january june 2006, research online no. 
table 1: spread of downloads (%), january–june 2006, research online

no. of downloads      | 0     | 1–10  | 11–50 | 51–100 | 100+
full text downloads   | 14.3% | 41.2% | 38.3% | 3.6%   | 2.6%
cover page downloads  | 3.8%  | 45.5% | 47.3% | 15%    | 2%

the difference between the spread of full text downloads and cover page downloads is of note, specifically the fact that during the study period only 3.8% of cover pages had no hits, compared with 14.3% of the equivalent pdf documents. whilst length of time on the system is obviously a factor in the number of downloads generated, these figures point to the obvious fact that not all repository content will be subject to the same rate of usage.

in regards to research online discovery, of the 8,740 cover page downloads, 2,134 (24.5%) were referrals from domains with two-letter top-level domains. for example, there were 542 referrals from the 'au' (australia) domain, 191 from 'uk' (united kingdom) and 179 from 'in' (india), with a total of 77 countries identified in the statistics provided. the majority of referrals to the cover page (6,606, or 75.5%) were from sites without two-letter top-level domains, such as those coming from the united states. in regards to full text downloads of repository content, 3,308 (31%) were referrals from sites with two-letter top-level domains. of these, 1,075 were referrals from the 'au' (australia) domain, 245 from 'uk' (united kingdom) and 190 from 'in' (india), with a total of 79 countries identified. once again, the majority of referrals to the pdf document (7,353, or 69%) were from sites without two-letter top-level domains.

in regards to country of origin, or referral information, webmasters have long been able to extract such data for individual web pages and sites. however, the institutional repository brings this facility into the hands of librarians, repository managers and individual academics, in a relatively simple form. for example, digital commons provided statistics relating to specific domains and urls. over the six-month period january to june 2006, the precise url for 5,449 (51.1%) of the full text downloads was known. of these, 95.8% were from google and its various domains around the world (e.g., www.google.com 1770; scholar.google.com 173; www.google.fr 139; www.google.ru 22; www.google.pk 17). the remaining 4.2% were from sites such as www.scirus.com and www.answers.com. in regards to discovery of cover pages, or abstracts, the figures vary slightly. according to the available data, 80.9% of cover page downloads were referred from google domains, and the remaining 19.1% from a variety of sources including yahoo! (12.1%). these two sets of figures suggested that users accessing research online from google are in the majority of cases going straight to the document pdf, rather than to the cover page. this is perhaps influenced by google's ranking of the pdf higher than the metadata page [10].

in looking at the monthly download statistics over the full study period, there was a steady increase between january and may, correlating with the continued increase in upload of content and the wider diversity of internet access points coming online. during may there were 3,433 full text downloads from the 404 items then on research online. however, in june there were only 2,684 full text downloads from 561 items. this discrepancy may be seasonal, and longer-term trend data will assist in clarifying this. one suggestion was that such fluctuations might be connected with the academic year. for example, the northern hemisphere summer holiday break and corresponding mid-year break in the southern hemisphere may impact upon download rates.
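the referral breakdown reported above is supplied by digital commons itself, but the underlying arithmetic is straightforward. the python sketch below shows one way such a tally might be computed from a raw referrer log; the log format, the helper name and the sample urls are hypothetical illustrations, since the actual export produced by research online is an excel file of pre-computed figures.

```python
# minimal sketch: tally repository referrals by two-letter country-code domain,
# assuming a plain-text log with one referring url per full text download.
# the function name, sample urls and log format are hypothetical.
from collections import Counter
from urllib.parse import urlparse

def referral_tally(referrer_urls):
    by_country = Counter()
    no_cc_tld = 0
    for url in referrer_urls:
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1] if "." in host else ""
        if len(tld) == 2:          # two-letter country-code domain, e.g. 'au', 'uk', 'in'
            by_country[tld] += 1
        else:                      # .com, .org, .edu and similar generic domains
            no_cc_tld += 1
    return by_country, no_cc_tld

sample = [
    "http://www.google.com.au/search?q=draganflyer+helicopter",
    "http://scholar.google.com/scholar?q=rfid+supply+chain",
    "http://www.google.co.uk/search?q=soviet+flaneuse",
]
by_country, no_cc = referral_tally(sample)
print(by_country.most_common(), no_cc)   # [('au', 1), ('uk', 1)] 1
```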
research online clearly identifies the articles with the greatest number of downloads, commonly referred to as the most 'popular' items. the top ten downloaded items for the period january–june 2006 are listed in table 2, both in regards to the number of full text downloads of the pdf and the number of cover page downloads, with their ranking indicated in brackets.

table 2: research online top 10 full text downloads & cover page downloads, january–june 2006 (nb: individual ranking given in brackets)

title | faculty | full text downloads | cover page downloads
clothing the soviet mechanical-flâneuse | creative arts | 395 (1) | 47 (22)
modelling the draganflyer four-rotor helicopter | informatics | 366 (2) | 197 (2)
the pros and cons of rfid in supply chain management | informatics | 341 (3) | 242 (1)
strike 1912 – looking for australia's earliest workers' film | history | 237 (4) | 58 (12)
introducing location-based services into information technology curriculum: reflections on practice | informatics | 179 (5) | 48 (19)
a century of the philippine labour movement | history | 179 (6) | 72 (10)
simmel, ninotchka and the revolving door | creative arts | 152 (7) | 29 (45)
a disgrace to our australian civilisation: mothers, miners and the commemoration of mortality in new south wales | history | 149 (8) | 25 (89)
a century of the labour movement in australia | history | 146 (9) | 46 (25)
the shooting of william (billy) mclean | history | 130 (10) | 30 (64)
conducting polymer-carbon nanotubes composites | science | 48 (41) | 80 (6)
personal firewall for pocket pc 2003: design and implementation | informatics | 38 (69) | 76 (7)
what are the benefits in crm technology investment? | informatics | 103 (15) | 76 (8)
image analysis using line segments extraction by chain code differentiation | informatics | 87 (17) | 75 (9)
visual perceptual process model and object segmentation | informatics | 29 (109) | 89 (4)
a fast neural-based eye detection system | informatics | 105 (14) | 83 (5)
japanese technology for aged care | arts | 29 (107) | 95 (3)

the figures reveal that there is no definitive correlation between the two download types. for example, the article 'clothing the soviet mechanical-flâneuse' ranks number 1 in regards to full text downloads (395), but only number 22 in regards to cover page hits (47). this indicates that primary access to the document, once it is discovered, is via the pdf. against this, 'modelling the draganflyer four-rotor helicopter' ranks number 2 in both fields, with 366 full text downloads and 197 cover page downloads. in this latter case, we know from anecdotal evidence that this paper was set as a class reading by an academic in holland during april 2006, thereby accounting in part for the relatively large number of hits. however, such results could also be related to how the item appears on the google search screen, and whether the cover page or the pdf link is the more prominent there. in any instance, it is difficult to be precise due to the number of variables in the search and discovery process.

in addition, these statistics do not by themselves offer an explanation as to why particular articles rank so highly, raising the question: what is the cause of high (or low) download rates? is it the inherent quality of the article, or perhaps the standing of the author? is it related to the length of time in which the item has been available on open access, or has a lecturer or professor set the item as a required course reading and directed students to download it, as was the case with the draganflyer article?
does the paper possess a generic title, which is easily picked up by search engines and gives rise to a higher ranking? in such a case the high hit rate may have nothing to do with quality or impact. similar questions were asked at the university of otago, new zealand, when between november 2005 and march 2006 a suite of 220 school of business papers generated 18,744 full text downloads from 80 distinct countries via the eprints repository software [11]. a detailed explanation for this high download rate was not forthcoming at the time, and this points to the need for further studies in this area. the aforementioned questions highlight the pitfalls in drawing conclusions from a limited data set. nevertheless, the research online download statistics are of use.

what do the download statistics tell us?

the need to compare and analyse statistical data across institutional repositories has been recognised, as has the desirability of collating download statistics for individual articles from institutional and discipline-based repositories and publisher databases in order to present a true picture of their popularity or otherwise as determined by hits [12]. publisher concern over the negative impact on their hit rates by the existence of archive copies is a hindrance. with google a significant research tool for students and an increasing number of academics, the ability to have one's output discovered by it in a quick and efficient manner is of primary importance. in the case of research online, the download statistics primarily indicate that institutional repository software packages such as digital commons achieve their goal of exposing research output to internet search engines such as google and yahoo! the important role google plays in the research and discovery process has become apparent. whether the university of wollongong figures are similarly reflected in the download statistics of other institutional repository packages such as eprints and dspace is not known to the author, though the dominance of google is most likely universal.

another obvious finding from this limited study is the fact that, due to research online, full text downloads of research output occurred that would not otherwise have taken place. they also occurred in addition to those being generated by publisher online databases, personal web sites and discipline-based repositories. beyond this, the data provided reveal a number of developing user behaviours. it is clear that researchers are accessing research online in a variety of ways. primarily they are coming via google. in the majority of instances they go from google to the document pdf, rather than to the abstract or cover page. this is reflected in the number of full text downloads as compared to the number of cover page downloads (10,661 : 8,740). at present the number and range of material on research online is limited, and more detailed assessments of the download statistics need to be made covering a wider variety of material and a longer time period. for the present, the numbers available tell an interesting story. they point to the success of the repository software and open access protocols in making research output available on the internet. beyond this, the repositories themselves may have impacts that were never foreseen. for example, it has been observed that academics and researchers may be influenced by their download statistics to alter the direction of their research [13].
on the basis of relatively low download statistics for a particular strand of research, they could decide to pursue a more popular strand, as identified to them by higher download rates. this would be a reflection of the increasing trend for research initiatives to be driven by business imperatives and government policy, rather than by the pursuit of pure research. the usefulness of institutional repositories in this regard may assist with ensuring their long-term financial sustainability. yet the statistics available on research online also reveal that some of the best performing (i.e., most popular) items are from areas that do not figure in the traditional citation indexes, such as the creative arts and history. download statistics provide a powerful tool for repository managers and librarians to sell the importance of this innovative technology to faculty and funding authorities. this will perhaps be their most important use.

acknowledgements

i would like to thank my colleagues helen mandl, natalie keene and lucia tome for their advice and for commenting on an early draft of this article. mention must also be made of arthur sale, stevan harnad and susan gibbons for their inspiration.

references

1. l. tome and s. lipu, indicators of journal quality, research & development discussion paper no. 6, university of wollongong library, 2004, 14p.
2. research quality framework, department of education, science and training (dest), australian government, canberra. [web site, accessed 6 november 2006]. url: .
3. hon. j. bishop, knowledge transfer and engagement forum keynote address by the minister for education, science and training, sydney, 16 june 2006. [web site, accessed 6 november 2006]. url: .
4. c. hajjem, s. harnad and y. gingras, 'ten-year cross-disciplinary comparison of the growth of open access and how it increases research citation impact', ieee data engineering bulletin, 25(4), 2005, 39-46; 'the effect of open access and downloads ('hits') on citation impact: a bibliography of studies', the open citation project, 6 june 2006. [web site, accessed 6 november 2006]. url: .
5. j. houghton, 'scholarly communication costs and benefits: the role of repositories'. presentation to the successful repository conference, brisbane, 29 june 2006. [powerpoint presentation, available on web site, accessed 6 november 2006]. url: .
6. t. cochrane, 'how can a repository contribute to university success?' presentation to the successful repository conference, brisbane, 29 june 2006. [powerpoint presentation, available on web site, accessed 6 november 2006]. url: .
7. content management system evaluation team, university of wollongong, 2005. [web site, accessed 6 november 2006]. url: .
8. b. weaver, 'success is in the eye of the beholder'. presentation to the successful repository conference, brisbane, 29 june 2006. [powerpoint presentation, available on web site, accessed 6 november 2006]. url: .
9. academic ranking of world universities 2005, institute of higher education, shanghai jiao tong university, china. [web site, accessed 6 november 2006]. url: .
10. l. carr and a. sale, irs: interoperable repository statistics, a proposal to activity area (iv) pilot services of the call for projects in the jisc digital repositories programme, march 2005, 12p. submission by the university of southampton, key perspectives ltd., the university of tasmania, long island university and the counter project. [web page, accessed 6 november 2006]. url: .
11. n. stanger and g. mcgregor, 'hitting the ground running: building new zealand's first publically available institutional repository', the information science discussion paper series, number 2006/07, march 2006, 10p.
12. s. harnad, 're: self-archiving, journal usage and cancellations', american scientist open access forum, 8 october 2005. [list posting, accessed 6 november 2006]. url: .
13. s. gibbons, 'making a repository a success with your academic staff'. presentation to the successful repository conference, brisbane, 29 june 2006. [powerpoint presentation, available on web site, accessed 6 november 2006]. url: .

copyright © 2006 michael organ
doi:10.1045/november2006-organ

levels of service for digital repositories

d-lib magazine, may 2002, volume 8 number 5, issn 1082-9873

william g. lefurgy [1], u.s. national archives and records administration <[email protected]>

abstract

digital preservation research has made important gains in recent years, and the capability for libraries and archives to manage digital collections continues to grow. this is obviously good news in that an expanded body of digital records, publications, and other objects will be preserved and made available. what is less obvious is that there is no magic bullet in the offing for dealing with all permanent digital materials: only a fraction will meet necessary conditions for optimal preservation and use. materials that deviate from these conditions can still be included in digital collections, but finding and using them will be more difficult, perhaps substantially so. this article outlines conditions that govern the persistence of digital materials and suggests a model for future levels of service for digital repositories.

background

over the course of a 25-year career working with historical documents and publications, i have spent my share of time worrying about the problems posed by materials in digital form. from my first job as a manuscripts curator at the maryland historical society, it was clear that digital materials would eventually have to be included in the holdings of many repositories, both because the technology held great promise for facilitating research and because certain recorded information would exist in no other form. this sense of inevitability only grew as i moved to the baltimore city archives and then to the u.s. national archives and records administration. the worry came from wondering how to preserve digital materials over the course of technological change and keep them easily available to researchers. there seemed to be no easy answer, and my personal experience (such as getting "404—page not found" responses when clicking some hyperlinked footnotes) did not instill much confidence that a solution was imminent. i could only assume that technology itself would eventually provide the means for repositories to overcome all the problems associated with digital materials. my work over the last several years in helping federal agencies manage a diverse and ever-growing array of digital materials has led to a rethinking of this assumption. it became obvious that some digital materials presented more problems than did others in terms of keeping them available for the long term.
results from digital preservation research also began to indicate that the future held an uneven promise for various kinds of materials. what this all means is that digital holdings for many repositories will not be equal in terms of how effectively they can be preserved and made available. the easiest way for me to think about it is as differences in levels of service: repositories will be able to do more with certain kinds of digital materials and less with others. this has implications for evaluating materials for potential permanent retention as well as for conceptualizing systems that will manage and preserve such materials. the concept of levels of service reduces my worry. while it does affirm that poorly accessible materials will always be with us, it also provides a path beyond the monolithic view of digital materials where all preservation and access challenges are grouped into one insoluble problem. by teasing apart the underlying strands of the issue, it is possible to envision some practical solutions. introduction archivists, librarians, and others with an interest in preserving and making available digital information face an impending paradox. this stems from the prospect of developing solutions for long-term management of digital records, publications, and other objects. once in place, these solutions will help fill a pressing need for repositories to preserve and make available significant information in electronic form. but as digital materials are increasingly acquired it will become obvious that not all can be equally preserved and used. digital materials vary in how they are constructed, organized, and described, and these factors will play a huge role in determining preservation and access possibilities—even when advanced systems, technologies, and techniques are available to repositories. current research indicates that digital materials can be managed independent of specific technology. "persistence" is the term used to indicate the degree to which this is possible [2]. for complete persistence, materials must adhere to strict conditions regarding their construction and description. these conditions make it possible to use technology to dynamically recreate a digital object based on explicit and consistent rules defining the object's content, context, and structure. but in a world where few standards govern the technical construction of a digital item (a report can exist in any one of a dozen common file formats) and fewer still govern how an item is described (the report may or may not identify an author, date of issue, or other descriptors), it is realistic to expect that many materials will not fully meet the rigorous conditions for persistence. this likely will remain the case even when user-friendly tools are established for creation of persistent digital materials. inevitably, materials will vary in their degree of compliance with established rules. looking ahead, digital collections can be seen as falling into three levels: optimal, enhanced, and minimal. the optimal level will consist of fully persistent digital materials that can be placed in an information technology architecture that permits their maintenance in perpetuity without significant alteration of content, structure, or any other significant characteristic. such materials will also retain their original context (e.g., their relationship among themselves and with other materials), and they will remain discoverable through multiple attributes. 
the enhanced level will have materials that possess some persistent qualities but lack others. perhaps the structural rules are variable or the metadata are incomplete, but the materials will nevertheless permit a degree of continuing preservation and discoverability. the minimal level will be populated by digital materials that have few, if any, persistent characteristics. they might consist of loosely structured files in various native formats with minimal metadata; preserving their significant characteristics and making them discoverable will be difficult. these levels will dictate the extent to which a repository can manage collections of digital materials and make them available to users.

conditions required for persistence

broadly speaking, persistence requires two parts. the first is an architecture that defines the system that will acquire, manage, preserve, and access digital materials in a repository (or among repositories). the second is a specification for the materials that will go into the system. the most influential conceptual construct for both parts is the reference model for an open archival information system (oais). the oais model outlines a design where digital materials are placed into a package with three basic elements: content information (the electronic bits along with all details needed for a user to interpret the bits); preservation description information (details regarding provenance, context, reference, and fixity/authenticity); and descriptive information (metadata needed to enable identification of the package). materials are transmitted to repositories through submission information packages (sips). use of sips enables persistence: they implement decisions regarding the essential characteristics of digital materials and provide for preservation and access in a manner that is independent of specific technology [3]. the oais model is dependent upon construction of sip elements according to detailed, rigorous, and transparent rules, since this is the only way an automated system can effectively manage and manipulate digital information. for optimal performance, each individual digital object within a package must be consistently described and structured. it is possible to modify existing materials (adding metadata, converting to different formats and so forth) to build sips, but this is labor intensive and may raise questions about the authenticity of the materials. the preferred means would be to incorporate the rules and consistency needed for sips into the technology used to create digital materials. creation would have to occur under an enduring, widely accepted, and carefully controlled process—which is a radical departure from current practice.

extensible markup language (xml), a universal format for structured documents and data, offers a practical demonstration of creating materials under the controlled process needed for sip elements. xml can allow for highly reliable abstraction of a digital object's significant properties, such as structure, formatting, and contextual relationships. the abstractions are provided through document type definitions (dtds) or schemas, which are expressed rules governing how materials are constructed and presented. these rules "allow machines to carry out rules made by people. they provide a means for defining the structure, content and semantics" of digital materials [4]. other methods of creation apart from xml can also be used to create persistent materials.
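as a concrete illustration of "expressed rules" of this kind, the sketch below uses the third-party lxml library in python to validate a toy record against a small dtd: a record that follows the declared structure validates, while one missing its descriptive elements is rejected. the element names (report, title, author, issued, body) are invented for the example and are not drawn from any real sip specification.

```python
# minimal sketch: a dtd as machine-enforceable rules for structure and description,
# assuming the third-party lxml library is installed. element names are hypothetical.
from io import StringIO
from lxml import etree

RULES = etree.DTD(StringIO("""
<!ELEMENT report (title, author+, issued, body)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT issued (#PCDATA)>
<!ELEMENT body (#PCDATA)>
"""))

conforming = etree.fromstring(
    "<report><title>annual report</title><author>a. writer</author>"
    "<issued>2002-05-01</issued><body>text of the report</body></report>")
nonconforming = etree.fromstring("<report><body>undescribed content</body></report>")

print(RULES.validate(conforming))     # True: structure matches the declared rules
print(RULES.validate(nonconforming))  # False: required descriptive elements are missing
```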
but whatever method is chosen, it must be capable of generating materials according to known rules that an information technology architecture can manage into the future. the basic ideas behind oais are represented in many of the most promising digital preservation research projects. the library of congress national digital library program, for example, relies on creating exacting electronic reproductions using standard formats and assigned metadata [5]. the australian victorian electronic records strategy uses highly controlled methods of electronic document creation and description to enable archival management [6]. the curl exemplars for digital archives (cedars) project in the united kingdom is based directly on the oais model and involves, among many other activities, developing practices for creating detailed representation information about the significant properties of digital objects [7]. cedars is also exploring emulation as a means of digital preservation. emulation involves developing encapsulated packages that contain a specification for recreating an original computer application to view and interact with objects created by the application [8]. the specification could be an abstraction of the original software code and related documentation, or it could be based on a customized emulator built to provide access via a host platform. in either case, the specification provides the expressed rules needed for technological independence [9]. in developing its electronic records archives (era), the u.s. national archives and records administration also looks directly to the oais model. the era project is focusing on persistent object preservation, which involves managing digital objects with clearly defined structures and metadata to permit ongoing access and retrieval [10]. all of these efforts focus on work with highly persistent digital materials: that is, materials whose context, content, and structure are transparent and well-defined.

the bottom line is that we are headed for a future where those digital materials that conform to exacting rules can be effectively preserved and accessed. but these materials will constitute only a small fraction of the overall universe of digital information. nearly all the digital materials now in existence and many of those yet to be created do not use clear and consistent rules (proprietary software depends on hiding many of the code-based rules used to structure and display objects, for example) and thus will not be easily managed through applications of the oais model. yet vast quantities will have value that warrants continued preservation, even if they are in a persistently non-persistent form.

levels of service defined

the scenario outlined above will require many digital repositories to adopt a strategy for providing different levels of service for different parts of their collections. levels of service can best be thought of as a matrix with one set of values determined by the available technology and the other set determined by the degree to which digital materials have persistent qualities. the first set depends on incremental development of new and improved tools and processes and can be seen evolving as follows:

phase i: rudimentary tools and processes to preserve and access materials — roughly equivalent to what is obtainable today.
phase ii: more advanced tools and processes — roughly equivalent to what will be obtainable over the next decade.
phase iii: very advanced tools and processes — roughly equivalent to what will be obtainable after the next decade.

the second set of values is tied to the degree to which digital materials are persistent (based on consistent and transparent rules for description and structure, standardized file formats, and so forth). in general terms, degrees of persistence can be represented by three categories:

high: fully persistent materials that enable high confidence for ongoing preservation and access.
medium: partially persistent materials that enable medium confidence for ongoing preservation and access.
low: marginally persistent materials that enable low confidence for ongoing preservation and access.

given that persistence is closely tied to the clarity and consistency of the rules used by digital materials, it follows that materials that are highly structured tend to be inherently easier to preserve and access over time. conversely, less structured materials tend to be harder to manage. another way to categorize inherent persistence is whether the materials are homogeneous (closely tied to known and consistent rules regarding structure, technical parameters, and metadata) or heterogeneous (not closely tied to known and invariable rules). for some homogeneous materials the rules are completely unambiguous, such as those used by delimited ascii (along with associated metadata) to represent a database file. because the rules are so clear, the technology and processes needed to preserve and access the file are comparatively simple, and the file can be kept available in perpetuity. other bodies of homogeneous materials are tied to rules that are less explicit but that are known and consistent to some minimal extent. for example, if materials are in a format that will remain accessible far into the future, and if the metadata are sufficient, they can have some degree of persistence.

materials that are not connected to transparent, consistent rules are heterogeneous. most often with heterogeneous collections the rules are varied, unclear, or both. there could be a mix of file formats based on different operating systems, or a jumble of methods used to structure file content. a good example is the set of files on most personal computer hard drives, which typically contains a mix of spreadsheets, word processing, e-mail, images, and other formats and types. another example would be most world wide web sites, which are made up of html documents, graphic and audio files, java and cgi script, and other highly variable elements. heterogeneous materials generally have low persistence. since each object can differ from the next in unpredictable ways, effective and efficient preservation and access is difficult. there are options to convert heterogeneous materials to more homogeneous forms, but this is not always a practical solution, both in terms of cost and of maintaining record integrity. heterogeneous materials can be preserved as a stream of bits, but they will generally be difficult to use over time as file formats become obsolete and other rules become increasingly opaque.

the diagram below provides a graphical model of how levels of service will likely evolve.

figure 1: levels of service over time

as phase i of the diagram indicates, repositories now have two basic service options for digital materials. enhanced service is possible for some homogeneous materials (such as ascii delimited data), while minimal service is available for other materials.
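the delimited-ascii example above is worth making concrete: because the rules are explicit and the metadata minimal but sufficient, a few lines of code written long after the fact can still interpret the file correctly. the python sketch below is only an illustration under assumed, hypothetical column names and metadata fields; it is not drawn from any particular repository.

```python
# minimal sketch: a delimited ascii file plus a little metadata is nearly
# self-describing, which is why such homogeneous material is easy to service.
# the column names and metadata fields are hypothetical.
import csv
import io

metadata = {"delimiter": ",", "encoding": "ascii", "columns": ["id", "title", "year"]}
raw = "id,title,year\n1,annual report,1998\n2,field survey,1999\n"

reader = csv.DictReader(io.StringIO(raw), delimiter=metadata["delimiter"])
records = list(reader)
print(records[0])   # {'id': '1', 'title': 'annual report', 'year': '1998'}
```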
change will come, however, as a result of new technology and techniques generated by vendors and by the library and archival communities. this overall process of advancement is depicted in the diagram through the first "cloud," which represents the research and technology that will yield improved solutions. one predictable outcome from this first cloud will be improved service for homogenous digital materials. wider collaborative use of markup languages and associated schema, for example, will expand the categories and formats of digital materials for which repositories can provide an enhanced level of service. projects such as cedars and era, among others, will also lead to much greater understanding about the processes and technologies necessary for building persistent information technology architectures. phase ii indicates the general outcomes of these advances. highly structured homogenous materials (the green arrows) will continue to occupy the top service level, and the most significant change will be capability to provide better service for other homogenous materials (the blue arrows). note that blue arrows go to both the enhanced and to the minimal levels, since the improvements will be uneven in relation to all the materials potentially eligible for acquisition. this split would occur, for example, if a uniform xml-based process were used to generate some federal government reports while other federal reports were created in a less persistent manner. the xml-associated reports could reside in a higher level of service than the other reports, which would continue to occupy the minimal level. all heterogeneous materials (the red arrows) would also remain in the minimal level of service. progression to phase iii will require additional improvements, which are represented in the diagram as flowing from a second cloud. the primary feature of this phase will be wide availability of persistent materials that can be effectively managed in an integrated architecture. with green, blue, and red arrows pointing to the optimal level, the model posits that all varieties of digital materials will have the potential for robust preservation and access. but less persistent materials will also continue to occupy lower levels of service. these materials will include vast stores of legacy data as well as more current items that, for one reason or another, lack complete persistence. it is difficult to say what percentages of files will be associated with any particular level of service under any of the phases. that will depend on what comes out of the two clouds, most particularly with respect to changes in how stringently digital materials are created. one fact is certain: getting large quantities of materials into the enhanced and optimal service levels will require dramatic change in how digital materials are now produced and maintained. it will be possible to use tools and processes to modify or manipulate digital materials to move them into higher levels of service. this could involve converting to different file formats, reformatting content, attaching metadata, or "wrapping" files in some kind of software container. emulators or viewers may also enable suitable access to native formats. before undertaking such work, it will be necessary to analyze a number of factors, including the extent to which file integrity could be harmed and how much the effort will cost. 
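one reading of the levels-of-service matrix described above can be captured in a few lines of python. the toy encoding below records only the best service level each persistence category can reach in each phase; it deliberately glosses over the unevenness stressed above (some homogeneous material stays at the minimal level in phase ii, and much less persistent material will linger at lower levels even in phase iii), and it is an illustration rather than part of the original model.

```python
# toy sketch of the levels-of-service matrix: (technology phase, degree of
# persistence) -> best attainable service level, under one reading of figure 1.
BEST_LEVEL = {
    ("phase i",   "high"):   "enhanced",
    ("phase i",   "medium"): "minimal",
    ("phase i",   "low"):    "minimal",
    ("phase ii",  "high"):   "enhanced",
    ("phase ii",  "medium"): "enhanced",   # uneven in practice; some stays minimal
    ("phase ii",  "low"):    "minimal",
    ("phase iii", "high"):   "optimal",
    ("phase iii", "medium"): "optimal",    # potential only, given sufficient effort
    ("phase iii", "low"):    "optimal",    # potential only; much will remain lower
}

def best_service_level(phase: str, persistence: str) -> str:
    return BEST_LEVEL[(phase, persistence)]

print(best_service_level("phase i", "high"))    # enhanced
print(best_service_level("phase iii", "low"))   # optimal (at best)
```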
some categories of files will have high enough value to warrant significant effort to move them into higher service levels, but for the foreseeable future many files will likely not have mobility among levels of service. conclusion there is much to be optimistic about with regard to digital preservation. technologies and processes are on the horizon (or in some cases already here) that will enable libraries and archives to do much better in terms of keeping and servicing digital materials. as important as this development is, however, it raises questions about how repositories will cope with the potentially enormous quantity of materials—both legacy as well as more contemporary—that will not easily fit into higher levels of service. these issues range from making decisions about timetables for acquiring digital materials to planning for operational systems to overall expectations for digital preservation in general. given that the initial capabilities of emerging preservation systems will be oriented toward homogeneous materials with uniform, well-defined rules, it makes sense to investigate policies and methods that encourage expanded creation of such materials. structured markup languages such as xml might prove to be the solution, most particularly if uniform schema are widely and consistently used. the greater the homogeneity, the better the level of service. the path is less clear regarding more heterogeneous materials, despite the fact that they currently comprise the vast majority of all digital materials in existence. technology will certainly offer better opportunities for doing more with such materials, but they will lag behind in terms of service. it is clear, however, that archives and libraries will need to make plans for coping with materials that lie on the continuum between optimal and minimal serviceability. the need to contend with varied levels of service does not, of course, in any way diminish the urgency for bringing digital materials under library and archival control. if anything, the prospect of inevitable differences in serviceability should cause repositories to reexamine strategies that involve deferring responsibility for digital materials. repositories may recognize an eventual need to manage such materials, but they may hold out hope that some future technology will solve the problem. but the problem will resist a simple solution. most digital materials now in existence—as well as those that will be created during the foreseeable future—will remain a challenge to manage for years to come, regardless of technological advances. the best course might well be to start capturing and managing appropriate digital materials now with the expectation that the future will bring varied improvements in preservation and access options. notes and references [1] the views and opinions expressed herein are those of the author and do not necessarily reflect those of the u.s. national archives and records administration. [2] reagan moore et al. "collection-based persistent digital archives part 1," d-lib magazine, march 2000, volume 6 number 3, . [3] consultative committee for space data systems, reference model for an open archival information system, june 2001, pages 2-4 through 2-7, . [4] world wide web consortium (w3c), . [5] library of congress national digital library program, . [6] public record office victoria (australia), victorian electronic records strategy final report, . [7] curl exemplars for digital archives (cedars), . 
[8] jeff rothenberg, avoiding technological quicksand: finding a viable technical foundation for digital preservation, a report to the council on library and information resources, 1999, . [9] david holdsworth and paul wheatley, emulation, preservation and abstraction, . [10] kenneth thibodeau, "building the archives of the future: advances in preserving electronic records at the national archives and records administration," d-lib magazine, february 2001, volume 7 number 2, .

doi: 10.1045/may2002-lefurgy

institutional repositories and digital preservation: assessing current practices at research libraries

d-lib magazine, may/june 2011, volume 17, number 5/6

yuan li, syracuse university, [email protected]; meghan banach, university of massachusetts amherst, [email protected]
doi:10.1045/may2011-yuanli

abstract

in spring 2010, authors from the university of massachusetts amherst conducted a national survey on digital preservation of institutional repository (ir) materials among association of research libraries (arl) member institutions. examining the current practices of digital preservation of ir materials, the survey of 72 research libraries reveals the challenges and opportunities of implementing digital preservation for irs in a complex environment with rapidly evolving technology, practices, and standards. findings from this survey will inform libraries about the current state of digital preservation for irs.

introduction

digital preservation is a significant problem facing libraries. libraries are struggling with how to preserve the scholarly and cultural record now that this information is increasingly being produced in digital formats. in the age of print, information was relatively simple to preserve, since paper is a durable format when made properly and stored under the proper conditions. now that we have entered the digital age, however, preserving information has become a more complex task. digital information is fragile and faces many threats, including technological obsolescence and the deterioration of digital storage media. the ultimate irony, as pointed out by paul conway, is that "as our capacity to record information has increased exponentially over time, the longevity of the media used to store the information has decreased equivalently." [1] for example, illuminated manuscripts have lasted for over 1000 years, but a cd will degrade in as little as 15 years. perhaps an even greater threat than the deterioration of storage media is technological obsolescence. in an article titled "digital longevity: the lifespan of digital files", julian jackson states, "the rate of change in computing technologies is such that information can be rendered inaccessible within a decade." [2] in many cases software upgrades may not support legacy file formats, and without the intervention of digital preservation techniques the information will no longer be accessible. if the digital scholarly record is to be preserved, libraries need to establish new best practices for preservation.
for their part, creators need to be more proactive about archiving their work. the relatively recent development of institutional repositories (irs) offers some promise in ensuring the long term preservation of digital scholarship. however, there has been some debate about whether irs were intended to provide long-term preservation of digital scholarship. in her foreword to the 2007 census of institutional repositories, abby smith writes, "a conspicuous fact about institutional repositories, confirmed by the miracle project findings, is that there is no consensus on what institutional repositories are for." [3] she goes on to say: for example, many institutions that plan or pilot test repositories are motivated by the desire to change the dynamics of scholarly communication ... other institutions identify stewardship of digital assets, especially their preservation, as a key function of a repository. yet survey data confirm that repositories are not yet providing key preservation services, such as guaranteeing the integrity of file formats for future use. [4] perhaps one of the most often quoted definitions of an institutional repository is from clifford lynch's 2003 essay "institutional repositories: essential infrastructure for scholarship in the digital age." in this essay, lynch defines irs as: a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members. it is most essentially an organizational commitment to the stewardship of these digital materials, including long-term preservation where appropriate, as well as organization and access or distribution. [5] this study aims to find out whether long-term preservation is part of the mission of institutional repositories at association of research libraries member institutions, and if so, what plans irs have to provide long-term preservation of their content.   methods this study investigated the following questions related to digital preservation of ir content: is preservation part of the mission and goals of irs? what preservation policies exist for irs? what preservation strategies are irs currently implementing? are the necessary rights and agreements in place to preserve the content of irs? are all of the materials in irs of sufficient quality and importance to warrant long-term preservation? do irs currently have the necessary sustainability in terms of funding and staffing to carry out long-term preservation of their contents? the authors of this study decided to send out a survey to arl libraries, because we thought that the majority would have irs. we also thought that most arl libraries would at least be thinking about digital preservation at this point, if not actively taking measures to ensure long term preservation of the contents of their irs.   literature review the growing body of literature available on digital preservation and institutional repositories comes from a diverse group of scholars representing equally diverse perspectives. this literature review provided insight into different facets of the authors' survey, such as digital preservation methods and strategies, content recruitment and sustainability issues related to institutional repositories, and opportunities and challenges concerning digital preservation in the context of institutional repositories. 
however, very few articles were found which examine current digital preservation practices of institutional repositories in the united states. librarian charles w. bailey, jr.'s "institutional repository bibliography" [6] offers a comprehensive view of the publication record on institutional repository topics, the majority of which focuses on best practices, predictions, and opinion papers, as opposed to statistical analysis. compared with the large number of articles listed in the section on general literature related to irs, the subsection "institutional repository digital preservation issues" [7] has only a small number of publications listed.

with digital content increasing exponentially in the current information age, libraries have come to realize the importance of digital preservation. paul wheatley states that "careful consideration must be given to the preservation needs of materials to be archived within an institutional repository" [8]. nancy y. mcgovern and aprille c. mckay [9] also described several significant opportunities for digital preservation offered by irs in their article published in 2008, including digital content management, opportunities for content creators to learn about their role in digital preservation, and faculty legacy preservation. long-term digital preservation came to scholars' attention even before the birth of irs in 2002. in 1996, don waters and john garrett wrote a landmark report calling attention to the need for digital preservation by stating, "failure to look for trusted means and methods of digital preservation will certainly exact a stiff, long-term cultural penalty." [10] during the same year, the digital preservation coalition was established in the united kingdom, and in the united states the library of congress developed a national strategy for preserving digital information. in 2002, the consultative committee for space data systems (ccsds) published the recommendation for space data system standards reference model for an open archival information system (oais). the oais model provides a comprehensive framework for all functions required for digital preservation, including ingest, storage, retrieval, and long-term preservation of digital objects.

however, implementation of digital preservation in irs is still in its infancy. as pointed out by karen markey and others, "it may not be surprising that there is a gap between the claims of stewardship, or aspirations for stewardship, by institutional repositories and their current ability to preserve digital assets. organizational models for digital preservation are only now emerging and they are quite diverse ... implementation of digital preservation in irs, however, is still in its infancy." [11] with ir software gradually integrating support for preservation, there seems to be more hope for ir managers in implementing digital preservation for irs. however, it is not sufficient to rely only on software, since various facets have to be considered when preserving digital content. as eliot wilczek and kevin glick state in their article, "it seems obvious that no existing software application could serve on its own as a trustworthy preservation system. preservation is the act of physically and intellectually protecting and technically stabilizing the transmission of the content and context of electronic records across space and time, in order to produce copies of those records that people can reasonably judge to be authentic.
to accomplish this, the preservation system requires natural and juridical people, institutions, applications, infrastructure, and procedures." [12] similarly, the challenges for digital preservation in the context of irs are also pointed out by nancy y. mcgovern and aprille c. mckay, including "little control over what is ingested into the ir; deposit of materials in less-optimal formats, with poor metadata and insufficient intellectual property rights clearance; and digital content that is difficult or costly to preserve." [13]

as the preservation of ir content is becoming a bigger concern among ir managers, an assessment of current practices is needed. in 2005, anne kenney and ellie buckley from cornell university conducted a "survey of institutional readiness" on developing digital preservation programs. the survey found that "only about one third of institutions have developed, approved and implemented digital preservation policies." [14] five years later, what is the status of digital preservation practices in the context of irs among arl libraries? the survey results presented in this paper attempt to find out.

findings and analysis

the survey contained six sections with a total of twenty-four questions, which aimed to investigate current practices in relation to the existence of digital preservation policies, digital preservation strategies, rights to preserve the content, content quality, and sustainability. as mentioned before, the survey was sent out to arl libraries. the arl website listed 125 libraries in may of 2010. of these, the authors limited their survey to the 72 academic libraries that had institutional repositories. fifty-two percent of the surveys were returned. of the surveys returned, 43 percent were completely filled out. the responses were collected and analyzed using online survey analysis tools and spreadsheets.

general questions

the first section of the survey covered two general questions. the first question asked what platform survey respondents used for their irs. dspace was the most popular, with 57.9 percent of survey respondents using it for their ir. other systems being used as ir platforms include digital commons (26.3 percent), contentdm (5.3 percent), and digitool (2.6 percent), with the remaining 7.9 percent choosing other. among the 7.9 percent who chose other, three respondents specified the platform they were using. one ir used a digital commons back-end with an xtf-based front-end, and another reported using a "thoroughly modified greenstone" system. the third respondent used various systems to make up their ir, including: etd-db for electronic theses and dissertations, vt imagebase for digital images, and contentdm for archival and scholarly collections.

the second question in this section asked whether preservation was part of the mission of the ir. for the vast majority, 97.4 percent, preservation was part of the mission of the ir. only 2.6 percent of respondents reported that preservation was not a part of the mission of the ir. one of the respondents who answered no commented that preservation would eventually be part of the mission of the ir. if respondents answered no, they were thanked for their time and exited from the survey. the rest of the questions were related to digital preservation, and most would not be applicable for an ir that did not have preservation as one of its goals.

preservation policies

developing preservation policies ought to be the first step toward guaranteeing preservation actions.
the strategies for preserving ir content, and the decisions about what content requires short-, medium-, or long-term preservation, should be driven by preservation policies. with ir content growing rapidly, it is important to look at how policies have been developed to guide the implementation of digital preservation for ir content. in this survey, 51.5 percent of respondents indicated that their irs have preservation policies. encouragingly, this result showed that there has been an increase in digital preservation policy development since the 2003-2005 cornell survey. for further investigation, the authors asked whether or not the ir provides long-term preservation for all submitted content. seventy-eight percent of respondents indicated that they are committed to providing long-term preservation for their ir content. in examining the policies provided by the respondents, the authors found that many institutions guarantee preservation only for certain file formats; 90.0 percent of policies clearly identified supported or recommended file formats, while the rest of the institutions briefly say they are committed to long-term digital preservation of all materials housed in their irs. from the policies provided, the most commonly supported file formats are listed in the appendix, table 1.

preservation strategies

the third section of the survey asked several questions about the strategies employed to preserve ir content. ninety percent of respondents reported that their ir content is at least backed up and stored in a secure storage system. sixty-three percent of the respondents reported that they had a checksum algorithm to detect errors in the data stored in their ir. however, other digital preservation strategies such as migration, emulation, and refreshing were reported by only half, or less, of the institutions surveyed (see figure 1). in the comments on this question, one respondent mentioned that the list of digital preservation strategies being used is a "developing list" and another respondent said that this was "in development." the survey went on to ask whether digital preservation strategies were handled internally by the ir system itself or with external systems and services. the data show that many institutions are taking advantage of some features of their ir system that support digital preservation. in addition, these libraries supplement the limited preservation features of most ir systems with external preservation systems and services (see figure 2). the comments reveal some of the external systems currently being used to support digital preservation. they include lockss, metaarchive, duracloud, irods, cdl curation services, and interpares, as well as bepress backup for digital commons repositories and campus it backup. checksums were mentioned as a preservation feature internal to the dspace repository system.

the next question asked whether the institution had a digital preservation system in place for its ir content and other digital collections. the largest percentage, 39.3 percent, had no digital preservation system in place. the next largest category, 32.1 percent, was those that had a private lockss network in place. another 28.6 percent had a custom-designed digital preservation system, and 10.7 percent shared the use of a digital preservation system with other institutions. it was encouraging to see that 58.6 percent of respondents reported recording preservation metadata about the digital objects in their irs.
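the checksum and preservation-metadata practices reported above can be illustrated with a short sketch. the python below computes a sha-256 fixity value for a deposited file and stores it in a minimal, hypothetical preservation-metadata record; a production repository would normally rely on premis metadata and the fixity machinery built into its platform (dspace's checksum checker, for instance) rather than a hand-rolled structure like this.

```python
# minimal sketch: record a sha-256 fixity value plus a few preservation-metadata
# fields for a deposited file, then re-check it later to detect bit-level errors.
# the record layout and file name are hypothetical, not a premis implementation.
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def make_preservation_record(path: Path) -> dict:
    return {
        "file": path.name,
        "fixity_algorithm": "sha-256",
        "fixity_value": sha256_of(path),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "provenance": "deposited via ir submission form",   # hypothetical value
    }

def fixity_intact(path: Path, record: dict) -> bool:
    return sha256_of(path) == record["fixity_value"]

if __name__ == "__main__":
    sample = Path("deposit.pdf")                  # hypothetical deposited file
    sample.write_bytes(b"%PDF-1.4 example bytes")
    record = make_preservation_record(sample)
    print(fixity_intact(sample, record))          # True while the bits are unchanged
```

checks of this kind are typically run on a schedule and their results logged alongside the object, which connects to the types of preservation metadata discussed next.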
some of the most frequently collected types of preservation metadata included technical information needed to preserve the resource, rights information, provenance or ownership history, and authorized change histories of the resource (see figure 3). however, consistency might be an issue, particularly if the ir is primarily collecting user-supplied metadata. one respondent pointed out that "not all collections have preservation metadata; it varies based on the sophistication of the collection." another respondent commented that they "are working on standards and best practices that address all types of metadata." in this section of the survey the authors also wanted to know whether the ir system could export all of its content and all of its metadata, since this is key for migrating to a new or better system in the future. most respondents, 96.7 percent, reported that the ir system was able to export all of its content, and 93.3 percent reported that their ir system was able to export all of its metadata. data about which ir systems could not export all their content and all their metadata was not collected.   rights and agreements copyright and intellectual property are also important issues to consider when thinking about the stewardship of scholarly materials. when open access (oa) was first conceived of as a solution to the scholarly communication problem, the ir was developed as a way to implement oa in academia. therefore, acquiring the rights from content contributors and copyright holders to distribute the content freely is an integral part of collecting content for irs. however, securing the necessary rights and agreements to preserve the materials is also important, because implementing long-term digital preservation strategies, such as migrating to new formats in the future, may necessarily involve changing the content to some extent. since preservation and access go hand in hand, the survey sought to find out whether irs have the necessary agreements in place with content contributors and copyright holders to preserve and provide access to submitted content. among the repositories surveyed, 72.4 percent indicated that they had made agreements with content contributors to provide preservation services for submitted content. these agreements were usually made during the deposit process. various types of agreements include online click through agreements, written agreements, policies, mous, and verbal agreements. however, making agreements with content contributors is only the first step, because for a significant portion of ir content, the content creator or contributor may not necessarily be the copyright holder. the survey results show that while most irs ask for permission from contributors to preserve content, not all will necessarily ask for the same permission from the copyright holders, such as publishers. when asked whether or not the ir secures permission from content contributors, 96.7 percent of respondents answered yes (see figure 4). however, only 56.7 percent indicated that they would ask for the same permissions from copyright holders if they were different from content contributors (see figure 5). the comments section revealed that many institutions do not consider providing copyright clearance on behalf of content contributors to be part of their responsibilities. 
most agreements provided by survey respondents state that content contributors need to warrant that they either own the copyright of the submitted content or that they have permission to submit the work if the copyright is owned by another party.   content policies the most important roles that irs play are to collect, manage, and disseminate the digital scholarship that their communities produce. collecting content is the first step to building an ir, and since their inception this is what ir managers have primarily focused their efforts on. digital scholarship can be collected in different ways, and how it is collected may affect its quality as well as the ability to preserve it. it is worth investigating how content is collected and how quality is ensured since different levels of preservation effort will be made depending on both the initial quality of the content and its format. eighty percent of irs reported that they have a collection policy in place. from the provided links to policies in the comments section, we discovered that collection policies mostly include selection criteria (such as the nature and type of the materials that can be submitted), recommended file formats, and procedures (such as withdrawal, access, and preservation.) as to how content is deposited in the ir, the survey asked about three methods: author self-archiving, by third party on behalf of the author, and by repository staff. the results showed that content is deposited in the ir by using all three methods in 92.0 percent of surveyed institutions. the next question asked survey respondents to indicate rough proportions for each type of deposit method. the answers varied widely, but the overall pattern showed that repository staff are still depositing much of the content that goes into irs. as we discussed, no matter how content is deposited in the ir, the quality of deposited content should be examined before digital preservation actions are considered, as the initial quality of deposited content can directly affect the success of digital preservation efforts. if the quality of the content cannot be assured, then significant problems may arise. these problems may include format obsolescence, poor quality or unreadable images or scans, insufficient metadata to manage and preserve the materials, etc. for this reason, the last question in this section examined whether or not irs have mechanisms in place to ensure the quality of submitted content. consistent with our expectations, 83.3 percent of respondents are using authentication mechanisms (see figure 6). authentication mechanisms allow an administrator to define resources that can be accessed and to track users as well as submitted content. in addition, 70.0 percent provide submission guidelines, and 66.7 percent indicated that repository staff review submitted content. these are all important actions to take in order to ensure that high quality content, worthy of preservation, is being submitted to the ir. results show that only 20.0 percent of respondents are also using a peer review system with their irs. it is not clear to us what content is subject to peer review, but we imagine that it would include the types of materials that typically employ peer review such as journal articles and conference proceedings. for previously published materials, most likely peer review occurred prior to deposit in the ir.   
sustainability

the last section of this survey looked at sustainability issues for irs, as these have a direct impact on the preservation of their content. the first question asked if the ir had sustainable long-term funding. at this point the majority of irs, 63.3 percent, do have sustainable long-term funding. however, there are still a significant number of irs whose funding situation is uncertain; 13.3 percent of respondents reported that their irs do not have sustainable long-term funding, and 23.3 percent reported that they did not know if their irs had sustainable funding. comments about this question ranged from "as long as the library decides it's a worthwhile project" to "the library's new strategic plan includes a long term commitment to the ir" and "it is funded out of the library budget." the next question asked if the ir had adequate and sustainable staffing. the data show that this is still a problem area for many irs. answers to this question are split right down the middle: 48.3 percent responded that they have adequate staffing, 48.3 percent responded that they do not have adequate staffing, and 3.4 percent said they did not know whether they had adequate staffing or not. one respondent commented that "at a keep-alive level, there is adequate staffing unless we lose staffing lines. as content increases and increased formats are handled that must be migrated, it's not clear that we could handle it with our existing staff." another reported that their "staffing is less than one fte," and still another commented that their "success means [they] need more than one full-time staff and one part-time student worker, but budget does not allow for it." numerous respondents had comments to make about this question, which further emphasizes the fact that adequate staffing levels are a concern for many ir managers. when asked what level of digital preservation the ir was currently providing, 20.0 percent responded that the ir was providing short-term preservation. short-term preservation was defined as access either for a defined period of time while use is predicted or until materials become inaccessible because of changes in technology. medium-term preservation was defined as continued access beyond changes in technology for a defined period of time but not indefinitely, and was reported by 36.7 percent. surprisingly to the authors, 43.3 percent reported that they were currently providing long-term digital preservation, or access to the content for an indefinite period of time. although 43.3 percent report that their irs are currently providing long-term digital preservation, numerous comments show a slightly different picture. one respondent wrote, "we continue to develop standards and best practices. long term preservation is definitely our goal." another said, "by the end of this year, we should have detailed preservation policies and procedures in place. as part of the strategic plan implementation, we will work on implementing preservation policies and procedures." still another commented, "we aim for long term preservation, but i think we need a better preservation plan in place." it is hard to tell with complete accuracy whether 43.3 percent are actually providing long-term preservation today, but these comments suggest that irs may be engaged in a planning process to provide long-term preservation rather than providing it in a fully operational way.
responses to the last survey question strengthen the theory that most irs are currently in a planning mode rather than a fully operational mode for providing long-term digital preservation. when asked if the ir was currently engaged in planning a process to provide long-term digital preservation of its content, 67.7 percent answered yes; 16.7 percent said no; and only 16.7 percent reported that they were already providing long-term digital preservation. comparing the 16.7 percent from this question against the 43.3 percent who reported that the ir was currently providing long-term preservation in the previous question suggests that long-term digital preservation is really more of a goal than a reality for most irs at this point.   discussion/conclusion the results of the survey show that an increasing number of research libraries have started to move digital preservation programs ahead by developing preservation policies. the growing awareness about making agreements and securing permissions for preserving ir content signifies another step forward, although some concerns may remain when the responsibilities of seeking permissions are assigned to content contributors. content contributors may be frustrated if they do not have sufficient knowledge of copyright issues or if they lack the time to secure the necessary permissions from copyright holders to self-archive their previously published works. these issues impede the ability of an ir to collect content as well as to preserve content. an innovative approach needs to be developed to address these concerns. assuring quality of content and collecting content in formats that can more easily be preserved is another area that might need more consideration. a list of supported file formats could offer preservation guidance to content contributors; however, it may narrow the scope of content for irs. collection policies, such as selection criteria and submission guidelines, are helpful for guiding decisions about preservation efforts and ensuring that the content of irs is worth the cost and effort that it will take to preserve. since the ir is still in a stage of development at many institutions, lack of sustainable funding and adequate staffing could present an obstacle in implementing successful digital preservation programs. it will be important to address these sustainability issues as part of the planning process for building a digital preservation program. despite these challenges it is very encouraging to see a large number of digital preservation policies being developed and an increasing number of digital preservation strategies being implemented for irs. we expect to see great steps forward in the next five years.   acknowledgements during the process of the survey and preparation of this paper, we received a lot of support from our colleagues and friends. here we would like to thank robert mcgeachin and sandra tucker from texas a&m university library for sharing their ir managers email list with us. we also want to thank our colleague stephen mcginty from w.e.b du bois library at university of massachusetts amherst, and dr. marta deyrup from seton hall university library for their insightful comments on the paper.   references [1] conway, paul. preservation in the digital world. washington, d.c.: council on library and information science, march 1996. http://www.clir.org/pubs/abstract/pub62.html. [2] jackson, julian. digital longevity: the lifespan of digital files. york: digital preservation coalition. 
http://www.dpconline.org/events/previous-events/306-digital-longevity.
[3] smith, abby. foreword to census of institutional repositories in the united states: miracle project research findings, by karen markey, soo young rieh, beth st. jean, jihyun kim, and elizabeth yakel. washington, d.c.: council on library and information science, february 2007. http://www.clir.org/pubs/reports/pub140/contents.html#fore.
[4] ibid.
[5] lynch, clifford a. institutional repositories: essential infrastructure for scholarship in the digital age. washington, d.c.: association of research libraries, february 2003. http://www.arl.org/bm~doc/br226ir.pdf.
[6] bailey jr., charles w. institutional repository bibliography. http://digital-scholarship.org/irb/.
[7] ibid.
[8] wheatley, paul. "institutional repositories in the context of digital preservation," microform & imaging review 33 (2004): 135-46. http://dx.doi.org/10.1515/mfir.2004.135.
[9] mcgovern, nancy y., and aprille c. mckay. "leveraging short-term opportunities to address long-term obligations: a perspective on institutional repositories and digital preservation programs," library trends 57, no. 2 (2008): 262-79. http://muse.jhu.edu/journals/library_trends/v057/57.2.mcgovern.html.
[10] waters, donald, and john garrett. preserving digital information: report of the task force on archiving of digital information. washington, d.c.: the commission on preservation and access, 1996, 68. http://www.clir.org/pubs/abstract/pub63.html.
[11] markey, karen, soo young rieh, beth st. jean, jihyun kim, and elizabeth yakel. census of institutional repositories in the united states: miracle project research findings. washington, d.c.: council on library and information science, february 2007. accessed may 27, 2010. http://www.dspacedev2.org/images/linkto/clir%20report.pdf.
[12] wilczek, eliot, and kevin glick. fedora and the preservation of university records. 2006. accessed may 2, 2010. http://dca.lib.tufts.edu/features/nhprc/reports/index.html.
[13] ibid.
[14] kenney, anne, and ellie buckley. "developing digital preservation programs: the cornell survey of institutional readiness, 2003-2005." august 15, 2005. accessed may 15, 2010. http://worldcat.org/arcviewer/1/occ/2007/08/08/0000070519/viewer/file1088.html#article0.

appendix

table 1. most commonly supported file formats.
text file formats: pdf/a (.pdf); plain text, us-ascii or utf-8 (.txt); rich text (.rtf); xml (.xml); comma-separated values (.csv); opendocument (.odt, .ods, .odp)
image file formats: tiff (.tiff); jpeg 2000 (.jp2); jpeg (.jpg)
audio formats: aiff (.aif, .aiff); wave (.wav)
video formats: avi (.avi); motion jpeg 2000 (.mj2, .mjp2)

figure 1
figure 2
figure 3
figure 4
figure 5
figure 6

about the authors

yuan li is the scholarly communication librarian at syracuse university (su). prior to joining the su library, yuan worked as digital initiatives librarian at the university of rhode island; digital repository resident librarian at the university of massachusetts amherst; digital initiative developer in the graduate school of library & information studies at the university of rhode island; and metadata developer in the special collections and archives unit of the university of rhode island library. yuan holds an mlis from the university of rhode island and a master of engineering degree in applied computer science from the national computer system engineering research institute of china.
she also holds a bachelor of engineering degree in computer science and technology from yanshan university (china).

meghan banach is the bibliographic access and metadata coordinator at the university of massachusetts amherst. in addition to providing leadership for the bibliographic access and metadata unit of the information resources management department, she is a member of the umass amherst scholarly communication team and focuses primarily on the management of electronic theses and dissertations in the institutional repository. she also chairs the umass amherst digital creation and preservation working group and serves on the metadata working group. her research interests center on managing, preserving, and providing access to digital materials. she holds an mlis with an archives management concentration from the simmons college graduate school of library and information science and a ba in history from mount holyoke college.

(on june 1, 2011, lead author yuan li's email address was added to this article.)

copyright © 2011 yuan li and meghan banach

beyond the scriptorium: the role of the library in text encoding
d-lib magazine, january 2002, volume 8 number 1, issn 1082-9873
suzana sukovic, rare book and special collections library, university of sydney, australia, [email protected]

(this opinion piece, which is based on the presentation the author gave at the conference "ecai 2001: towards an electronic cultural atlas", sydney, 12-13 june 2001, presents the opinions of the author. it does not necessarily reflect the views of d-lib magazine, its publisher, the corporation for national research initiatives, or its sponsor.)

introduction

the appearance of electronic text centers within libraries during the last several years has been a significant development for both the library and research communities. at the same time, electronic texts have become a great challenge to the traditional roles in the library, research and publishing communities. development of electronic textual resources means dealing with documents in new ways and on different levels, often involving work on a document's content through text encoding. this development challenges the library's assumed position in the research process. stronger involvement of libraries in text development will enhance the functionality of electronic texts and improve information retrieval. traditional library skills and tools used for cataloguing and indexing can be applied to textual encoding to identify geographical and personal names, dates, events, artifacts, etc., and to provide standardized access to this information. libraries have always provided this type of service, but some may see the application of the same skills to text encoding as crossing professional boundaries.

1. changing role of the library

libraries have always dealt with value-loaded documents, yet have been able to maintain their value-neutral positions. libraries provide access to information created by others, but they are not creators themselves; that is the common understanding, or it was until recently. with the development of information technology, various new tasks have been coming to libraries, some requiring new ways of interacting with documents.
a new wave of changes accompanies electronic text initiatives, positioning the library across at least three specialized fields: computer support, publishing and academic research. blurry though they are, boundaries do exist between these fields, and text encoding projects are testing and establishing these boundaries, mostly in practice.

1.1 libraries and computer support

electronic texts need considerable technological support, and the distinctions between the librarian's and the computer specialist's tasks are often not clear. librarians and computer support staff, however, have had long experience in dealing with technology together. electronic texts do not essentially change the existing division between the two professional areas.

1.2 libraries and publishing

the issues surrounding electronic publishing, on the other hand, are so controversial that there is hardly any aspect of it on which there is common agreement, and the definition of what constitutes electronic publishing is no exception. some authors distinguish between digitization and electronic publishing (mercieca 1999), but the national library of australia (nla) states that the nla "is operating on the basis that anything that is publicly available on the internet is published" (national library of australia 1999, point 3.3). in the new environment, traditional publishing is changing, and although there is much to be done in establishing standards for electronic publications, the library can consider making information "publicly available" to be its legitimate task. what is not completely clear is whether this legitimacy includes provision of whole content in addition to provision of traditional bibliographic information. presentation of the whole document rather than merely its description, whether called publishing or not, is not currently viewed as a library task. since the invention of print, librarians no longer produce documents by copying them. it used to be different, however. a look back in history shows a monk in a monastery library hand-copying a script, but also ornamenting it and accompanying the script with translations. (resemblance to the old practice is the reason why so many electronic text projects call themselves "electronic scriptoria".) as the monk copies the document by hand, he is not writing a tract about the text, "just" reproducing it. nevertheless, during the process he might unconsciously insert some elements of his own speech. perhaps he even decides to make a correction to the document or omits what he considers an unsuitable part. the monk is not, and cannot be, objective, but objectivity is not at issue. thanks to the monk's effort, we have copies of valuable manuscripts. some of these manuscripts containing the monk's alterations are significant cultural contributions in their own right. if we can glean any lessons from history, they might include thinking about enhancements we want to make to documents, interventions we want to prevent, and the value of copying texts.

1.3 libraries and academic research

the most sensitive questions about text encoding by libraries concern the new role that libraries may fill with regard to academic research. text encoding deals with content in a very direct way that appears to threaten the library's assumed neutrality in the research process.
the traditional role of the library in the research process is based on the assumption that libraries deal with information in a value-neutral way, and part of their neutrality is in the fact that librarians do not treat a document's content. librarians provide information, but someone else interprets the information and considers its value in meaning systems. there are also philosophical and practical issues that cause resistance, both in the library and academic community, to changing roles in academic research and to the new digital document technologies. one of these issues involves libraries' limited resources. nevertheless, thanks to their ability to reposition themselves when required, libraries have already evolved from document repositories to information centers. might they not also evolve to take a greater role in text encoding of documents? the role of any research library includes provision of access to information and support for research and learning. information access, research and learning take place in a cultural context that includes libraries, and culture has never been neutral. libraries have been assumed to play a neutral role in academic research; however, classification systems and indexing, collection development and acquisitions policies, user services and policy decisions are all colored by cultural values and knowledge. the fact that librarians established procedures, codes and practices did not mean their work became neutral. by the same token, the library profession may need to develop ways of dealing with new value-based roles rather than decide that these roles are incompatible with the library's assumed position as a neutral information provider. 2. library's contribution to electronic textual resources why should libraries be involved in text encoding? the briefest answer is because text encoding affects information access and preservation, which are traditional library tasks. libraries possess strengths and skills for information organization, access, and dissemination as well as the proven ability to collaborate across disciplines. 2.1 managing information the academic community sees the library as the most reliable place where electronic texts can be guarded (modern language association, guidelines e1 1997). the library has built a good reputation for dealing with information technology, and users want to go to the same place for new developments. the library has already gained considerable experience with computing in the humanities field. mcgann describes the close association between computerization in the humanities and in libraries, and he says that this association occurred for one "simple and obvious reason: material demands have driven libraries to study and exploit computerized tools, which allow these research facilities to gain a measure of control over the massive amounts of data they are called upon to manage" (mcgann 1996, "the return of the library", 1st par.). yet, when it comes to electronic textual resources, libraries have not yet contributed their fundamental skills and tools for managing data. 2.2 serving the research community one part of the library's contribution to the research community is in providing good collections and access to the information held in those collections. another contribution is in developing a skill base, firstly within the library and then in the university community. 
universities that have well-established electronic text centers are already developing significant experience in supporting researchers' encoding projects, either by providing instructions or by participating in these scholarly and teaching projects.

2.3 serving one instead of many?

there is an underlying assumption that electronic resources are valuable to researchers and that libraries want to find new and better ways to support research. however, it is important to acknowledge that digital collections, electronic texts especially, require significant resources while serving only part of the academic community. electronic text centers are primarily engaged in developing resources for the humanities. the literature, well supported by anecdotal evidence, suggests a split among researchers in the humanities along technological lines (olsen 1992; deloughry 1993; katz 1999; sukovic 2000). the decision to make a big investment that serves only a part of the research community is being questioned. while we cannot go into the whole argument here, it is important to acknowledge this doubt and address it briefly. research methods develop over time, with or against technological developments. research libraries cannot ignore significant advancements in the research methods used by the academic area the library is supporting. when determining what their contribution to research should be, based on how electronic texts are used, libraries need to take cost/benefit factors into consideration.

3. approach to encoding

once a library has decided to become involved in text encoding, the crucial question arises regarding which method of encoding is best. there is little doubt that using sgml or xml according to the text encoding initiative (tei) guidelines is the library's best choice for developing scholarly electronic texts.

3.1 existing practice

over the last decade, electronic text centers have established a number of standard approaches to encoding. the levels of encoding accepted at each center were adopted based on many factors, some general (like the enormous number of texts to be converted) and some local (like the organization's commitment to electronic texts). various accounts of how electronic text centers were established provide evidence of procedures and practices being established in response to numerous considerations and demands.

3.2 draft guidelines

the first draft guidelines for tei encoding in libraries were written in 1999 (friedland et al. 1999). the guidelines are significant for several reasons. firstly, they give librarian-encoders a sense of community. secondly, standards and codes of practice are necessary to avoid big mistakes and wasted time. thirdly, the draft guidelines provide researchers with a reference point to shape expectations and plan their research projects. the draft guidelines recognize five levels of encoding. levels 1-4 require no expert knowledge of content, but level 5 requires scholarly analysis. level 1 starts with fully automated conversion and encoding. the complexity of encoding increases at each level up to level 4, which includes basic content analysis. level 5 is reserved for scholarly encoding projects and requires subject knowledge: semantic, linguistic, prosodic and other elements beyond the structural level are encoded. the draft guidelines are brief and do not give detailed specifications for different levels of semantic encoding. they do, however, provide a framework and important orientation points.
3.3 traditional library skills taking stock of traditional library skills and tools, another important skill that librarians bring to text encoding is the library's well-established practice of naming documents and their content. libraries have traditionally dealt with recognizing and naming various references to people, places, organizations, objects, events, etc. semantic interpretation has been a regular library practice in assigning subject headings, choosing regularized forms of names, identifying languages used in a publication, and so on. a huge apparatus of codes and rules, thesauri, authority files and labelling systems, has been developed to support tasks of recognizing important information in the document and putting it in an accessible standardized form. the scholarly community depends on the library's interpretation of authorship and the content of whole documents, even corpora. should it not be acceptable for the library to continue such interpretation through text encoding at the word or phrase level? information professionals already produce good indexes describing content down to the paragraph level, and some researchers have expressed a desire for an even greater number of detailed abstracting services, bibliographies and catalogues. this certainly does not mean that librarians should venture into scholarly interpretation. it only means that they should apply their best tools and skills to this new area of endeavor. another thing to consider is whether the libraries can devote staff and resources to such fine encoding, but that is a matter for grant administrators, and university and library managers. in the long run, it might prove cheaper and more efficient to have librarians marking occurrences of personal and geographic names in texts and have academics continue with proper scientific research than for academics to perform both tasks. there is no reason why librarians should leave behind their experience in using large authority files when they start encoding document content. additionally, in order for academics to complete groundwork for their research (which is their main interest), it is a waste of time for them to have to learn what librarians already know about indexing and using authority files. instead, it would be more beneficial to everyone if librarians from electronic text centers got together with their colleagues, cataloguers and indexers, to see how best to bring together the skills from their professional spectrum. 4. encoding example the text encoding of diary of a trip to australia, 1897, by evelyn louise nicholson (nicholson 1999), serves as an example for the points raised herein. the diary is interesting mostly as a historical document. although it might be possible to consider literary, socio-psychological or any other characteristics of the text, they are, firstly, well in the scholarly domain. secondly, historical perspective is likely to be a background for other perspectives. although i could not provide historical analysis, i wanted to make the text searchable as a document on australian people and places of the time. important parts of any historical document are references to dates, people, places, events and objects of historical significance. the need to encode the dates in the diary was obvious. the text is not in a strict chronological order; therefore employing a mechanism to trace dates was useful. 
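as a rough illustration of the kind of date and name markup described in this example, the sketch below wraps day words and personal names from a small authority list in tei-style elements (persName, date and rs are standard tei elements). the authority entries, the regularized keys and the use of python are assumptions made for illustration only; this is not the markup actually produced for the nicholson diary.

```python
import re

# hypothetical authority list: surface form -> regularized key
PERSONS = {
    "Evelyn": "Nicholson, Evelyn Louise",
    "the Professor": "Nicholson, William",
}

DATE_WORDS = r"\b(today|monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b"


def encode(text):
    """Wrap day words and known personal names in TEI-style elements."""
    # mark day words as <date> elements; contextual interpretation is left to the encoder
    text = re.sub(DATE_WORDS, r"<date>\1</date>", text, flags=re.IGNORECASE)
    # mark known persons with a regularized key; an unidentified person would
    # instead get a bare <rs type="person"> element
    for surface, key in PERSONS.items():
        text = text.replace(surface, f'<persName key="{key}">{surface}</persName>')
    return text


if __name__ == "__main__":
    sample = "Today Evelyn visited the University of Sydney with the Professor."
    print(encode(sample))
```

in practice an encoder would review each automatic match in context, since substitution alone cannot resolve ambiguous references, which is exactly the kind of judgment described in the next paragraph.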
i marked up references to "today", days of the week and similar words when they marked the beginning of a new day or a set of events, and when their interpretation in the context was unambiguous. references to people, places, organizations and boats were also marked. the diary has a number of references to the university of sydney, and since these references are of particular interest for the collection, i decided they should be distinguished from other types of references. regularized forms of personal names and of the university of sydney departments and buildings were provided when possible. personal names were regularized only when the identity of a person was unambiguous. in all other cases, references to people were marked with the attribute "person", thus making it possible to build a list of all references to people. references to the same person are brought together, even if the name was not regularized.

5. implications

this type of text encoding is time-consuming, especially if applied to all potentially relevant information. however, some texts and certain information in them warrant the attention of researchers, and rich encoding of core texts should not go amiss. enriched electronic texts also have the potential to showcase the university as an electronic publisher. in an era when private companies work hard to come up with another value-added product, librarians do not have to look far to take advantage of growing opportunities for which their profession has so naturally prepared them. the important question is not whether libraries should deal with document content; they always have dealt with it. the questions are: how can librarians extend their skills to textual encoding, and who is going to support this kind of work? although difficult to answer, these are not philosophical questions but rather practical questions involving the use of standards and the availability of funds for the new task. additionally, there is the strategic issue of how to form new alliances of different library specialists and researchers. finding solutions to these issues could reap manifold benefits: significantly better information retrieval, electronic texts as better research tools, academics freed from doing groundwork themselves, raised status of research libraries, and a competitive product from the university. with centuries of experience in reproducing, cataloguing, classifying and indexing documents, as well as in information design and retrieval, librarians are well positioned to take a role in text encoding: to move beyond the scriptorium and beyond traditional library roles.

references

deloughry, thomas j. 1993. survey of language professors finds extensive use of computers. the chronicle of higher education 39, 33: a27.
friedland, leeellen, nancy kushigian, christina powell, david seaman, natalia smith, and perry willett. 1999. tei text encoding in libraries: draft guidelines for best encoding practices: version 1.0 (july 30, 1999). [presented at the joint annual meeting of the association for computers and the humanities and the association for literary and linguistic computing in june 1999]. available from http://www.indiana.edu/~letrs/tei/; accessed 19 august 1999.
katz, stanley n. 1999. a computer is not a typewriter, or getting right with information technology in the humanities. lecture in the digital directions speakers series, university of virginia, 4 february. accessed 3 june 1999.
mcgann, jerome. 1996. radiant textuality. accessed 13 december 2001.
mercieca, paul. 1999. from digitisation to publishing: using sin on disc as a case study. in conference and exhibition: 1999 & beyond: partnerships and paradigms, 6-8 september 1999, sydney, australian library and information association, reference and information services section. accessed 10 september 1999.
modern language association of america. committee on scholarly editions. 1997. guidelines for electronic scholarly editions. accessed 16 september 1999.
national library of australia. 1999. guidelines for the selection of online australian publications intended for preservation by the national library of australia. accessed 18 august 1999.
nicholson, evelyn louise. 1999. diary of a trip to australia, 1897 [computer file]. sydney: university of sydney library, scholarly electronic text and image service. accessed 14 january 2002.
olsen, mark. 1992. signs, symbols and discourses: a new direction for computer-aided literature studies. accessed 19 august 1999.
sukovic, suzana. 2000. humanities researchers and electronic texts. lasie 31, 3: 5-29. accessed 12 december 2001.

copyright 2002 suzana sukovic
doi: 10.1045/january2002-sukovic

the use of digital object repository systems in digital libraries (dorsdl2): ecdl 2008 workshop report
d-lib magazine, november/december 2008, volume 14 number 11/12, issn 1082-9873
gert schmeltz pedersen, technical information center of denmark, technical university of denmark; kåre fiedler christiansen, the state and university library, denmark; matthias razum, fiz karlsruhe

the 2nd european workshop on the use of digital object repository systems in digital libraries (dorsdl2), held in conjunction with ecdl 2008, took place on 18 september 2008, in aarhus, denmark. the workshop was attended by 40 people (including speakers) from 10 countries. (the 1st dorsdl workshop took place on 21 september 2006, in alicante, spain.) digital libraries and digital repositories are – in many ways – two sides of the same coin. the dorsdl2 workshop brought together researchers and practitioners from both fields, and aimed to transfer knowledge and build connections. the target group for the workshop comprised primarily repository researchers, developers, and managers. the workshop addressed both experiences and novel concepts with a technological and/or organizational stance. it covered a variety of practical digital library development issues and how their resolution can (or cannot) be carried out in the context of the digital object repository at hand. the full-day workshop comprised three sessions, "applications", "architectures", and "search", each with an invited speaker. these sessions were followed by a concluding discussion. the first session ("applications") was opened by the invited speaker sandy payette (executive director, fedora commons, usa).
in her presentation "repositories: disruptive technology or disrupted technology?", sandy considered the role of digital object repositories in the context of cloud computing and highly distributed systems. after a first generation of systems focusing on institutional repositories and digital library applications, and a second generation embracing web 2.0 techniques like annotations or collaborative filtering, we are currently on the verge of a third generation, which will cover the data-intensive aspects of e-science and e-research. this requires a shift from "repository islands" towards distributed, web-oriented, open, and interoperable infrastructures. in such an environment, "the repository" might no longer be a well-defined place in the library, but rather an entry point to a highly distributed fabric of storage and services. the recently released fedora 3.0 software provides a logical next step in this direction. the next speaker was lodewijk bogaards (data archive and networked services, the netherlands). in his presentation "easy on fedora – using escidoc; turnkey access?", lodewijk pointed out that current repository systems lack a kind of middleware layer with higher-level services that ease the implementation of more complex applications. his use case, an archival software for research data sets from the arts, humanities and social sciences, showed that adding middleware to the software stack helps developers to concentrate on their domain-specific business logic instead of having to "re-invent the wheel". the middleware of their choice, the freely available escidoc infrastructure, provided them with a set of predefined content models, integration into their authentication system, elaborated search capabilities, and more. the last speaker in this session was elsebeth kirring (the state and university library, denmark). in her talk "building a user oriented automatically generated custom user interface", she described the challenge to ingest and describe non-textual objects into their repository in an efficient and adaptive process. based on the latest release of the fedora commons software, her team created a content-model driven software that generates user interfaces for capturing metadata. each content model represents one of the different object types in the repository. the second session ("architecture") began with the invited speaker herbert van de sompel (los alamos national laboratory) with a talk titled "what to do with a million books: adore for storage and access". herbert presented the impressive work done in los alamos to combine components based on standards to a system that was hugely scalable and extensible. the fact that the system is based on open standards made the system very well suited for interoperability with other systems, and thus the talk tied in well with the talks from the previous sessions. in today's world of ever-increasing amounts of data, the challenge of storing, referencing and retrieving data is becoming a huge challenge, which adore addresses well. at the end of the talk, ryan chute presented djatoka, an impressive open source jpeg 2000 image server developed for, but independent of, adore. the djatoka server provides on the fly dissemination of zoomed, scaled, and watermarked images, and much more. the next presenter was dave tarrant (university of southampton), speaking about "applying open storage to institutional repositories". he presented an architectural view on preservation. 
rather than looking at individual pieces of software, the preserv2 project tried to look at an overall architecture that facilitated preservation. this has led to the concept of "open storage", a self-checking, self-healing file-system, based on open components, both for hardware and software. having open storage under your repository, whatever software you choose, will provide the benefits of a reliable storage component, and will give you a single point where you can do your storage-related preservation tasks. the third speaker was asger blekinge-rasmussen (the state and university library, denmark) on "digital repositories and data models". he presented work done on fedora to be able to describe content models in a precise and machine-readable manner. fedora provides great possibilities for having data of very different kinds in the same repository, but has only just in fedora 3.0 scratched the surface of having a uniform way of describing what kinds of data you actually have. asger presented work on providing a description based on the owl and xml schema standards, which extends the fedora description of content models. use cases are validation, automated applications (like the user interface described by elsebeth kirring in a previous session) and exchange between institutions of content model descriptions. as the last speaker in the second session, alex wade (microsoft research) presented "an introduction to the microsoft research-output repository platform". the microsoft research-output platform is software in development for registering, describing, and accessing research output. research output was broadly defined, and included, for instance, publications, files, and datasets. perhaps the most interesting part of the software was the widespread use of rdf (resource description framework) for describing relations of various kinds citations, authoring, etc. this was indexed in a microsoft sql server with rdf-extensions (famulus) that was reported to perform much better than traditional triple stores, but providing much of the same functionality. this was used for interesting browsing of data and their relationships in a graphical fashion. the platform implemented many standards to facilitate interoperability, and implementation of more standards is planned, including oai-ore. the third session of the workshop ("search") began with the invited speaker robert tansley (google inc.), who talked about "science in the cloud: google and sharing huge datasets". he discussed how to bring large scale results from e-science to the attention of a wider audience, with an approach similar to cloud computing. christian kohlschütter (isearch it solutions gmbh) talked about "enhanced federated search for digital object repositories (dor)". the aim of this project is to enable users to perform searches in multiple dors simultaneously, with all the features of a single dor system, and to enable search between heterogeneous/incompatible dors without changing the underlying workflow. a reference implementation based upon lucene has shown good results with efficient faceted browsing functionality. finally, gert schmeltz pedersen (technical university of denmark) talked about solutions to filtering of search results by access constraints, as defined by xacml policies, in order to show only those search hits that the user is actually permitted to read. 
post-search filtering requires a request to the xacml mechanism for each hit, and the total number of permitted hits is only known at the end, which is a costly procedure, especially when few hits out of a large number are permitted. in-search filtering requires additional index fields and query rewriting, that is, a logical partitioning of the index. pre-search filtering requires a physical partitioning of the index and selection of the pertinent index at query time. both in-search and pre-search filtering face the challenge of exact correspondence between the filtering mechanism and the xacml policies. a preliminary implementation within the fedora generic search service ("gsearch") facilitates further evaluation.

concluding discussion

repositories evolve from stand-alone systems to nodes in an academic knowledge network. interoperability not only at the metadata level but also at the object level is becoming more and more important. protocols like oai-ore and standardized deposit interfaces will allow for the tight integration of repositories with known products. moving from publications and dissertations to research data sets and other, more demanding data types, as well as to large-scale repositories, requires a more flexible storage strategy. content-addressable storage systems (e.g., sun's st5800 "honeycomb"), the grid, and cloud computing (e.g., amazon's ec2 and s3, as well as a local cloud at your university) are surfacing as options for repository setups. in the future, a repository may just be a service overlay to a distributed storage architecture. at the same time, it is getting more and more important to link objects across institutional and repository boundaries. the publication may be deposited in your institutional repository, whereas your visualized data is stored in a specialized image repository, and the raw data resides in the grid. allowing users to seamlessly navigate from one repository to another while maintaining the context of the objects and the meaning of the relations will be one of the next big challenges. digital object repositories have a bright future. however, tomorrow's systems will differ substantially from the systems we know today.

copyright © 2008 gert schmeltz pedersen, kåre fiedler christiansen and matthias razum
doi: 10.1045/november2008-pedersen

a method for identifying personalized representations in web archives
d-lib magazine, november/december 2013, volume 19, number 11/12
mat kelly, justin f. brunelle, michele c. weigle, and michael l. nelson, old dominion university, {mkelly,jbrunelle,mweigle,mln}@cs.odu.edu
doi: 10.1045/november2013-kelly

abstract

web resources are becoming increasingly personalized: two different users clicking on the same link at the same time can see content customized for each individual user. these changes result in multiple representations of a resource that cannot be canonicalized in web archives. we identify characteristics of this problem and present a potential solution to generalize personalized representations in archives.
we also present our proof-of-concept prototype that analyzes warc (web archive) format files, inserts metadata establishing relationships, and provides archive users the ability to navigate on the additional dimension of environment variables in a modified wayback machine.

introduction

personalized web resources offer different representations [8] to different users based on the user-agent string and other values in the http request headers, geoip, and other environmental factors. this means web crawlers capturing content for archives may receive representations based on the crawl environment, which will differ from the representations returned to interactive users. in summary, what we archive is increasingly different from what we as interactive users experience. web servers often provide lighter-weight representations to mobile browsers and the larger, full-feature representations to desktop browsers. content viewed from a mobile browser is often different from content viewed from a traditional desktop browser [9, 28]. this allows mobile devices to more easily and more quickly browse the web. with the increasing prevalence of mobile browsers on the web (50% to 68% of sites have mobile representations [25]), it is becoming important to capture these mobile representations of resources. mobile pages often contain links to additional resources instead of embedded text and often reduce the number of images embedded in the page [9]. for example, the mobile representation of http://espn.go.com/ contains a section on espn videos, while the desktop representation does not. when http://espn.go.com (the "original resource", identified by its uri-r) is accessed, it redirects to http://m.espn.go.com, effectively giving two separate but related uri-r values that go into the archive. subsequently, the uris of their archived versions ("mementos", identified by uri-m) are indexed separately, and the relationships between uri-rs and uri-ms are not presented to the user. to quantify the differences, the desktop representation contains 201 links, while the mobile representation contains only 58 links. these link sets are mutually exclusive, with the mobile representation linking to specific resources (such as box-scores and gamecasts) while the desktop representation links to higher-level resources (such as narratives that include box-scores and may have links to gamecasts). a user may review news articles or other content on a mobile device and later be unable to locate the article in an archive. to capture and record the complete set of content at http://espn.go.com, each of these different representations, both mobile and desktop, needs to be stored in web archives. heritrix [18], the internet archive's crawler, offers archivists the ability to modify the user-agent string to simulate a variety of browsers during archiving. heritrix can crawl the mobile web by setting its user-agent string to a mobile browser. this can potentially lead to multiple representations of the same content being captured. if a desktop and a mobile representation of the same resource are captured simultaneously, they will potentially collide in an archive. as archiving expands into the mobile and other dynamic domains, archives will contain representations generated with a variety of environmental influences.
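the kind of link-count divergence quantified above can be reproduced, at least approximately, with a few lines of code. the sketch below is not part of the authors' prototype: it fetches one uri with a desktop and a mobile user-agent string and counts the distinct links in each response using only the python standard library. the user-agent strings are illustrative assumptions, and the live site will of course have changed since the paper's counts were taken.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# illustrative user-agent strings; any desktop/mobile pair would do
DESKTOP_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
MOBILE_UA = "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)"


class LinkCounter(HTMLParser):
    """Collect the distinct href values of <a> elements in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)


def count_links(uri, user_agent):
    """Fetch a URI with the given User-Agent and return (final URI, link count)."""
    req = Request(uri, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
        final_uri = resp.geturl()  # follows redirects, e.g. to an m. host
    parser = LinkCounter()
    parser.feed(html)
    return final_uri, len(parser.links)


if __name__ == "__main__":
    for ua in (DESKTOP_UA, MOBILE_UA):
        final_uri, n = count_links("http://espn.go.com/", ua)
        print(f"{ua[:40]:40s} -> {final_uri} ({n} distinct links)")
```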
therefore, it is no longer sufficient to only navigate archives in the temporal dimension; archives must also provide users the opportunity to understand how a representation was crawled and to navigate representations by their environmental influences. the factors influencing the representations seen by a crawler or user need to be recorded and presented to the user viewing the captured representation. in this work, we explore the issue of personalized representations in web archives, propose a framework to solve this problem, and present a proof-of-concept prototype that integrates personalized representations. we study live resources (identified by uri-rs) seen by users and propose methods for mapping archived representations, called mementos (identified by uri-ms), to a canonical representation. this prototype extends the description of mementos from only "when" they were archived (temporal dimension) to "where" and "how" (geoip and browser environments). users can then browse between mementos based on temporal or environmental dimensions.

personalized, anonymous representations

dynamic and personalized representations of web 2.0 resources that are generated by technologies such as javascript can differ greatly depending on several factors. for example, some sites attempt to provide alternate representations by interpreting the user-agent portion of the http get headers and using content negotiation to determine which representation to return. we ran a pair of limited crawls of the cnn.com front page with heritrix 3.1 and then accessed the mementos captured by heritrix with a desktop mac and an android phone. the first crawl captured the cnn.com front page and specified a desktop version of the mozilla browser as the user-agent in the header string, as seen in figure 1. the resulting web archive (warc) file [26] is viewed in a local installation of the wayback machine [29] and is shown in figures 3(a) and 3(c). the second crawl captured the cnn.com front page and specified an iphone version of the mozilla browser as the user-agent string in the header, as seen in figure 2. the resulting warc, as viewed in the wayback machine, is shown in figures 3(b) and 3(d). the mobile and desktop representations differ in archives, but their relationship as permutations of each other is neither recorded nor seen by users; a user of the wayback machine may not understand how these representations are generated since they are identified by the same uri-r. we refer to these differing representations of the same uri-r built with differing environments as personalized representations of the resource r. the headers in figures 1 and 2 reference the user-agent string with http://yourdomain.com, which is a placeholder for the uri of the party for whom the crawl is being executed. for example, a crawl originating from old dominion university's computer science department would read http://www.cs.odu.edu/.

warc/1.0
warc-type: request
warc-target-uri: http://www.cnn.com/
warc-date: 2013-03-05t16:57:00z
warc-concurrent-to:
warc-record-id:
content-type: application/http; msgtype=request
content-length: 266

get / http/1.0
user-agent: mozilla/5.0 (compatible; heritrix/3.1.0 +http://yourdomain.com)
connection: close
referer: http://cnn.com/
accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
host: www.cnn.com
cookie: cg=us:--:--; cg=us:--:-

figure 1. http get request from heritrix with the desktop mozilla user-agent.
warc/1.0
warc-type: request
warc-target-uri: http://www.cnn.com/
warc-date: 2013-03-05t16:38:08z
warc-concurrent-to:
warc-record-id:
content-type: application/http; msgtype=request
content-length: 400

get / http/1.0
user-agent: mozilla/5.0 (iphone; u; cpu iphone os 4_0 like mac os x; en-us) applewebkit/532.9 (khtml, like gecko) version/4.0.5 mobile/8a293 safari/6531.22.7 (compatible; heritrix/3.1.0 +http://yourdomain.com)
connection: close
referer: http://cnn.com/
accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
host: www.cnn.com
cookie: cg=us:--:--; cg=us:--:-

figure 2. http get request from heritrix with the iphone mozilla user-agent.

these examples illustrate the potential for collisions of personalized representations with the same uri within web archives. the potential exists for a mobile and a desktop representation of a page (or of constituent and embedded resources) to be captured simultaneously, and therefore be indistinguishable. the live web version of cnn.com is identified by http://www.cnn.com/ regardless of the user-agent string and resulting representation. while the distinction between representations could be accomplished with the vary http header [8], which would alert caches and clients that the representation is personalized, cnn does not employ this header. the result is that both the mobile and desktop versions use http://www.cnn.com/ for their uri-r values.

figure a: the cnn.com memento when crawled by heritrix with a desktop mozilla user-agent, accessed from a mac.
figure b: the cnn.com memento when crawled by heritrix with an iphone mozilla user-agent, accessed from a mac.
figure c: the cnn.com memento when crawled by heritrix with a desktop mozilla user-agent, accessed from an android phone.
figure d: the cnn.com memento when crawled by heritrix with an iphone mozilla user-agent, accessed from an android phone.
figure 3: mementos differ based on the parameters influencing the representations at crawl/capture time and the devices used to access the mementos.

some sites provide local news and weather content depending on the geoip of the requester. for example, a user requesting http://www.nbcnews.com/ without an identifiable ip (via an anonymous browser) will get news and weather from new york, ny, with a request for the user to enter a local zip code (figure 4b). alternatively, a user requesting the page from suffolk, virginia will receive news and weather from neighboring portsmouth, virginia (figure 4a).

figure a: when browsing from suffolk, va, nbcnews.com shows headlines from suffolk and portsmouth, va.
figure b: when browsing anonymously, nbcnews.com shows headlines from new york city, ny.
figure 4: the live versions of nbcnews.com differ based on the ability to interpret the geoip of the requester.

figure 4 contains examples of personalized representations that pose problems for archives. the factors that influence the representations need to be documented, and users of archives should be able to browse mementos not only by time, but also by which representations are available. the representations in figure 4 demonstrate that the impact of environment variables is not limited to the look and feel or stylesheets of a resource. the content in the page (specifically, the local news stories and weather shown in the representations in figure 4) changes depending on the environment, in this case the geoip. for this reason, we need to link to the original request, which is not referenced in http response headers.
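a quick way to see whether a site announces this kind of negotiation is to inspect the vary response header mentioned above. the sketch below is only an illustration and is not part of the paper's prototype; it uses the python standard library, and the uris echo the paper's examples, so today's responses may well differ from what the authors observed in 2013.

```python
from urllib.request import Request, urlopen


def vary_header(uri, user_agent="Mozilla/5.0 (compatible; example-check)"):
    """Fetch a URI and return its Vary response header, if any.

    A value such as 'User-Agent' would tell caches and clients that the
    representation is negotiated on the request's User-Agent string.
    """
    req = Request(uri, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=30) as resp:
        return resp.headers.get("Vary")


if __name__ == "__main__":
    for uri in ("http://www.cnn.com/", "http://www.nbcnews.com/"):
        print(uri, "->", vary_header(uri) or "(no Vary header)")
```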
related work

the prevalence of dynamic web 2.0 content on the web has increased [19, 22]. several papers have identified the roles of javascript and other web 2.0 technologies in a social and web environment that is becoming increasingly unarchivable [4, 6, 15, 23, 35]. the mobile web is also growing, with mobile devices contributing 18% of all web traffic [13]. additionally, a memento client is available for the iphone [30], which has the potential to highlight the problems identified in this paper. several efforts have investigated how to capture dynamic content, although not with archival intent. these efforts, however, contribute greatly to the understanding of how dynamic resources can change on the client. livshits et al. at microsoft research have performed extensive research in the monitoring and capture of client-side javascript [10—12, 16]. most of this work has targeted debugging and security monitoring. other works have included the interpretation of client-side state for crawling purposes [1, 5, 14]. several efforts have focused on monitoring web 2.0 resources on the client [2, 3, 10, 11, 17, 21]. other work has studied the capture of dynamic content for search engine optimization or search engine indexing [24].

identification models

it is important to understand the potential methods of identifying a memento that has personalized representations. there are three approaches that we considered: identification in the uri, content negotiation, and modifications to the client.

the first option would modify the uri that identifies an archived version of a resource to include information about the representation using hashbang (#!) uris [7, 27, 34]. the information in the hashbang portion of the uri would identify the specific representation state, such as a mobile or desktop version. however, this uri does not generalize and does not integrate with existing archives, which index resources by uri.

the second option, content negotiation similar to that performed with the memento framework's memento-datetime [20, 31—33], would specify a uri-m to access and also specify the environment variables (such as user-agent or geoip) to use when picking a representation of the resource. however, this shifts the responsibility away from the server, requiring a specialized client.

the third option starts with post-processing of the content captured by the crawler. the post-processor needs to provide metadata embedded in the memento that specifies the environment variables that went into creating the captured representation. the client that replays the representation then needs to indicate to archive users which representation is being shown, and to provide navigation controls that allow movement between representations captured with different environment variables, effectively giving the user the ability to navigate mementos by environment as well as by the traditional temporal parameters. this is the most complex of the options provided, but it is also the most extensible and effective. this method also requires the most modification of existing tools, and is likely only effective for a subset of archives until widespread adoption takes place. however, it provides the highest degree of information to be provided to, and controlled by, users. we selected this third option for implementation.
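for comparison with the selected approach, the second (content negotiation) option can be made concrete with a sketch of a hypothetical request. accept-datetime is the real memento request header that pairs with memento-datetime [20, 31—33]; the x-accept-user-agent and x-accept-geoip headers below are invented here purely for illustration, since no header vocabulary for environmental dimensions is defined above, and the timegate uri is likewise hypothetical.

```python
# a hypothetical sketch of option 2: dimension-aware content negotiation.
# accept-datetime is a real memento header; the x-accept-* headers and the
# timegate uri are illustrative assumptions, not an existing api.
import requests

timegate = "http://archive.example.org/timegate/http://www.cnn.com/"  # hypothetical

headers = {
    "Accept-Datetime": "Tue, 05 Mar 2013 16:57:00 GMT",  # temporal dimension
    "X-Accept-User-Agent": "mobile",                     # hypothetical environmental dimension
    "X-Accept-GeoIP": "US-VA",                           # hypothetical environmental dimension
}

# a dimension-aware timegate would redirect to the uri-m whose capture
# environment best matches the requested dimensions.
response = requests.get(timegate, headers=headers, allow_redirects=False)
print(response.status_code, response.headers.get("Location"))
```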
we created a script to identify representations of resources that were generated with different environment variables, and to modify warc files so that these different but related representations are identified as personalized representations. then we modified a local installation of the wayback machine to allow for navigation between these personalized representations. the details of these efforts are explained in the remainder of this paper.

during consideration of the impact these changes will have on services, we kept the memento framework in mind [32]. we can modify memento timemaps to include the relationships identified by web archives. this effectively utilizes the services that already consume information from the archives to carry related representations from the archives to the users. we propose a multipart http response that returns multiple related timemaps as part of a single request. when a user requests a timemap for http://www.example.com/, the archive discovers that there is a mobile representation of the resource, http://m.example.com/, and returns the timemaps for the original representation and the mobile representation of example.com as a multipart message, with each timemap sent as a separate part.

identifying personalized representations

we identified related representations by analyzing the warc records' metadata and finding uris that are derivations of one another (such as http://m.example.com/ and http://www.example.com/). we read each warc and, from its warcinfo record, extracted the user-agent string and the ip of the crawler that produced the file's records. we then extracted each uri-r from the response records and wrote the metadata records. we wrote a post-processor script to discover personalized representations given a directory of warc files, simulating the analysis of an entire local archive. the script reads the individual warc records and extracts information about the mementos such as the user-agent used, the crawler settings, and the urn of each record. we then used uri guessing and user-agent analysis [25] to find related warc records based on the extracted information. a uri such as http://m.example.com/ is treated as a personalized representation of http://example.com/. similarly, http://www.cnn.com/ when accessed with a mobile browser (and associated user-agent string) is treated as a personalized representation of http://www.cnn.com/ accessed with a desktop browser.

once we identified the personalized representations of the resources, we created a metadata record for each personalized representation. in the record (shown in figure 5), we used values to identify the user-agent (http-header-user-agent), the submitting user's (or crawler's) ip address (contributor-ip), the urn of the personalized representation of this resource (hasversion), and the geographic location of the contributor (geo-location).

warc/1.0
warc-type: metadata
warc-target-uri: http://www.cnn.com/
warc-date: 2013-03-05t16:57:00z
warc-concurrent-to:
warc-record-id:
content-type: application/warc-fields
content-length: 216

http-header-user-agent: mozilla/5.0 (compatible; heritrix/3.1.0 +http://yourdomain.com)
contributor-ip: 192.168.1.7
hasversion:
geo-location: multicast
description:

figure 5. a sample metadata record that creates a link to the mobile representation of the desktop cnn.com resource.

the metadata records are inserted into the warc file containing the personalized representation to identify all other representations of the resource available in the local archive.
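the post-processing step described above can be approximated with a short script. the following is a minimal sketch under stated assumptions: it uses the warcio library rather than whatever tooling the authors used, pairs uri-rs with a deliberately simplified mobile/desktop host heuristic, and writes figure-5-style metadata records to a separate warc file; the input file names are placeholders.

```python
# a minimal sketch of the post-processor: pair request records whose uri-rs look
# like mobile/desktop permutations of one another, then emit a figure-5-style
# metadata record for each member of a pair. uses warcio (an assumption, not the
# authors' tool); the pairing heuristic and file names are simplified placeholders.
from io import BytesIO
from urllib.parse import urlsplit
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def canonical_host(uri):
    # strip a leading "m." or "www." so permutations of a host compare equal
    host = urlsplit(uri).netloc.lower()
    for prefix in ("m.", "www."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    return host

def collect_requests(warc_paths):
    # gather (uri-r, record id, user-agent) for every request record
    found = []
    for path in warc_paths:
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "request":
                    continue
                found.append((record.rec_headers.get_header("WARC-Target-URI"),
                              record.rec_headers.get_header("WARC-Record-ID"),
                              record.http_headers.get_header("User-Agent") or ""))
    return found

def write_metadata(out_path, pairs, contributor_ip="192.168.1.7"):
    # write one metadata record per paired representation (cf. figure 5)
    with open(out_path, "wb") as out:
        writer = WARCWriter(out, gzip=False)
        for (uri, rec_id, ua), (_, other_id, _) in pairs:
            fields = ("http-header-user-agent: {0}\r\n"
                      "contributor-ip: {1}\r\n"
                      "hasversion: {2}\r\n"
                      "geo-location: \r\n"
                      "description: \r\n").format(ua, contributor_ip, other_id)
            record = writer.create_warc_record(
                uri, "metadata",
                payload=BytesIO(fields.encode("utf-8")),
                warc_content_type="application/warc-fields",
                warc_headers_dict={"WARC-Concurrent-To": rec_id})
            writer.write_record(record)

# a naive pairwise comparison, mirroring the o(n^2) matching described below
requests_seen = collect_requests(["desktop.warc", "mobile.warc"])  # placeholder names
pairs = [(a, b) for a in requests_seen for b in requests_seen
         if a is not b and a[2] != b[2] and canonical_host(a[0]) == canonical_host(b[0])]
write_metadata("metadata-additions.warc", pairs)
```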
the metadata record in figure 5 identifies the warc record for the desktop representation of cnn.com (urn:uuid:d338e6e5-6854-329b-adbb-de70a62e11f0) as a personalized representation of the mobile warc record of cnn.com (urn:uuid:cc7f75cc-fbaa-352a-8939-7cf5dd7792c7) via the hasversion value. it also records the user-agent for the concurrent record and the ip address of the contributor (in this case, the host running the heritrix crawler) via the contributor-ip value. the geographic location of the ip address is also identified via the geo-location value. in this record, the contributor uses a multicast address that cannot be mapped to a geographic location. this metadata record can be expanded to include additional information about the representation being described through the description value. a complementary metadata record for the mobile representation of cnn.com identifies a link back to the desktop representation. this creates a bidirectional link for personalized representations so that each personalized representation identifies its counterpart.

the examples in this paper have focused on one-to-one links between personalized representations (a single desktop and a single mobile representation). however, it is possible to have one-to-n relationships identified in the metadata (multiple desktop representations and multiple mobile representations). this would identify multiple personalized representations for a single resource. the post-processing of warcs will add time and complexity to the process of ingesting warcs into a repository. since records in warc files are not sorted, our post-processing script performs a pairwise match between two sets of mementos, which runs in o(n²) time. the metadata records that we add only increase the warc file size by approximately 1 kb per personalized representation. this is a small size increase for warc files that range from hundreds of mb to two gb.

recognizing personalized representations

to replay the warcs and provide a means to experience the relationship that we established between the mobile and desktop captures, we proposed an implementation consisting of a modification of the wayback machine with an additional user interface element that allows the user to quickly toggle between the captured representations of the page. in this prototype (figure 6), we provide a drop-down menu that displays all versions available for the current uri as established by the metadata record (figure 5) that we added to the warc during post-processing.

figure a: mobile and desktop versions of the cnn.com front page exist in archives. the user has the option of which version to access.
figure b: two source locations of the cnn.com front page are available from washington, dc and portsmouth, va. the user has the option of which version to access.
figure 6: modifications to the local wayback machine add indexing that links the uri currently being replayed to additional representations of the memento in the archive.

in the wayback toolbar we have provided a means for a user to quickly view other representations available for the page being replayed. the dropdowns are conditionally displayed based on whether additional representations are available.
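the dropdown in figure 6 is populated from metadata records like the one in figure 5. as a small illustrative sketch (not the authors' wayback modification), the application/warc-fields payload of such a record can be parsed into the key/value pairs a replay ui would need; the parser below ignores warc-fields line folding for brevity, and the sample values are those shown in figure 5 and discussed above.

```python
# a small sketch: turn the application/warc-fields payload of a figure-5-style
# metadata record into the key/value pairs a replay ui could use to populate an
# "other representations" dropdown. line folding is ignored for brevity.
def parse_warc_fields(payload: bytes) -> dict:
    fields = {}
    for line in payload.decode("utf-8").splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            fields[name.strip().lower()] = value.strip()
    return fields

sample = (b"http-header-user-agent: mozilla/5.0 (compatible; heritrix/3.1.0)\r\n"
          b"contributor-ip: 192.168.1.7\r\n"
          b"hasversion: urn:uuid:cc7f75cc-fbaa-352a-8939-7cf5dd7792c7\r\n"
          b"geo-location: multicast\r\n")

meta = parse_warc_fields(sample)
print(meta["hasversion"])  # the urn of the paired (mobile) representation
```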
by comparing the various user-agent strings available in combination with the ip of the crawler, we can allow the user to choose which dimension (e.g., geoip or user-agent) is most important (only the user-agent dimension is shown) and retain that setting in traversing the available mementos. when a user selects an option in our modified ui, the choice is stored in the user's browser as a cookie. this cookie is then read to retrieve a temporally adjacent memento from the timemap for the uri. whereas replaying the archive is usually limited to the dimensions of what (uri) and when (memento-datetime), the ui extension adds the additional dimensions of where (geoip) and how (user-agent).

conclusion

current technologies introduce the opportunity for collisions in web archives. the granularity of uri-ms makes differentiating between representations impossible with the current configurations in archives. temporal data alone is no longer sufficient to describe a memento; environmental variables must also be recorded and presented to the user, allowing the user to navigate between multiple dimensions of a representation. users can then decide whether browsing on the temporal dimension or the environmental parameters is more suitable for their goals. we present examples of the representation collision problem as well as a proof-of-concept solution. we use a post-processor to analyze warcs and insert metadata identifying related representations of mementos. through a modified wayback machine, we allow users to navigate personalized representations of mementos through their environmental parameters as well as on the temporal dimension. our future work will focus on allowing users to nominate candidates for merging as related personalized representations, effectively providing a tool for crowd-sourcing a solution to the problem of personalized representations.

acknowledgements

this work is supported in part by nsf grant 1009392 and the library of congress.

references

[1] k. benjamin, g. von bochmann, m. dincturk, g.-v. jourdan, and i. onut. a strategy for efficient crawling of rich internet applications. in web engineering, lecture notes in computer science, pages 74—89, 2011. http://doi.org/10.1007/978-3-642-22233-7_6 [2] e. benson, a. marcus, d. karger, and s. madden. sync kit: a persistent client-side database caching toolkit for data intensive websites. in proceedings of the 19th international conference on world wide web, www '10, 2010. http://doi.org/10.1145/1772690.1772704 [3] s. chakrabarti, s. srivastava, m. subramanyam, and m. tiwari. memex: a browsing assistant for collaborative archiving and mining of surf trails. in proceedings of the 26th vldb conference, 2000. http://dl.acm.org/citation.cfm?id=758378 [4] e. crook. web archiving in a web 2.0 world. in proceedings of the australian library and information association biennial conference, pages 1—9, 2008. http://doi.org/10.1108/02640470910998542 [5] c. duda, g. frey, d. kossmann, and c. zhou. ajaxsearch: crawling, indexing and searching web 2.0 applications. proc. vldb endow., 1:1440—1443, august 2008. http://dl.acm.org/citation.cfm?id=1454195 [6] b. fleiss. seo in the web 2.0 era: the evolution of search engine optimization, 2007. [7] google. ajax crawling: guide for webmasters and developers, 2013. [8] i. jacobs and n. walsh. architecture of the world wide web, volume one. technical report w3c recommendation 15 december 2004, w3c, 2004. [9] a. jindal, c. crutchfield, s. goel, r. kolluri, and r. jain.
the mobile web is structurally different. in infocom workshops 2008, ieee, pages 1—6, 2008. http://doi.org/10.1109/infocom.2008.4544648 [10] e. kiciman and b. livshits. ajaxscope: a platform for remotely monitoring the client-side behavior of web 2.0 applications. in the 21st acm symposium on operating systems principles, sosp '07, 2007. http://doi.org/10.1145/1294261.1294264 [11] k.vikram, a. prateek, and b. livshits. ripley: automatically securing web 2.0 applications through replicated execution. in proceedings of the conference on computer and communications security, november 2009. [12] b. livshits and s. guarnieri. gulfstream: incremental static analysis for streaming javascript applications. technical report msr-tr-2010-4, microsoft, january 2010. [13] t. macchi. 2012 mobile traffic report: how much did mobile traffic increase in 2013?, january 2013. [14] a. mesbah, e. bozdag, and a. van deursen. crawling ajax by inferring user interface state changes. in web engineering, 2008. icwe '08. eighth international conference on, pages 122 —134, july 2008. http://doi.org/10.1109/icwe.2008.24 [15] a. mesbah and a. van deursen. an architectural style for ajax. software architecture, working ieee/ifip conference on, pages 1—9, 2007. http://doi.org/10.1109/wicsa.2007.7 [16] l. a. meyerovich and b. livshits. conscript: specifying and enforcing fine-grained security policies for javascript in the browser. ieee symposium on security and privacy, 0:481—496, 2010. http://doi.org/10.1109/sp.2010.36 [17] j. mickens, j. elson, and j. howell. mugshot: deterministic capture and replay for javascript applications. in proceedings of the 7th usenix conference on networked systems design and implementation, nsdi'10, 2010. [18] g. mohr, m. kimpton, m. stack, and i. ranitovic. introduction to heritrix, an archival quality web crawler. in proceedings of the 4th international web archiving workshop (iwaw04), september 2004. [19] k. c. negulescu. web archiving @ the internet archive. presentation at the 2010 digital preservation partners meeting, 2010. [20] m. l. nelson. memento-datetime is not last-modified, 2011. [21] @nesbittbrian. play framework sample application with jwebunit and synchronous ajax. [22] netpreserver.org. iipc future of the web workshop — introduction & overview, 2012. [23] m. e. pierce, g. fox, h. yuan, and y. deng. cyberinfrastructure and web 2.0. in high performance computing and grids in action, pages 265—287, 2008. [24] s. raghavan and h. garcia-molina. crawling the hidden web. technical report 2000-36, stanford infolab, 2000. [25] r. schneider and f. mccown. first steps in archiving the mobile web: automated discovery of mobile websites. in jcdl '13: proceedings of the 13th acm/ieee-cs joint conference on digital libraries. http://doi.org/10.1145/2467696.2467735 [26] technical committee iso/tc 46. the warc file format (iso 28500), 2008. [27] j. tennison. hash uris, 2011. [28] p. timmins, s. mccormick, e. agu, and c. wills. characteristics of mobile web content. in 1st ieee workshop on hot topics in web systems and technologies, 2006, hotweb '06, pages 1—10, 2006. http://doi.org/10.1109/hotweb.2006.355263 [29] b. tofel. 'wayback' for accessing web archives. in proceedings of the 7th international web archiving workshop (iwaw07), 2007. [30] h. tweedy, f. mccown, and m. l. nelson. a memento web browser for ios. in jcdl '13: proceedings of the 13th acm/ieee-cs joint conference on digital libraries. http://doi.org/10.1145/2467696.2467764 [31] h. van de sompel, m. l. nelson, and r. 
sanderson. http framework for time-based access to resource states — memento draft-vandesompel-memento-06, 2013. [32] h. van de sompel, m. l. nelson, r. sanderson, l. l. balakireva, s. ainsworth, and h. shankar. memento: time travel for the web. technical report arxiv:0911.1112, 2009. [33] h. van de sompel, r. sanderson, m. l. nelson, l. l. balakireva, h. shankar, and s. ainsworth. an http-based versioning mechanism for linked data. in proceedings of the linked data on the web workshop (ldow 2010), 2010. (also available as arxiv:1003.3661). [34] w3c staff and working group participants. hash uris, december 2011. [35] d. f. zucker. what does ajax mean for you? interactions, 14:10—12, september 2007. http://doi.org/10.1145/1288515.1288523

about the authors

mat kelly is a phd student of computer science at old dominion university. he is employed by nasa langley research center through science systems and application, incorporated (ssai) of hampton, virginia; blade agency of gainesville, florida and old dominion university research foundation of norfolk, virginia. his expertise lies in finding new ways to reinvent the wheel, introducing needless complication and overcoming data protection schemes. he is a scorpio and is happily married with two dogs and resides in portsmouth, virginia.

justin f. brunelle is a computer science graduate student at old dominion university. his work involves the impact of multi-state, client-side representations on web archives and how such representations are preserved in the archives. justin is also a senior application developer at the mitre corporation where he performs research in the cloud computing and big data domains. more information on justin can be found at http://www.justinfbrunelle.com/.

michele c. weigle is an associate professor of computer science at old dominion university. her research interests include digital preservation, web science, information visualization, and mobile networking. she received her phd in computer science from the university of north carolina at chapel hill.

michael l. nelson is an associate professor of computer science at old dominion university. prior to joining odu, he worked at nasa langley research center from 1991-2002. he is a co-editor of the oai-pmh, oai-ore, memento, and resourcesync specifications. his research interests include repository-object interaction and alternative approaches to digital preservation. more information about dr. nelson can be found at: http://www.cs.odu.edu/~mln/.

copyright © 2013 mat kelly, justin f. brunelle, michele c. weigle and michael l. nelson

enduring access to rich media content: understanding use and usability requirements

d-lib magazine september/october 2015 volume 21, number 9/10

madeleine casad, oya y. rieger and desiree alexander cornell university library {mir9, oyr1, dca58}@cornell.edu doi: 10.1045/september2015-casad

abstract

through an neh-funded initiative, cornell university library is creating a technical, curatorial, and managerial framework for preserving access to complex born-digital new media objects. the library's rose goldsen archive of new media art provides the testbed for this project.
this collection of complex interactive born-digital artworks is used by students, faculty, and artists from various disciplines. interactive digital assets are far more complex to preserve and manage than single uniform digital media files. the preservation model developed will apply not merely to new media artworks, but to other rich digital media environments. this article describes the project's findings and discoveries, focusing on a user survey conducted with the aim of creating user profiles and use cases for born-digital assets like those in the testbed collection. the project's ultimate goal is to create a preservation and access practice grounded in thorough and practical understanding of the characteristics of digital objects and their access requirements, seen from the perspectives of collection curators and users alike. we discuss how the survey findings informed the development of an artist questionnaire to support creation of user-centric and cost-efficient preservation strategies. although this project focuses on new media art, our methodologies and findings will inform other kinds of complex born-digital collections.

1 introduction

despite its "new" label, new media art has a rich 40-year history, making obsolescence and loss of cultural history an imminent risk. as a range of new media are integrated into artworks, these creative objects are becoming increasingly complex and vulnerable due to dependence on many technical and contextual factors (delve, et al., 2012). the phrase "new media art" denotes a range of creative works that are influenced or enabled by technological affordances. the term also signifies a departure from traditional visual arts (e.g., paintings, drawings, sculpture, etc.). another characteristic of new media art that adds further complications to the preservation process is its interactive nature. works in this genre often entail, and indeed rely on, interactions between artists and viewers/observers.

in 2013, cornell university library received a research and development grant from the national endowment for the humanities to design a framework for preserving access to digital art objects. the preservation and access frameworks for digital art objects (pafdao) project was undertaken in collaboration with cornell university's society for the humanities and the rose goldsen archive of new media art, a collection of media artworks housed in the library's division of rare and manuscript collections. the project aims to develop scalable technical frameworks and associated tools to facilitate enduring access to complex, born-digital media objects, working primarily with a testbed of nearly 100 optical discs from the holdings of the goldsen archive. the preservation model developed will apply not merely to new media artworks, but to other rich digital media environments (for instance see kirschenbaum, et al., 2010). many of the issues we have been addressing within the framework of this project apply to other rich digital contents, not limited to artistic productions. from the beginning, the project team has recognized that both metadata frameworks and access strategies would need to address the needs of future as well as current media art researchers. toward that end, we developed a survey targeting researchers, artists, and curators to expand our understanding of users and use cases.
this article summarizes key findings of the survey and describes their impact on our current preservation and access frameworks and future plans.1

2 about the collection

the ultimate aim of the pafdao project is to create generalizable new media preservation and access practices that will be applicable for different media environments and institutional types. the nature of the project's test collection, a set of cd-rom artworks from cornell's rose goldsen archive of new media art2, has meant that the project provides a case study in new media preservation that may be informative to library and museum contexts alike.

rose kohn goldsen (1917-1985) was a professor of sociology at cornell university and an early critic of commercial mass media's impact on social and ethical imagination. named in her honor, the rose goldsen archive of new media art was founded in 2002 by professor timothy murray (director, society for the humanities, cornell university) in the cornell library division of rare and manuscript collections as an international research collection for scholars of new media and media art history (murray, 2008). since its founding, the goldsen archive has grown to achieve global recognition as a prominent research collection that documents more than 60 years of the history of aesthetic experimentation with electronic communications media. these collections span the two most crucial decades in the emergence of digital media art, from 1991 to the present, tracing the historical shift in emphasis within media culture from disc-based to networked and web-based applications. they also mark the early stirrings of a networked, interactive digital culture that has subsequently become the global norm. the goldsen archive constitutes a vital record of our cultural and aesthetic history as a digital society.

the pafdao project focused on a subset of born-digital media artworks on cd-rom. these artworks were created for small-screen, single-user experience, and dated back as far as the early 1990s. the cultural significance of such artworks is great. among other things, they represent the early development of interactive interfaces that are now a major part of our everyday life, as well as artists' exploration of the expressive possibilities these new multimedia interfaces had to offer. despite their cultural value, and their relatively recent production, such artifacts present serious preservation challenges and obsolescence risks. to begin with, no archival best practices yet exist for preserving such assets. many are stored on fragile storage media like optical discs, meaning that physical damage as well as data degradation or "bit rot" pose serious dangers to the integrity of the information. in the case of the pafdao project's test collection, many of these discs were artist-produced and irreplaceable.

interactive digital assets are, furthermore, far more complex to preserve and manage than single, uniform digital media files. a single interactive work can comprise an entire range of digital objects and dependencies, including media files in different types and formats, applications to coordinate the files, and operating systems to run the applications. if any part of this complex system fails, the entire asset can become unreadable. this danger is especially acute in the case of artworks. in most cases, interactive digital artworks are designed to create unique, multimedia experiences for users.
even a relatively minor problem with an artwork's rendering (for example, an obsolete media player that no longer operates as expected) has the potential to significantly compromise an artwork's "meaning." simply migrating information files to another storage medium is not enough to preserve their most important cultural content. when the pafdao project began, approximately 70 percent of the artworks in the test collection could not be accessed at all without using legacy hardware—a specialized computer terminal that runs obsolete software and operating systems. the project's objective was to provide "best-feasible" access to artworks, and document the distance between "feasible" and "ideal," as best we could understand it.

very soon after beginning pafdao, the project team realized that, contrary to our initial assumptions, operating system emulation would be a viable access strategy at scale for our complex digital media holdings (for information about emulation, see lange, 2012). embracing emulation as an access strategy meant that the team could provide better access more easily to more artworks in the collection. though increasingly feasible, however, emulation is not always an ideal access strategy: emulation platforms can introduce rendering problems of their own, and emulation usually means that users will experience technologically out-of-date artworks with up-to-date hardware. this made it all the more important for the team to survey media art researchers, curators, and artists, in order to gain a better sense of the relative importance of the artworks' key characteristics for different kinds of media archive patrons.

3 about the survey

we developed a questionnaire that presented users of media archives with a number of open-ended, largely qualitative and non-restrictive questions about their needs, goals, and preferences. in january 2014, we circulated the questionnaire on several preservation, art, and digital humanities mailing lists. the pafdao team initially hoped that survey results would support the identification of "personas," or broad profiles of media archives users who shared similar needs and preferences. we hoped that these profiles would direct both metadata framework and access provisions. as it happened, no such clear classifications emerged, yet questionnaire results were still vastly informative, and shaped the development of the pafdao project in integral ways. in the remainder of this paper, we offer an overview of noteworthy trends and comments, then discuss the conclusions we draw from these results and their impact on the pafdao workplan and preservation framework.

4 survey results

a total of 170 people responded to the questionnaire. respondents came from disparate geographical locations, including the us, germany, france, uk, australia, and argentina. of 170 respondents, 122 responded as an individual researcher or practitioner and 48 responded on behalf of an archive, museum, or other cultural heritage institution.3 we did not observe any significant differences in the responses of these two groups (personal and institutional responses), possibly due to the fact that even at an institutional level, new media projects and collections are led by small, specialized teams of committed individuals. respondents often held multiple roles and characterized themselves non-exclusively as artists (48%), researchers (47%), educators (25%), curators (20%), and collection specialists (24%).
the scope of digital media art collections respondents worked with was also broad, and included digital installation, video and images, interactive multimedia, audio, 3-d visualization, and websites. the key impetus behind the survey was to understand what kinds of research questions and needs were motivating users to search for and use media works. this information is critical for the research team to identify and assess the nature and extent of the viewing experience that needs to be preserved. in aggregate, respondents gave almost equal weight to artistic, social, historical, cultural, aesthetic, and technical research frameworks. several described pedagogical uses and how they use media works in teaching and learning. some sample research questions include:

how are technologies assisting the exploration of political issues by artists?
how do you bring the work to the viewer through the interactive power of technologies?
do digital works explore something further than the analog approaches can do?
how do technologies support and stimulate community engagement?
how are access issues for individuals from lower economic backgrounds being addressed?
what are the possible implications of gender in digital media artworks?
what does it mean to view an artwork that is designed for an old tv set in a larger installation?

the respondents cited a number of serious impediments they had encountered in conducting research involving new media art. for example, they mentioned the lack or insufficiency of documentation and metadata, discovery and access provisions, and technical support. those who use new media collections in support of teaching and learning listed several impediments, such as vanishing webpages, link rot, poor indexing, a gap in coverage for works from the 1980s and 1990s, and the lack of quality documentation. also often underscored were the complexity of legal issues and access rights. one respondent pointed out that, due to a widespread "disinterest in preserving the cultural artifacts of the digital age," there is a lack of understanding of the importance of these objects for cultural history. another comment noted infrequent access requests and therefore difficulties in justifying institutional investment in preservation efforts for future use. one of the respondents wrote, "in a society that is rushing headlong into the future, it is vital that we preserve the efforts of those who have early works in this new culture." another commented that, as technologies evolve, some works become very easy to create, and some users therefore don't understand the significance of a work and how complicated a piece it was to produce at the time. such sentiments underscore the importance of documenting cultural context to situate the work from artistic, historic, and technical perspectives.

for practicing artists, there were several concerns about the longevity of their creative work. some expressed concern about the difficulty of selling works that may become obsolete within a year. many worried that it was difficult to store or archive immersive installations, interactive pieces, and works with dependencies on external files. they also mentioned copyright issues as a significant challenge. many emphasized the importance of historical contexts, usability, and discovery. one of them pointed out that archiving has become a part of his practice and he feels the pressure to consider future uses as he goes through the creative process.
for curators of new media art, many indicated that they don't include born-digital interactive media in their holdings, either because such materials fall outside of their collecting scope or because the procedures for providing access are too complex or unsustainable. for those who collect this genre, the biggest concerns were trying to identify which aspects of interaction experiences to preserve and how to capture as much information as possible to assist future users. out of the twenty survey respondents who answered on behalf of an educational or cultural institution, only one organization could claim a sophisticated and integrated web-based discovery, access, and preservation framework. the others indicated that access needed to be arranged through special arrangement, such as setting an appointment. they cited a range of preservation strategies they rely on, including migration, creation of search and discovery metadata, maintaining a media preservation lab, providing climate-controlled storage, and collecting documentation from the artists.

5 content authenticity and authentic user experience

as mentioned above, the pafdao survey of users of media archives did not, as we had hoped, result in the definition of clear user profiles or personas. however, it had several important effects on the pafdao project. first, we noted a significant concern among our respondents for "authenticity", understood as a cultural rather than technical concept. the international research on permanent authentic records in electronic systems (interpares) project defines an authentic record as "a record that is what it purports to be and is free from tampering or corruption" (macneil, et al., 2001, referenced in dietrich & adelstein, 2015). verifying the bit-level self-identity of a digital object over time can be accomplished relatively easily with checksums, automated fixity checks, and collection audits. when working with cultural artifacts, however, "authenticity" becomes a more nebulous and controversial concept. conservation measures undertaken to restore an artwork to some approximation of its original appearance may, in fact, alter its original form in ways that can affect its meaning. this is especially true in the case of artworks conceived to be ephemeral or experiential, or works that involve "contemporary" technologies that become obsolete, even obscure, over time. our questionnaire respondents seemed to appreciate this difficulty. reading across the complete pool of responses, we noted that the desired sense of "authenticity" derived not from some naïve sense of the object's pristine originality, but rather from a sense that the archiving institution has made a good-faith commitment to ensuring that the artist's creative vision has been respected, and to providing the necessary interpretive context for understanding that vision, along with any unavoidable deviations from it.

we had excellent models for addressing these concerns. within the last ten to fifteen years, many arts organizations have joined forces to develop shared practices for the conservation not only of technology-based media, but also of difficult-to-document arts such as performance, video art and multimedia installations.
examples include independent media arts preservation (imap); the variable media network; matters in media art (a collaborative project between the tate, the new art trust (nat) and its partner museums, the museum of modern art (moma) and the san francisco museum of modern art (sfmoma)); and incca (international network for the conservation of contemporary art).4 the most significant commonality of these initiatives is their shared emphasis on appropriate documentation. while some complex time-based artworks can never be authentically replicated, it is generally agreed that, with proper documentation, many can be reinterpreted, adapted and revived for modern audiences. in cultural heritage organizations, this documentation can take the form of technical and descriptive metadata tailored for the breadth and specificity of new media, detailed installation instructions, detailed exhibition histories, and so forth. above all, practices for working directly with artists have been especially important conservation tools, and the initiatives cited above provide excellent models for how artist interviews can aid efforts to preserve complex artworks; see, for example, the variable media questionnaire (depocas, et al., 2003).

in response to these considerations raised by our user survey, we developed a conservation-oriented artist questionnaire and interview process, pushing the integration of archival protocols as far upstream as possible, to the point of content creation and initial curation. enlisting the help of our project advisors, we worked with existing models, but adapted these models significantly. we streamlined and simplified our artist questionnaire to address specific aspects of our emerging preservation and access framework. we were particularly concerned about communicating with artists and enlisting their input about our decision to rely on operating system emulation as a default access strategy. though easy and readily scalable, emulation introduces variations into the rendering of artworks that artists might not have anticipated; it was clear that we would need to work with artists wherever possible to ensure that artworks' most significant properties and interpretive contexts were preserved, and not obscured, by our access measures.

6 artist questionnaire

the pafdao questionnaire is designed to be a first step in a two-part process, gathering essential information but also laying the groundwork for a more conversational interview process where possible. first and foremost, the questionnaire elicits artists' input in identifying the most significant properties of individual media artworks by asking about the artists' initial vision for the work, and by posing open-ended questions about the relationship between artistic vision, technology, and historical contexts. the questionnaire also asks fundamental technological questions (e.g., "what software or programming language was used to create this artwork?" and "what hardware and software were optimal for running this artwork when it was new?"). we inquire as to whether artists still have the working files they used in creating the artwork, including source code; these would constitute a deep technological and historical context for the works, and also an invaluable resource for future conservation work (engel & wharton, 2014). we also ask about related artworks or websites, and whether any of these materials may have been archived by another person or institution.
networks of collaboration between archiving institutions will become more and more important in preserving cultural, historical, and technological contexts of reference that will be essential to understanding these artworks. the questionnaire also discloses foreseeable problems in our chosen access frameworks, including specific rendering issues that might come about with different emulation platforms:

we have found virtual machine emulation to be an effective strategy for providing research access to interactive digital artworks. running older artworks in an emulation environment may involve changes to the look and feel of the original artwork. our default access strategy is likely to involve:
current, commercial-grade hardware and peripherals (mouse, screen, keyboard, etc.)
color shift associated with the change from crt to led monitor screens
possible alterations to the speed of animation and interactive responsiveness
possible changes to audio quality
presentation of digital surrogates rather than original physical materials that may have accompanied the artwork (discs, booklets, cases, etc.)

we ask artists to describe how such changes might affect their initial vision for the work. we also request permission to provide works in emulation, outline the kinds of documentation we expect to provide archive users, and invite artists to work with us on supplementary or alternate forms of documentation if they choose:

we expect to present users with a general statement about the effects of our emulation environments on the rendering of an artwork. if you would like to author or co-author a more specific statement about how these changes may affect your work, we can provide researchers with this information as well. in some cases, we may be able to provide additional documentation of original rendering conditions. please let us know if you would like to discuss these possibilities further.

finally, the questionnaire provides us with an opportunity to revisit rights agreements, which must be updated in light of new access technologies, and an opportunity to invite further conversation (a follow-up interview) and collaboration with the artist.

7 concluding remarks

a recurring theme in our findings involved the difficulties associated with capturing sufficient information about a digital art object to enable an authentic user experience. this challenge cannot and should not be reduced to the goal of ensuring bit-level fixity checks or even providing technically accurate renderings of an artwork's contents as understood on the level of individual files. as rinehart & ippolito (2014) argue, the key to digital media preservation is variability, not fixity. the trick is finding ways to capture the experience, or a modest proxy of it, so that future generations will get a glimpse of how early digital artworks were created, experienced, and interpreted. so much of new media works' cultural meaning derives from users' spontaneous and contextual interactions with the art objects. espenschied, et al. (2013) point out that digital artworks relay digital culture and "history is comprehended as the understanding of how and in which contexts a certain artifact was created and manipulated and how it affected its users and surrounding objects." for a work to be understood and appreciated, it is essential for the archiving institution to communicate a cultural and technological framework for interpretation.
as one user survey respondent noted, some works that come across as mundane now may have been among the highly innovative trailblazers of yesterday. given the speed of technological advances, it will be essential to capture these historical moments to help future users understand and appreciate such creative works. the pafdao survey of users of media archives affirmed the importance of institutions like the rose goldsen archive, which is able to provide a breadth of media technological, historical, and cultural contexts to researchers and educators through its extensive and accessible collections.5 it also underscored the need for archiving institutions to be in contact with one another, and to be conscious of the need for greater integration of discovery and access frameworks across multiple institutions as they move forward in developing new preservation plans and access strategies for their collections. providing appropriate cultural and historical contexts for understanding and interpreting new media art is part of each institution's individual mission, but also a matter of collective importance, given the rarity of such collections, the numerous challenges of establishing preservation protocols, and the overall scarcity of resources.

as we conclude, we must emphasize that, as artists have increasing access to ubiquitous tools and methodologies for creating complex art exhibits and objects, we should expect to see an increasing flow of such creative works to archives, museums, and libraries. it is nearly impossible to preserve these works through generations of technology and context changes. therefore, diligent curation practices are going to be more essential than ever in order to identify unique or exemplary works, project future use scenarios, assess obsolescence and loss risks, and implement cost-efficient strategies.

acknowledgements

we would like to express our gratitude to the national endowment for the humanities for supporting this project, to the project advisory board, to consultants chris lacinak, kara vanmalssen, and alex duryee of avpreserve, and to the pafdao project team, including timothy murray (co-pi), dianne dietrich, desiree alexander, jason kovari, danielle mericle, liz muller, and michelle paolillo.

notes

1 an early version of this report is available at dsps press: the blog of cornell university library's division of digital scholarship and preservation services. see "interactive digital media art survey: key findings and observations: dsps press".

2 the goldsen archive's holdings include media formats such as reel-to-reel videotape, floppy disks, database artworks housed on external hard drives, and works of net.art. all of these formats pose unique and significant preservation challenges. for more information, please see the goldsen archive website.

3 out of 170 respondents, 80 fully and 32 partially completed the survey, and 58 took a quick look without responding. we suspect that the incomplete responses indicate a combination of curiosity and unfamiliarity with the program area, as media art research, curation, and practice still constitute fairly specialized fields. only twenty-four respondents indicated that their institutions include born-digital interactive media artworks and artifacts in their holdings. several respondents who identified as curators indicated that born-digital interactive media would fall outside the scope of their collections.
in some cases, they also noted that procedures for providing access to such materials are prohibitively complex or unsustainable.

4 for further information and documentation please see http://imappreserve.org/, http://variablemedia.net/, http://www.tate.org.uk/about/projects/matters-media-art, and http://www.incca.org/

5 cornell university library's commitment to provide broad and democratic access to its special collections was a key reason why founding goldsen archive curator timothy murray located the goldsen collections within the library. cornell's division of rare and manuscript collections has notably open policies for user access; see http://rmc.library.cornell.edu/ for more information.

references

[1] delve, j. et al. (2012). the preservation of complex objects. volume one: visualizations and simulations. [2] depocas, a., ippolito, j., jones, c., eds. (2003). permanence through change: the variable media approach. guggenheim museum publications, ny & daniel langlois foundation, montreal. [3] dietrich, d., adelstein, f. (2015). archival science, digital forensics, and new media art. digital investigation, volume 14, supplement 1, august 2015, proceedings of the fifteenth annual dfrws conference. http://doi.org/10.1016/j.diin.2015.05.004 [4] espenschied, d., rechert, k., valizada, i., von suchodoletz, d., russler, n. (2013). "large-scale curation and presentation of cd-rom art", ipres 2013. [5] engel, d., wharton, g. (2014). reading between the lines: source code documentation as a conservation strategy for software-based art. studies in conservation 59(6): 404—415. http://doi.org/10.1179/2047058413y.0000000115 [6] kirschenbaum, m. et al. (2010). digital forensics and born-digital content in cultural heritage collections. clir. [7] lange, a. (2012). keep strategy paper. [8] macneil, h., wei, c., duranti, l. (2001). authenticity task force report. interpares. [9] murray, t. (2008). thinking electronic art via cornell's goldsen archive of new media art. neme: the archival event. [10] rinehart, r., ippolito, j. (2014). re-collection: art, new media, and social memory. leonardo. cambridge, massachusetts: the mit press.

about the authors

madeleine casad is curator for digital scholarship at cornell university library. as associate curator of the rose goldsen archive of new media art, she manages an exciting collection of media objects that present a wide range of preservation and access challenges. she coordinates many of the library's digital humanities initiatives, and plays a leading role in education and outreach programs to promote the innovative use of digital collections in humanities scholarship. she holds a phd in comparative literature from cornell university.

oya y. rieger is associate university librarian for scholarly resources and preservation services at cornell university library. she provides leadership for full lifecycle management of scholarly content, including selection, creation, design, maintenance, preservation, and conservation. she is interested in current trends in scholarly communication with a focus on needs assessment, requirements analysis, business modeling, and information policy development. she holds a phd in human-computer interaction (hci) from cornell university.

desiree alexander is the pafdao collections analysis assistant and has worked with the goldsen archive since 2012, assisting with the goldsen's experimental video and digital media preservation projects.
she is also co-lead in surveying cornell's a/v assets to locate at-risk materials campus-wide in an effort to develop preservation and access strategies. she holds an ms in information studies and an ma in public history from suny albany, and an undergraduate degree in art history from ithaca college.

copyright © 2015 madeleine casad, oya y. rieger and desiree alexander

who is reading on-line education journals? why? and what are they reading?

d-lib magazine december 2002 volume 8 number 12 issn 1082-9873

lawrence m. rudner eric/university of maryland marie miller-whitehead tvee.org jennifer s. gellmann eric/university of maryland

one thoughtful examination of the literature estimates that a typical article published in a scientific journal in the u.s. is read about 900 times [tenopir, 2000]. in contrast, some of the electronic journals in education appear to be having a far greater impact. it is not uncommon for an article in education policy analysis archives (epaa) to be read more than 10,000 times; several articles have been viewed more than 30,000 times. the 100 articles in practical assessment, research & evaluation (pare), a more specialized electronic journal, had averaged more than 7,000 views per article as of february 2002. in september 2002, pare readership reached the one million mark. this large difference between print and e-journals indicates that the readership of electronic journals is clearly not the same as the readership of traditional print journals. the questions addressed by this article, then, are the three posed in the title: who is reading selected electronic education journals? why? and what are they reading? the answers to these questions should prove useful to authors as well as future editors of on-line journals, as both authors and editors usually want to be responsive to the needs and desires of their readers. the present study sought to answer these questions by compiling and analyzing the following data sources: (a) the results of an on-line survey of readers of the journals, (b) access statistics for the two journals, and (c) a content analysis of the most popular articles of each of the journals.

related literature

much of the research into the use of electronic scholarly journals has examined the publication process and various technological aspects (for example, [edwards, 1997; peters, 2000]). there have been fewer studies of the actual users of electronic journals. one major study [eason, et al., 2000] examined user logs to determine readership and use. based on the frequency, breadth, and depth of visits to particular journals and to specific articles, eason and his colleagues defined a variety of user categories:

"enthusiastic users", who viewed many journals and articles. this group was small (0.9%), composed primarily of social scientists and post-graduate students.
"focused regular users" (4.9%), who viewed few journal titles, but accessed them frequently. this group consisted mostly of research scientists in the "hard" sciences and post-graduate students.
"specialized, occasional users" (11.6%), who infrequently accessed a few specific journal titles. they were divided between scientists and social scientists, and were again typically post-graduates or academicians.
"restricted users" (23.1%), who were similar to the specialized, occasional users, but accessed journals even less frequently.
they were primarily biological scientists.
"lost", "exploratory", "tourists" or "searchers", who were non-repeat users. the users grouped under these terms either used the system only once, or simply registered for the superjournal project but didn't return to explore the system again. they crossed all academic disciplines. some liked the service but did not have time to use it thoroughly, while others simply did not like the project.

one of the major findings of this study was that enthusiastic users represented a very small percentage of total users. most were "restricted users" or "tourists". another major user study [liew, et al., 2000] presented an in-depth survey to a purposive sample of 83 graduate students to study their use and perceptions of e-journals. the results of the study revealed that a vast majority of graduate students (73%) preferred e-journals over print journals. commonly cited reasons were links to additional resources, searching capability, currency, availability, and access ease.

citation analysis is commonly used to evaluate the impact of an article, author, or journal. examining citation data for 39 scholarly, peer-reviewed e-journals, harter [harter, 1996] found that the great majority had essentially no impact on scholarly communication in their respective fields. few articles were cited and those that were cited were not cited frequently. citation analysis, however, may not be fully appropriate here. the citation rate in education is terribly low. rudner, burke, and rudner [rudner et al., 2001] found that articles in four educational measurement journals were cited an average of only 1.2 to 2.5 times within three years by authors of other articles in the same journal. further, the two journals under investigation here—education policy analysis archives (epaa) and practical assessment, research and evaluation (pare)—are relatively new and are not fully included in citation databases.

an alternative to citation analysis is link analysis. as described by ng, zheng, and jordan [ng et al., 2001], link analysis has been successfully applied to infer "authoritative" web pages. in the academic citation setting, it can also be used to identify "influential" articles. thus, citations may not be as revealing as links. in march 2002, google.com reported 2,360 links to epaa and 686 links to pare. these values are quite respectable for academic journals. logically, it follows that the more links there are to a journal or a journal article, the more likely it is that a potential reader will find it on the web and access it. the potential effect on readership is similar to that which might be expected if a print journal were made available in 10,000 libraries compared to one available in only 10 libraries. the "link effect" may, in fact, help to offset the rather poor search strategies of many of eric's patrons that can affect the number of times a journal article is accessed. hertzberg and rudner [hertzberg & rudner, 1999] and [rudner, 2000a] examined the quality of on-line eric database searches and found that while 95% of search strategies were relatively unsophisticated (or even "horrible"), even the most diligent searchers examined only about 5 or 6 of the citations resulting from a query. the results of their survey of all eric users indicated that about half used the resource for report preparation.
methods
to study the readership of two on-line education journals, we used two methods: 1) a brief readership questionnaire, and 2) an in-depth content analysis of the more popular articles in epaa and pare, including an analysis of key words and retrieval counts. the readership questionnaire built on a brief questionnaire used by rudner [rudner, 2000a]. for a few days, a short questionnaire popped up in a small window the first time a user accessed the home page of either epaa or pare. users could readily see that they were being asked only two short questions—one about their position and another concerning the purpose of their visit. to minimize the obnoxiousness factor, a cookie was left with the user's browser, regardless of whether he or she responded, to prevent the survey from being launched a second time from that computer. in the past, we have had extremely high response rates (> 80%) using this technique. the first author of this article conducted a content analysis of the publications to identify underlying constructs of the most-often-accessed articles of each journal. the research question of interest was, "what were common themes, topics, methodologies, or perspectives of the most-often-accessed pare and epaa articles?" the number of times the articles were accessed from the journal web site was taken as an indicator of readership. the content analysis also considered the following features:
- eric descriptors assigned to each article (when available)
- article titles
- words used frequently by the authors within the full-text version of the articles
using a count of the number of times an article has been accessed electronically as an outcome variable has several limitations, not least of which is that for each time an article is accessed or downloaded electronically, it may be printed, disseminated, and read numerous times, almost ad infinitum. however, the use of access statistics is a commonly accepted method of analyzing web site traffic and computer usage and so was included as an outcome variable or category of interest in this analysis, under the assumption that over a given period of time, the articles that are accessed most often are most reflective of the interests of a journal's primary readership. a preliminary step in conducting a content analysis is to become familiar with the data, in this case the journal articles. the next step is to identify and code "independent, mutually exclusive, and exhaustive" categories [stemler, 2001]. for this article, the analysis sought to identify themes, methodological approaches, policy perspectives, or topics in the most-often-accessed pare and epaa articles. in conducting a content analysis, the categories of interest may be specified a priori, such as those provided by the thesaurus of eric descriptors, or they may be identified by the researcher as categories specific to the research questions and texts of interest in the analysis. the current analysis of recent pare and epaa articles included eric descriptor categories as well as researcher-defined categories that emerged during the course of textual analysis and that were identified by (a) close readings and (b) computer-assisted parsing of the articles. although eric descriptors provided useful information in a content analysis, they did not provide the same information as a formal textual analysis. authors may not define their discussions in terms of eric descriptors; thus the connotative implications of word choice offer insight into an author's perspective on issues.
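a minimal sketch of how the two derived measures described here (average accesses per month and within-text keyword frequency) might be computed and then correlated; the article records, dates, and keyword stems below are invented placeholders rather than the journals' actual data, and scipy is assumed to be available.

```python
# sketch of the study's two derived measures and their correlation,
# using a handful of invented article records.
import re
from datetime import date
from scipy.stats import pearsonr

ARTICLES = [  # (publication date, total accesses, full text) -- all hypothetical
    (date(1999, 11, 1), 21000, "scoring rubrics describe the criteria and standards ..."),
    (date(2000, 3, 1),  15000, "performance assessment and rubrics in the classroom ..."),
    (date(2000, 9, 1),   4000, "item response theory and adaptive testing methods ..."),
    (date(2001, 2, 1),   9000, "standards, criteria and rubrics for student evaluation ..."),
    (date(2001, 6, 1),   2500, "multiple regression and difficulty level estimation ..."),
]
AS_OF = date(2001, 12, 31)  # assumed end of the study window

def months_online(published):
    # at least one month, so brand-new articles do not divide by zero
    return max(1, (AS_OF.year - published.year) * 12 + (AS_OF.month - published.month))

def keyword_hits(text, stems=("rubric", "standard", "criteri")):
    # crude prefix matching so "rubric" also counts "rubrics", etc.
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for w in words if w.startswith(stems))

accesses_per_month = [total / months_online(pub) for pub, total, _ in ARTICLES]
rubric_counts = [keyword_hits(text) for _, _, text in ARTICLES]

r, p = pearsonr(rubric_counts, accesses_per_month)
print(f"r = {r:.2f}, p = {p:.3f}, n = {len(ARTICLES)}")
```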
according to hertzberg and rudner [hertzberg & rudner, 1999], overlap in descriptors often presents a challenge in the "social sciences in general as terms are less well defined, more fluid and less strictly hierarchical than in the physical sciences." because article titles and words and phrases from the eric abstracts provide on-line database users with information to assist them in selecting those articles most relevant to their criteria (and serve as motivators to access the articles), these data were also analyzed. in fact, journal titles and eric descriptors, identifiers, and abstracts provide a potential journal reader with a "first impression" that results in either motivating him or her to read the article or to continue the search for more relevant sources. the present study analyzed articles published on-line since september of 1999 because this was the date the e-journal pare came on-line. this decision was based partly on correcting for "time in publication" because epaa full-text articles have been available on-line since 1993. the outcome variable that emerged from this process was defined as the average number of accesses per month of pare and epaa articles published from september 1999 through december 2001, thus providing data for articles available electronically for a period of time ranging from several months to two years. while we recognize that readership often peaks shortly after publication, we felt this was a convenient and useful metric for this study. the following section presents results based on both highest number of accesses since publication as well as the highest average accesses per month. a little more than half of the most-accessed articles also had the highest number of accesses per month since publication. with 261 titles published in epaa and 100 in pare, the complete data set consisted of 361 titles.

results

who
the first question on the pop-up survey asked the users to identify their primary role. as shown in table 1, there was a great deal of commonality across journals and across time. graduate students comprised the largest group of users, followed by teachers, researchers, and undergraduate students.

table 1. self-reported primary roles of pare and epaa visitors

                            february 2002              november 2000
role                        % pare      % epaa         % pare      % epaa
                            (n=1964)    (n=617)        (n=200)     (n=590)
researcher/prof             23.5        33.9           26.5        23.1
  researcher                17.0        26.4           15.0        17.3
  college professor          6.5         7.5           11.5         5.8
k-12 teacher/admin          20.5        17.6           20.5        19.7
  k-12 teacher              14.9        10.9           12.5        12.9
  k-12 librarian             0.8         0.5            7.0         4.9
  k-12 staff                 1.0         1.5            0           0.2
  k-12 administrator         3.8         4.7            1.0         1.7
college students            45.3        42.4           47.0        46.9
  undergraduate student     12.2        13.9           19.5        16.6
  graduate student          33.1        28.5           27.5        30.3
other                       10.7         6.2            6.0        10.3
  college librarian          1.1         6.2            0.5         2.7
  parent                     3.7         1.0            1.0         1.2
  journalist                 0.2         0.5            0           0.3
  other                      5.8         3.6            4.5         6.1

the researcher category is somewhat anomalous because there are simply not that many non-academic positions in education bearing the title "researcher." perhaps this was a bias introduced by the fact that it was the first response option. there are some 3,000,000 k-12 teachers [mcdowell, 2001], 986,000 undergraduate students in colleges of education [usde, 2000], 604,000 graduate students in colleges of education [usde, 2000], and 89,000 professors in colleges of education [market data retrieval, 2002].
if readership were proportionate to audience type, one would expect approximately 1% of the readers of electronic journals to be college professors, 40% to be teachers, 13% to be undergraduate students, and 8% to be graduate students. thus, electronic journals appear to be disproportionately attractive to graduate students and college professors. based on the numbers, teachers would appear to be under-represented. however, the journal reading behavior of teachers in the classroom cannot be expected to be the same as the reading behavior of professors and students. thus, the fact that approximately 20% of readers self-identify as k-12 teachers and administrators may be viewed as a relatively high percentage. in terms of the eason, richardson, and yu [eason et al., 2000] study, it appears that pare and epaa readers can be categorized as "enthusiastic users", "focused regular users", or "specialized, occasional users".

why
the second survey question asked the users to identify the purpose of their visit. as shown in table 2, there was again a great deal of commonality across journals. the two most common reasons for reading pare were research report preparation and class assignment. given the scopes of the two journals, it is not surprising that relatively more epaa readers are looking for information to inform policy and relatively more pare readers are looking for teaching resources. it is encouraging to note that approximately 18% of the readers are visiting the sites for personal or professional interest.

table 2. self-reported reasons for visiting the pare and epaa web sites[1]

purpose                         % pare (n=1964)    % epaa (n=617)
background for policy making          3.2                9.2
class assignment                     28.0               30.3
research report preparation          37.9               36.5
professional interest                13.0               14.7
personal interest                     5.7                3.2
find teaching resources               8.6                4.4
other                                 3.7                1.6

what
pare and epaa require that authors adhere to various technical specifications, including article length. pare articles in the content analysis ranged in length from approximately 1,700 to 4,100 words and averaged 2,500 words. in comparison, the most-accessed epaa title was nearly 6,500 words long and several were more than 9,000 words in length. therefore, the journals were considered separately for several of the following analyses. in studying the pare titles most often accessed, several dominant themes emerged. first, a "rubric-standard-criteria" triad dominated the word count of articles that were most often accessed. second, although the categories were not mutually exclusive, the most often accessed titles were discussions of teachers/teaching, scores/scoring, grades/grading, students, evaluations, and assessment. in a sample of 18 pare articles selected from those most and least accessed since publication, the following word counts were noted:

table 3. word counts of pare sample (n = 18)

word                           n     mean    sd
assess, assessment(s)         423     25     27
evaluate, evaluation(s)       224     12     16
grade(s), grading              82      5      6
reliability/validity          151      8     15
rubrics/criteria/standards    561     31     50
score(s)/scoring              396     23     38
teacher(s)/teaching           356     20     26
test(s)/testing               506     28     40
table 4. comparison of average frequency of word usage in pare articles

word or phrase                 most-accessed    least-accessed
assess, assessment                   29                4
evaluate, evaluation                 15                4
grade(s)/grading                      3                9
performance(s)                       12                3
rubrics/criteria/standards           38                7
reliability/validity                 10                2
teacher(s)/teaching                  24                3

although they were not included in table 4, there was no meaningful difference between most- and least-accessed articles in the average number of uses of the words/phrases "test/tests/testing" and "score/scores/scoring." there was a positive correlation between the average number of times a title was accessed each month and the number of times "rubric(s)," "standard(s)," or "criteria" appeared in the article (r = .67, p < .01, n = 18). table 5 provides the bivariate correlations of article retrievals and frequently-used words in pare's most- and least-often-accessed titles. "evaluation methods" and "student evaluation" were the most frequent eric descriptors of the pare titles that had been retrieved the most times on average since publication, followed by "elementary secondary education." eric descriptors for titles accessed least often included "adaptive testing," "computer assisted testing," "multiple regression," "item response theory," "item banking," "difficulty level," and "limited english proficient." eric descriptors that appeared in both most-accessed and least-accessed titles were "test construction" and "test scores." "assessment(s)" was a title word in 6 of 13 of the most-accessed pare articles. words that appeared in more than one of the most-accessed pare titles included "classroom," "evaluate," "implement," "portfolio," "rubrics," "scoring," "teachers," "tests," and "when." the average number of words in the titles of most-accessed articles was 5.5.

table 5. correlations between pare word counts and retrievals (n = 18)

                                 1      2      3      4      5      6      7      8      9
1 retrievals                     -     .45    .16    .67**  -.31   .39    .55*   .49*   -.14
2 teacher(s)/teaching                   -     .27    .06    -.04   -.01   .70**  .01    -.19
3 student(s)                                   -     .39    .12    .36    .62*   -.12   -.13
4 rubrics/standards/criteria                          -     .05    .87**  .37    .52*   -.16
5 grade(s)/grading                                           -     .28    -.01   -.19   -.09
6 score(s)                                                          -     .25    .35    -.18
7 assessment(s)                                                            -     .02    -.03
8 evaluate/evaluation                                                             -     -.23
9 test(s)                                                                                -

note. ** p < .01, * p < .05. the sample included most- and least-accessed articles to minimize restriction of range in the correlation.

pare articles addressing issues about teachers and students in the classroom were more often accessed than articles focusing on statistical procedures and measurement, i.e., psychometrics. as noted by the author of another widely read pare title on assessment fundamentals, "it is important to understand the difference between measurement evidence (differentiating degrees of a trait by description or by assigning scores) and evaluation (interpretation of the description or scores)" [mcmillan, 2000]. there are, after all, more k-12 classroom teachers, administrators, and students of education who are concerned about teaching to criteria, standards, and rubrics than there are statisticians and measurement professionals who hold the same concerns. with one exception, the pare articles did not emphasize educational policy or politics; however, one of the most often accessed titles addressed accountability, content standards, reform, and policymaking relative to high-stakes testing and the assessment of all students [linn, 2001].
this article offered recommendations for safeguards in the system and concluded that unintended negative effects of the high-stakes accountability uses often outweigh the intended positive uses. the number of times this article has been accessed provides compelling testimony to educators' concern about student achievement and high-stakes testing in an environment of increasing federal, state, and district accountability. table 5 provides simple correlations of frequently used words in pare articles. these words were taken out of context; a thorough analysis should give some consideration to the context within which they were most often used. "characteristics of performance standards" and "strengths and weaknesses of content standards" [linn, 2001] provide richer contextual clues to the various meanings of standards than is possible with a single identifier. "rubrics are descriptive scoring schemes," "scoring rubrics describe what is expected" [moskal, 2000], and "a rubric is a rating system by which teachers can determine at what level of proficiency a student is able to perform a task or display knowledge of a concept" [brualdi, 1998] are more explanatory of the authors' intention when using the term "rubrics" than can be indicated by simply counting the number of times these words appeared in pare articles. another pare author restricted a discussion on "assessment" to "authentic assessment" and then contextualized the somewhat ambiguous phrase: "authentic assessments require students to be effective performers with acquired knowledge" [wiggins, 1990]. although the occurrence of the words "test(s), testing" did not differentiate between articles that were or were not likely to be accessed, an article on criterion- and norm-referenced testing that concluded, "as long as the content of the test matches the content that is considered important to learn, the crt gives the student, the teacher, and the parent more information about how much of the valued content has been learned than an nrt" [bond, 1996], had been accessed more than 21,000 times. portfolio assessment was another topic popular with pare readers. portfolios may assess teacher work or student work: thus, "a teacher portfolio is designed to demonstrate ...talents...knowledge and skill in teaching" [doolittle, 1994], or a teacher may encourage "students to organize their work and compare various items within their portfolios using rubrics ... checklists, and award stickers" [forgette-giroux & simon, 2000]. another popular pare article explored teacher motivation from the perspective of two behavioral psychology models widely used in organizational management [gawel, 1997]. the epaa article most frequently accessed since the journal's inception in 1993 was published in march 1999. as of march 2002, this one article had been accessed more than 52,000 times and had elicited several published responses on epaa that had together been accessed a total of nearly 20,000 times. these three titles (the original article and the responses to it) were detailed examinations of home-schooling, student achievement, and the interaction effect between home-schooling, church affiliation, and student achievement on standardized tests [arai, 1999; rudner, 1999; welner & welner, 1999]. for titles in publication since september 1999, the most-often-accessed epaa title was about teacher quality and student achievement [darling-hammond, 2000].
that article, prepared by a widely recognized leader in the field and on a hot topic, concludes "that improving the quality of teachers in the classroom will do more for students than other strategies designed to raise student achievement." eric descriptors of most-often-accessed pare and epaa articles were equally likely to include "student achievement," "elementary secondary education," "standardized tests" and "accountability", while most-accessed epaa titles were more likely to have eric descriptors such as "state programs," "politics of education," and "educational policy" and to include discussions of education equity litigation at the state level. while pare articles were more likely to provide guidelines and standards for the use of evaluation methodologies or for conducting various types of assessments, epaa articles were more likely to present evidence (including case studies) of the efficacy of state testing and accountability programs on academic standards or of state policies on student and teacher testing on academic standards. most-read epaa articles were more likely to address such policy-related issues as the effect of block scheduling on student achievement, disparities by ethnicity and poverty in access to technology and the use of technology in the classroom, and the necessity for designing curricula specifically for the technology of the internet. interestingly enough, although articles about statistics were less often accessed than articles on other topics, the most popular epaa articles [darling-hammond, 2000; rudner, 1999] made very effective use of statistics and included statistical analysis summary tables and graphs, as well as an explanation about the choice of statistical methods used for the analyses.

discussion
using a short readership survey and a content analysis of the most frequently accessed articles from two education journals provides insight into the on-line journal readership and their needs and interests. survey response information can be used by editors to encourage submissions in high-interest areas and assure potential contributors of the high visibility of their contributions. we found that these journals appear to be reaching a larger and wider audience than many print journals. articles are downloaded thousands of times compared to fewer than 1,000 requests for scholarly print journals. a large number of readers are teachers and others from the k-12 community. most readers reported their primary role as college student, researcher, or college professor. readers reported that the primary purpose of their visit was to assist with class assignments and report preparation. large numbers also visited for personal or professional interest. even though certain topics drew a substantially larger readership than others, it is important to note that some diversity in titles provides journal audiences with the opportunity to acquire information on unfamiliar topics or to expand their perspectives on an issue of interest to the professional education community. thus, it is important to provide the educational community not only with titles of general interest, but also with well-written articles that address more specialized evaluation, assessment, and policy analysis topics. our examination of the most-read topics—home-schooling, rubrics, standards, politics—revealed a keen interest in currency. these topics are not well covered in traditional print journals. print publication lag time significantly limits the usefulness of print articles on current topics.
electronic journals, however, are able to publish on current topics, and readers apparently readily consume articles on current topics. the inherent ability of electronic media to provide immediacy of response is important, particularly for those who must be cognizant of changes not only in educational policy, but in the evolution of terminology that often accompanies such discussions. the association of learned and professional society publishers (alpsp) recently conducted a large-scale survey, the purpose of which was to discover what motivated researchers to publish in journals, and how they decided where to publish, as well as their concerns about the current system, and what they wanted or expected in the future [swan & brown, 1999]. questionnaires were sent to 11,500 contributors to journals published in the u.k., the u.s., and elsewhere. swan and brown found that the main aim of contributors was to reach the widest possible audience, with the quality of peer review and the impact factor of the journal the main factors of importance in achieving their overall publishing objectives. in deciding where to submit their work, the perceived reputation of the journal, its impact factor, subject area, international reach and coverage by abstracting and indexing services are extremely important. with their wide and diverse readership, pare and epaa clearly achieve the first goal: the widest possible audience. both of these journals provide live usage statistics to clearly document impact. for aspiring faculty, journal reputation is extremely important. there is a fear that tenure committees may under-value on-line journals relative to more established print journals. in a may 1999 article in the chronicle of higher education, kiernan wrote "scholars are worried...that electronic publication will not carry much credit toward tenure, or that electronic journals might fail, carrying prized papers with them into oblivion...that electronic journals are likely to be less permanent than printed journals." the authors obviously feel tenure committees that undervalue electronic journals are misguided. these journals are often as rigorous as print journals and have a much greater impact in terms of educating readers. based on this study, we offer the following suggestions to editors and publishers of on-line journals:
- emphasize and solicit papers on current topics.
- select materials that recognize the diverse nature of the audience.
- provide for permanently archiving published articles.
- get the journal indexed by education index and the current index to journals in education.
- gather and publish usage statistics.
- educate potential authors and tenure committees on the impact and value of your on-line journal.
a question that arises from rudner's survey of the percentage of k-12 readership is the extent to which the vast majority of educators, i.e., the classroom teachers, have ready access to computers with internet connections. the goal of having at least one internet-connected computer in each school is close to becoming a reality, but for many teachers, having an internet-connected computer in the classroom or at home is still a dream [hoffman, novak, & schlosser, 2000].
we close with the following quote from an epaa article on the use of technology: "the dream scenario is that the information age will help bring about the kinds of educational change that reformers have pushed for all century, with schools becoming sites of critical collaborative inquiry and autonomous constructivist learning as individuals and groups work with new technologies to solve authentic problems under the guidance of a facilitative teacher" [warschauer, 2000].

references
[arai] arai, a. (1999). homeschooling and the redefinition of citizenship. education policy analysis archives, 7(27). available on-line: .
[bond] bond, l. a. (1996). norm- and criterion-referenced testing. practical assessment, research & evaluation, 5(2). available on-line: .
[brualdi] brualdi, a. (1998). implementing performance assessment in the classroom. practical assessment, research & evaluation, 6(2). available on-line: .
[darling-hammond] darling-hammond, l. (2000). teacher quality and student achievement: a review of state policy evidence. education policy analysis archives, 8(1). available on-line: .
[doolittle] doolittle, p. (1994). teacher portfolio assessment. practical assessment, research & evaluation, 4(1). available on-line: .
[eason et al.] eason, k., richardson, s., & yu, l. (2000). patterns of use of electronic journals. journal of documentation, 56(5), 477-504.
[edwards] edwards, j. (1997). electronic journals: problem or panacea? ariadne, 10. available on-line: .
[epaa] epaa log (2002). usage statistics for education policy analysis archives. available on-line: . last checked: 13-mar-2002.
[forgette-giroux] forgette-giroux, r., & simon, m. (2000). organizational issues related to portfolio assessment implementation in the classroom. practical assessment, research & evaluation, 7(4). available on-line: .
[gawel] gawel, j. e. (1997). herzberg's theory of motivation and maslow's hierarchy of needs. practical assessment, research & evaluation, 5(11). available on-line: .
[harter] harter, s. p. (1996). the impact of electronic journals on scholarly communication: a citation analysis. the public-access computer systems review, 7(5). available on-line: .
[hertzberg] hertzberg, s., & rudner, l. (1999). the quality of researchers' searches of the eric database. 7(25). available on-line: .
[hoffman et al.] hoffman, d., novak, t., & schlosser, a. (2000). the evolution of the digital divide: how gaps in internet access may impact electronic commerce. journal of computer-mediated communication, 5(3).
[kiernan] kiernan, v. (1999). why do some electronic-only journals struggle, while others flourish? the chronicle of higher education, 45(37), p. a25.
[liew et al.] liew, c.l., foo, s., & chennupati, k.r. (2000). a study of graduate student end-users' use and perception of electronic journals. online information review, 24(4), 302-315.
[linn] linn, r. l. (2001). assessments and accountability (condensed version). practical assessment, research & evaluation, 7(11). available on-line: .
[market data] market data retrieval (2002). education mailing lists: interactive catalog, faculty, department chairs, and deans by discipline. available on-line: . accessed oct 16, 2001. last checked: 13-mar-2002.
[mcdowell] mcdowell, l. (2001). early estimates of public elementary/secondary education survey, 2000-01. education statistics quarterly, 3(1). u.s. department of education. available on-line: .
[mcmillan] mcmillan, j. h. (2000). fundamental assessment principles for teachers and school administrators.
practical assessment, research & evaluation, 7(8). available on-line: .
[moskal] moskal, b. m. (2000). scoring rubrics: what, when and how? practical assessment, research & evaluation, 7(3). available on-line: .
[ng et al.] ng, a. y., zheng, a. x., & jordan, m. i. (2001, august). link analysis, eigenvectors, and stability. proceedings of the seventeenth international joint conference on artificial intelligence, seattle, wa. available on-line: .
[pare] pare log (2002). user statistics for practical assessment, research and evaluation.
[peters] peters, s. (2000). epress: scaling up electronic journal production. ariadne, (23). available on-line: .
[rudner, 1999] rudner, l. (1999). scholastic achievement and demographic characteristics of home school students in 1998. education policy analysis archives, 7(8). available on-line: .
[rudner, 2000a] rudner, l. (2000a). who is going to mine digital library resources? and how? d-lib magazine, 6(15). available on-line: .
[rudner, 2000b] rudner, l. (2000b). who is using some web resources. available on-line: .
[rudner, 2001] rudner, l.m. (2001). how many people search the eric database each day? available on-line: .
[rudner et al.] rudner, l. r., burke, j., & rudner, l. (2001). is something happening to measurement scholarship? newsletter of the national council on measurement in education, 9(2), p. 1. available on-line: .
[stemler] stemler, s. (2001). an overview of content analysis. practical assessment, research & evaluation, 7(17). available on-line: .
[swan & brown] swan, a., & brown, s. (1999). what authors want. west sussex, uk: the association of learned and professional society publishers. available on-line: .
[tenopir] tenopir, c. (2000). towards electronic journals. psycoloquy, 11(084). available on-line: .
[usde] u.s. department of education (2000). "table 214, enrollment in postsecondary education, by major field of study, age, and level of student: 1995-96." table source: digest of educational statistics, chapter 3, postsecondary education. data source: u.s. department of education, national center for education statistics, "the 1995-96 national postsecondary student aid study," unpublished data.
[warschauer] warschauer, m. (2000). technology and school reform: a view from both sides of the tracks. education policy analysis archives, 8(4). available on-line: .
[welner] welner, k. m., & welner, k. g. (1999). contextualizing homeschooling data: a response to rudner. education policy analysis archives, 7(13). available on-line: .
[wiggins] wiggins, g. (1990). the case for authentic assessment. practical assessment, research & evaluation, 2(2). available on-line: .

copyright © lawrence m. rudner, marie miller-whitehead, and jennifer s. gellmann
doi: 10.1045/december2002-rudner

reconstructing the past through utah sanborn fire insurance maps: a geospatial approach to library resources
d-lib magazine, march/april 2015, volume 21, number 3/4
justin b. sorensen
j. willard marriott library, university of utah
justin.sorensen@utah.edu
doi: 10.1045/march2015-sorensen

abstract
cartographic maps have the ability to convey information and ideas in ways text cannot. the utah sanborn fire insurance maps are one such resource, depicting detailed information on buildings, layouts, compositions and boundaries of cities and towns. as time has progressed, the interest in these resources has continued to grow, opening the door for the creation of an updated method for viewing and examining this valuable collection. through the incorporation of gis and geospatial technology, the printed and scanned materials have been converted into georeferenced raster datasets, allowing viewers the ability to geospatially interact with the information and apply the information to their research in new and exciting ways. this article describes the digital scholarship lab's endeavor to convert these valuable resources into research-driven geospatial datasets, providing a new format for how the library information is presented as well as a new method for interacting with and examining the information in detail.

introduction
sanborn fire insurance maps are a highly requested resource in libraries. each collection of hand-made maps contains detailed surveyor information on commercial, industrial and residential sections of cities and towns ranging from the mid 1800s to the mid 1950s (figure 1). while the j. willard marriott library has scanned and created digital versions of the original printed maps, the digital scholarship lab established a goal to further develop, enhance and utilize the information contained within these maps. as a result, an innovative project was created that not only offers these resources openly to students, staff, faculty and visitors of the university of utah, but also creates a method for displaying and examining each map within physical space through the incorporation of geospatial software and 3-dimensional technology.

figure 1: portion of original scanned 1908 mount pleasant sanborn map (close-up detail).

project development and process
beginning in 1867, sanborn fire insurance maps (founded by daniel alfred sanborn) were created for the purpose of assessing fire insurance liability within urbanized areas of the united states, depicting detailed information on cities and towns consisting of building compositions and structural information, layouts, business and street information, property boundaries and much more ("sanborn maps", 2014). author kim keister describes this collection well, stating "the sanborn maps survive as a guide to american urbanization that is unrivaled by other cartography and, for that matter, by few documentary resources of any kind" ("sanborn maps", 2014). as time has progressed, each published volume and its descendent updated versions have become highly requested resources in fields such as historical research, planning, preservation and the study of urban geography ("sanborn maps", 2014). while many of these collections are available at academic institutions, most are archived in their original printed format for preservation, thus limiting the ability to openly obtain information contained within each detailed map. in 2000, the j. willard marriott library's digital technologies staff began the sizable task of scanning each utah sanborn fire insurance map contained within the special collections department (arlitsch, 2002).
this effort produced high-resolution scanned tiffs of the entire collection (numbering over 1,700 maps in total), made viewable through a webpage specifically designed to present the entire scanned map collection (figure 2). as time progressed and the interest in this resource continued to grow, an updated method for viewing this valuable collection while interacting with and examining the information closely was in order. what better way to accomplish this goal than to incorporate gis and geospatial technology? doing so converts the printed and scanned materials into geospatial raster data, gives the j. willard marriott library a new method and format for offering these resources, and provides a convenient way for individuals to access and examine the detailed information remotely.

figure 2: screenshot of the scanned utah sanborn fire insurance map portal located in the j. willard marriott library's digital library.

gis stands for geographic information system, an assemblage of computer hardware, software and data designed to examine and present geospatial data by combining spatial mapping and analysis with database technology. with software such as this, it becomes possible to turn ordinary printed data and imagery into geospatial data, identifying geographic features, locations or boundaries on the earth. for the utah sanborn fire insurance maps, this process would involve converting each scanned map into a georeferenced overlay. georeferencing is a process by which a 2-dimensional printed map or photograph's existence is defined within physical space by giving the image latitude and longitude coordinates dependent upon a particular map projection system, resulting in the digital map's alignment to its appropriate geographic location within a virtual environment. the original scanned tiffs created by the j. willard marriott library's digital technologies staff were first converted from large file sizes (approximately 148 mb per map) to smaller web-sized jpeg images (2-3 mb per map) by digital scholarship lab staff using photoshop, allowing for easy remote access while preserving image quality. the map collection was georeferenced using arcgis software, aligning each map to its appropriate geospatial location through established georeferencing protocols to maintain consistently projected digital overlays for all of the historic maps. these protocols include the use of reference layers composed of satellite imagery, street centerlines and parcel data as well as the utilization of a nad83 projected coordinate system (north american datum, 2014). the georeferenced maps were then converted to kmz files ("keyhole markup language", 2014) using global mapper software, allowing each map to be viewed in detail using the free and openly available google earth software (figure 3). while great care has been taken to maintain georeferencing accuracy, it is important to note that many map features have changed or vanished over time; each historic map has therefore been georeferenced as close to its appropriate geographic position as possible.

figure 3: digital aerial view of the georeferenced 1884 salt lake city sanborn map collection overlaid within google earth on present-day satellite imagery for comparison.
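the workflow above used photoshop, arcgis, and global mapper; a roughly equivalent open-source sketch using the gdal python bindings is shown below. the file names, ground control points, and the choice of epsg:26912 (nad83 / utm zone 12n) for georeferencing and epsg:4326 for the google earth overlay are illustrative assumptions, not the digital scholarship lab's actual parameters.

```python
# georeference a scanned sanborn sheet with ground control points, warp it,
# and export a google earth-ready kmz overlay. all paths, pixel positions,
# and coordinates below are placeholders.
from osgeo import gdal

gdal.UseExceptions()

# each ground control point ties a pixel/line position on the scan to a
# real-world easting/northing (here assumed to be epsg:26912 coordinates)
gcps = [
    gdal.GCP(425100.0, 4513200.0, 0.0, 120.0, 95.0),
    gdal.GCP(425600.0, 4513200.0, 0.0, 5020.0, 110.0),
    gdal.GCP(425600.0, 4512700.0, 0.0, 5005.0, 4980.0),
    gdal.GCP(425100.0, 4512700.0, 0.0, 135.0, 4960.0),
]

# attach the control points and their spatial reference to the scanned sheet
gdal.Translate("sheet_gcp.tif", "mount_pleasant_1908_scan.tif",
               GCPs=gcps, outputSRS="EPSG:26912")

# warp into geographic coordinates so the overlay drapes correctly in google earth
gdal.Warp("sheet_wgs84.tif", "sheet_gcp.tif", dstSRS="EPSG:4326")

# write a kmz superoverlay that google earth can open directly
gdal.Translate("sheet_overlay.kmz", "sheet_wgs84.tif", format="KMLSUPEROVERLAY")
```

in a real batch run this would be wrapped in a loop over the collection, with the control points for each sheet read from a file rather than hard-coded.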
with the completion of the georeferencing process, all georeferenced map files were delivered to staff members of uspace (the university of utah's institutional repository), who uploaded each georeferenced map individually as compound objects using contentdm software while entering geospatial metadata for each of the georeferenced maps, opening access to the newly created materials via url links accessible within the j. willard marriott library catalog. as the geospatial components were becoming openly available to students, staff, faculty and visitors of the university of utah, a method for quickly accessing the information and datasets contained within the new collection was in order. this led to a number of brainstorming sessions on how best to present the information and on the display components visitors would use prior to download. as a result, a library study guide hosted on the j. willard marriott library's website was created (figure 4), conveniently navigating visitors throughout the entire map collection. each city tab within the study guide represents one of several geographic locations depicted in the utah sanborn fire insurance map collection, displaying each geographic set by the year in which it was created, ranging from the mid 1800s to the mid 1950s. links to each year's collection are available for the original scanned tiffs as well as the newly georeferenced materials available for download, while embedded google earth gadgets offer visitors the opportunity to interact with each map collection within a 3-dimensional environment prior to download.

figure 4: screenshot of the project study guide hosted on the j. willard marriott library's website, including links to original scans, georeferenced materials and embedded interactive google earth gadgets displaying each collection by location.

benefits of this project
adding a geospatial component to library materials allows the information to be displayed, expressed and presented in ways standard printed or scanned information cannot. by utilizing gis and geospatial technology with the incorporation of geospatial datasets such as these, a new realm for library research is opened, allowing research institutions a new method for sharing information in a world more and more reliant on digital information. ranging from the creation of historical reconstruction models (figure 5) and interactive 3-dimensional model overlays (figure 6) to planning analysis or the study of change over time, individuals from multiple disciplines are now able to utilize the georeferenced utah sanborn fire insurance maps in new and exciting ways.

figure 5: detailed 3-dimensional model created by caitlyn tubbs (digital scholarship lab) based on information contained in a georeferenced 1950 salt lake city sanborn fire insurance map.

figure 6: interactive 3-dimensional model created by justin sorensen (digital scholarship lab) based on information contained in a set of georeferenced 1950 salt lake city sanborn fire insurance maps.

project results
this project has resulted in the creation of an innovative portal for all utah sanborn fire insurance maps hosted at the
j. willard marriott library, complete with links to individual maps contained within our collection; georeferenced maps available for download in both kmz (google earth) and zipped geo-raster jpeg (arcgis) formats for access by students, staff, faculty and visitors of the university of utah; and interactive google earth gadgets embedded within each geographic page of the project study guide, "reconstructing the past through utah sanborn fire insurance maps", displaying each collection of georeferenced maps while overlaying them on a virtual 3-dimensional model of the earth.1

conclusion
this project demonstrates not only how printed, scanned and highly-requested library resources such as the utah sanborn fire insurance maps can be converted into research-driven geospatial datasets, but also one of the many ways in which gis can be beneficial in sharing library collections while taking library research to a new level. geospatial technology is an amazing resource, and in a world continually converting towards a digital realm, gis will be one of the many tools libraries have available to assist them in geospatially sharing their resources with others.

acknowledgements
the author would like to acknowledge the work of fellow digital scholarship lab staff member caitlyn tubbs (geospatial data & visualization intern) for her georeferencing assistance and uspace staff members donald williams (ir coordinator) and cindy russell (ir workflow specialist) for their work uploading each of the map files and applying metadata.

notes
1 the gadgets contained on each city page of the study guide require a google earth plug-in to operate. if you experience trouble loading the interactive map windows, please verify that the plug-in is installed on your browser (recommended browser: firefox).

references
[1] "sanborn maps." wikipedia. wikimedia foundation.
[2] arlitsch, kenning. "digitizing sanborn fire insurance maps for a full color, publicly accessible collection." d-lib magazine, vol. 8, no. 7/8, july 2002. http://doi.org/10.1045/july2002-arlitsch
[3] "north american datum." wikipedia. wikimedia foundation.
[4] "keyhole markup language." wikipedia. wikimedia foundation.

about the author
justin sorensen is the gis specialist for the j. willard marriott library's digital scholarship lab. a graduate of the university of utah, justin has a strong background in geography and geospatial technology and has been creating, developing and managing geospatial projects for the digital scholarship lab since 2011.

copyright © 2015 justin sorensen

the music information retrieval evaluation exchange (mirex)
d-lib magazine, december 2006, volume 12 number 12, issn 1082-9873
j. stephen downie, graduate school of library and information science, university of illinois at urbana-champaign

1. introduction
the music information retrieval evaluation exchange (mirex) [1] is a community-based formal evaluation framework coordinated and managed by the international music information retrieval systems evaluation laboratory (imirsel) [2] at the university of illinois at urbana-champaign (uiuc). imirsel has been funded by both the national science foundation and the andrew w.
mellon foundation to create the necessary infrastructure for the scientific evaluation of the many different techniques being employed by researchers interested in the domains of music information retrieval (mir) and music digital libraries (mdl). for the past two years mirex participants have met under the auspices of the international conferences on music information retrieval (ismir). the first mirex plenary convened 14 september 2005 in london, uk, as part of ismir 2005. the second plenary of mirex 2006 was convened in victoria, bc on 12 october 2006 as part of ismir 2006. table 1 summarizes the wide variety of mir/mdl tasks that have been formally evaluated over the past two years. some of these tasks, such as "audio onset detection," represent micro-level mir/mdl research (i.e., accurately locating the beginning of music events in audio files, necessary for indexing). others, such as "symbolic melodic similarity," represent macro-level mir/mdl research (i.e., retrieving music based upon patterns of similarity between queries and pieces within the collections).

table 1. task lists for mirex 2005 and 2006

2005:
- audio artist identification
- audio drum detection
- audio genre identification
- audio melody extraction
- audio onset detection
- audio tempo extraction
- audio key finding
- symbolic genre classification
- symbolic key finding
- symbolic melodic similarity

2006:
- audio beat tracking
- audio cover song identification
- audio melody extraction (2 subtasks)
- audio music similarity and retrieval
- audio onset detection
- audio tempo extraction
- query-by-singing or humming (2 subtasks)
- score following
- symbolic melodic similarity (3 subtasks)

the tasks run for each mirex were defined by community input via a set of topic-based mailing lists and wiki pages [3]. in this sense, mirex is similar to the text retrieval conference (trec) [4] approach to the evaluation of text retrieval systems. both mirex and trec are built upon three basic components: a set of standardized collections; a set of standardized tasks/queries to be performed against these collections; and a set of standardized evaluations of the results generated with regard to the tasks/queries.

2. some challenges
mirex, however, does differ from trec in one very important aspect. unlike trec, the datasets used in the mirex evaluation process are not distributed freely among the participants. there are several overlapping reasons for this, including: the current litigious state of enforcement of music intellectual property; the reluctance of data contributors (i.e., the creators of some of the more labour-intensive ground-truth sets) to have their work distributed widely; and the general consensus that free distribution would create "overfitting" of submitted systems to specific collections of data. since moving the data to participants is problematic, mirex has the participants submit their algorithms to imirsel for running against the collections. this scenario makes imirsel personnel responsible for gathering (from various sources) and managing huge collections of music and ground-truth data (in a wide variety of formats). responsibilities also include verifying the integrity of the test data itself [note 1] and securing the data sets from malicious downloading. imirsel is also responsible for managing the massive amounts of intermediate data that are created during the evaluation process, usually in the form of very large feature sets and similarity matrices that are common to many audio-based techniques.
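an in-house evaluation loop of this kind (participants submit code, the lab runs it against held-back collections, checks that it exits cleanly, and validates its output) might look roughly like the sketch below; the command line, directory layout, expected output columns, and timeout are hypothetical stand-ins, not mirex's actual interfaces.

```python
# run one submitted algorithm against a held-back collection and verify
# that it exits cleanly and produces parseable output; every name here
# (paths, command line, expected columns) is a hypothetical stand-in.
import csv
import subprocess
import sys
from pathlib import Path

COLLECTION = Path("/secure/collections/audio_similarity")   # never shipped to participants
SUBMISSION = ["python", "submissions/team_a/run.py"]        # submitted entry point
OUTPUT = Path("results/team_a/audio_similarity.csv")

OUTPUT.parent.mkdir(parents=True, exist_ok=True)
# allow several days, since some evaluation runs are reported to take that long
proc = subprocess.run(SUBMISSION + [str(COLLECTION), str(OUTPUT)],
                      capture_output=True, text=True, timeout=4 * 24 * 3600)

if proc.returncode != 0:                       # standard completion/error codes
    sys.exit(f"team_a failed (exit {proc.returncode}): {proc.stderr[-500:]}")

with OUTPUT.open() as fh:                      # validate the defined output format
    rows = list(csv.reader(fh))
bad = [r for r in rows if len(r) != 3]         # e.g. query_id, candidate_id, score
if bad:
    sys.exit(f"team_a produced {len(bad)} malformed rows")

print(f"team_a ok: {len(rows)} result rows")
```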
as table 2 indicates, imirsel has run 164 algorithms over the past two years. for imirsel, the actual act of successfully running each of these algorithms has been one of its greatest challenges. during mirex 2006, for example, the submissions employed various combinations of 10 different programming languages (e.g., matlab, c, c++, max-msp, perl, python, java, etc.) and execution environments (e.g., *nix, windows, mac, etc.). notwithstanding constant reminders to the community to pay attention to defined input and output formats, not to dynamically link to non-existent specialized libraries, to have submissions return standard completion and error codes and to die gracefully when necessary, the vast majority of imirsel personnel time has been spent debugging submitted code and verifying the validity of the output sets.

table 2. summary data for mirex 2005 and 2006

                         2005    2006
number of tasks            10      13
number of teams            41      46
number of individuals      82      50
number of countries        19      14
number of runs             72      92

3. some advances
mirex 2006 marked the introduction of two important enhancements to the mirex framework: tests of statistical significance (i.e., the friedman test) and the use of human evaluators (i.e., the evalutron 6000).

3.1 friedman test
the friedman test, also known as friedman's anova, is a non-parametric test (i.e., it does not assume a normal distribution of the underlying data). since many retrieval result sets have non-normal distributions, the friedman test has been used in the trec domain for a number of years. it is used to determine whether there truly exist significant differences in system performances. for example, it helps determine whether system a with a "score" of "72" is really performing better than system b ("68") and/or system c ("65"), etc. properly set up, it also allows for the statistically valid pair-wise comparison of each of the system results to help researchers better understand system differences. in this regard, it is much superior to the commonly misused multiple student's t-tests. several mirex 2006 tasks underwent friedman's anova testing. these included "query-by-singing/humming," "audio cover song identification," and "audio music similarity and retrieval." similar to the results found in past trec evaluations, the friedman data for mirex 2006 indicate that most mir/mdl systems generally tend to perform on par with their peers (with a few outlying exceptions) and that most of the variance in the results appears across the various queries rather than between the systems themselves.

3.2 human evaluations and the evalutron 6000
a common complaint among mirex 2005 participants was the lack of any human ex post facto input in evaluating the various tasks. all mirex 2005 tasks had their ground-truth data determined a priori to the evaluation runs. this a priori system is adequate for such tasks as "onset detection," "audio key finding" and "audio cover song identification" as the "answers" to these tasks are not really subject to human interpretation. however, mirex 2006 participants wanted to take on two "real world" tasks ("audio music similarity and retrieval" and "symbolic melodic similarity") that required human evaluation of the results in order to best judge whether the results retrieved were truly similar in some way to each of the input queries. since there exists no a priori data on the similarity of query music pieces to all possibly returned music pieces, an ex post facto human evaluation system, called the "evalutron 6000", was developed for mirex by imirsel.
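a minimal sketch of the friedman comparison described in section 3.1, applied to per-query scores for three hypothetical systems; the scores are invented for illustration and scipy is assumed to be available.

```python
# friedman's anova on per-query scores for three hypothetical systems;
# each position in the lists corresponds to the same query across systems.
from scipy.stats import friedmanchisquare

system_a = [6.1, 4.8, 7.2, 5.5, 6.9, 5.0, 7.8, 6.4, 5.9, 6.6]   # fine scores, 0-10
system_b = [5.9, 4.5, 7.0, 5.6, 6.2, 4.8, 7.5, 6.1, 5.4, 6.3]
system_c = [4.2, 3.9, 5.1, 4.4, 5.0, 3.7, 5.9, 4.8, 4.1, 5.2]

stat, p = friedmanchisquare(system_a, system_b, system_c)
print(f"friedman chi-square = {stat:.2f}, p = {p:.4f}")
# a small p suggests at least one system differs; post-hoc pair-wise
# comparisons (with an appropriate correction) would then identify which pairs.
```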
after running the mirex 2006 "audio" and "symbolic" similarity tasks, the top-x results ("candidates") for each query ("seed") from each system were collated into "seed/candidate" sets with all source information removed (i.e., to make the evaluations "blind"). these "seed/candidate" sets were then mounted within the evalutron 6000, a web-based relational database system that presented randomly selected "seed/candidate" lists to the evaluators and recorded their evaluation scores for each "seed/candidate" pair. evaluators were drawn from members of the mir/mdl community, with their identities not disclosed to the participants. to minimize evaluator fatigue, a system of presenting subsets of the results was devised. table 3 presents the summary data concerning the distribution of "seed/candidate" sets among the evaluators.

table 3. summary data for the "audio similarity" and "symbolic similarity" human evaluations

                                                audio    symbolic
number of evaluators                               24          20
number of evaluators per query/candidate pair       3           3
number of queries per evaluator                   7.5          15
size of the candidate lists                        30          15
number of queries                                  60          17
number of evaluations per evaluator              ~210        ~225

figure 1 illustrates the evalutron 6000 interface as seen by an evaluator for one "seed/candidate" listing. the left-most column has the "seed" embedded within an audio player that allows the evaluator to listen to the "seed" song and to start, stop and rewind it at will. the next column has an embedded audio player (with similar functionalities) for each of the "candidate" songs to be evaluated for similarity with respect to the "seed" song. the third column (second from right) takes the "coarse" similarity score for each of the "seed/candidate" pairs. the "coarse" scoring options include "not similar," "somewhat similar," and "very similar". the right-most column takes the "fine" similarity score for the same "seed/candidate" pair, recorded on a scale from 0 (not similar) to 10 (highly similar). while more formal correlation analyses are currently underway, preliminary data do indicate both a strong consistency across the different evaluators (i.e., inter-rater reliability) and strong correlations between the "coarse" and "fine" scores. imirsel also plans on analyzing the log data associated with the evaluator interactions with the system to determine what improvements can be made to the mirex 2007 iteration of the evalutron 6000.

figure 1. the evalutron 6000 interface.

4. future developments
if mirex is to grow and thrive, it is obvious that more robust mechanisms need to be put into place to alleviate the intensive commitment of labour resources mirex places each year on the imirsel team debugging code and result sets. recent collaborations between imirsel and uiuc's automated learning group (alg) are opening up new opportunities for meeting this labour intensity challenge. imirsel has worked with alg before in the development of the music-to-knowledge (m2k) [5] music mining and evaluation framework. m2k is a java-based data-flow environment built upon the foundation of alg's modular data-to-knowledge (d2k) and text-to-knowledge (t2k) data mining systems [6]. imirsel has been using m2k to help simplify the in-house running of the mirex 2005 and 2006 evaluation tasks. over the past several years, alg has been developing a web service implementation of the d2k/t2k framework called d2kws (d2k web services).
the tomcat/java-based d2kws technology has matured enough for related *2k projects to begin experimenting with independent, domain-specific (e.g., music retrieval evaluation) prototype deployments. imirsel has set up an experimental d2kws system to begin demonstration and proof-of-concept work on several "do-it-yourself" (diy) mirex evaluation frameworks [7].

5. mirex "diy" frameworks: benefits and challenges
a principal benefit to be realized by the creation of prototype diy web service frameworks for mirex is the labour shift from the imirsel team back to the submitters themselves. if implemented correctly, this labour shift actually provides tangible benefits to the submitters in exchange for their added effort. for example, a properly realized prototype would be available to the community 24/7/365. the time constraints imposed by constant debugging of code have made the rerunning of previous evaluation tasks difficult (which hinders meaningful comparisons across years). also, because imirsel plans on having the "diy" system store all previous result sets, research productivity within the mir/mdl community should improve along two fronts. first, submitters intending to participate in a given mirex will have the ability to see how their algorithms are performing in near real-time with respect to their peer participants. currently, the participating labs only see the "final" results sets that are made available shortly before the mirex plenary, making the de facto research cycle a year long. second, non-participants who have novel mir/mdl techniques will be able to submit and evaluate on their own anytime during the year to quickly determine whether or not their techniques are reaching state-of-the-art effectiveness. notwithstanding the important benefits to be derived from the establishment of diy mirex services, there remain several significant challenges that need addressing. first, imirsel must ensure that only results data are transmitted from the diy system back to the participants. we are currently experimenting with several data choke/filtering schemes to make the transmission of music data impossible. second, imirsel needs to develop a permission system/policy that effectively shields the music data from malicious access attempts made by the submitted code sets. at this point, this remains an open question, so in the early days we will be opening the prototype to select external labs with which imirsel has built up a high trust relationship. third, imirsel needs to make formal assessments of the computational resources that will need to be dedicated to a 24/7/365 service. these are non-trivial as, for example, the current music collections are roughly 1 terabyte in size (and growing) and the "feature sets" generated by many of the algorithms can be larger than the underlying music they represent and can take hundreds of cpu hours to compute. fourth, and finally, we need to make the diy mirex service package easily transportable so other mir/mdl labs can take on some of the administrative work and make their own special collections available as standardized evaluation resources.

6. acknowledgements
dr. downie and his imirsel team are supported by the andrew w. mellon foundation and the national science foundation (nsf) under grant nos. nsf iis-0340597 and nsf iis-0327371.
note note 1: we are constantly surprised by the amount of "corrupted" data that makes its way into carefully collated music collections, including damaged audio files, empty midi files, mislabeled file headers, etc. since some task runs (e.g., "audio music similarity and retrieval") can take four days per algorithm to process, it is very important that the input data not cause a system crash during day three of a run. 7. references [1] downie, j. stephen, kris west, andreas ehmann and emmanuel vincent (2005). the 2005 music information retrieval evaluation exchange (mirex 2005): preliminary overview. in proceedings of the sixth international conference on music information retrieval (ismir 2005), london, uk, 11-15 september 2005. london, uk: queen mary, university of london, pp. 320-323. available: . [2] downie, j. stephen, joe futrelle and david tcheng (2004). the international music information retrieval systems evaluation laboratory: governance, access, and security. in fifth international conference on music information retrieval (ismir 2004), 10-14 october 2004, barcelona, spain. barcelona, spain: universitat pompeu fabra, pp. 9-14. available: . [3] see . [4] see . [5] see . [6] see . [7] see . copyright © 2006 j. stephen downie doi:10.1045/december2006-downie   d-lib magazine january 2002 volume 8 number 1 issn 1082-9873 the 4th international conference of asian digital libraries (icadl 2001) digital libraries: dynamic landscapes for knowledge creation, access and management, december 10-12, 2001, bangalore, india   suzie allard university of kentucky edward a. fox virginia tech highlights: the biggest dl conference so far, with delegates from india, asia and many other parts of the world! a broad program of tutorials, keynotes, invited talks, papers, panels, exhibits, and social events. tasty and diverse food served in a friendly outdoor setting. talented local performers, rich ceremony, incredible generosity, and warm indian hospitality. over 450 pages of proceedings plus 500 pages of tutorial notes.   although icadl 2001 had a distinctly asian perspective, it was a digital library conference of global proportions. more than 600 attendees from 18 countries across four continents spent a packed three days. the conference offerings included: 6 tutorials, 21 invited talks by international experts, 27 refereed paper presentations, 2 plenary sessions, a fascinating panel discussion ("should digital libraries be open?"), and a session with 24 posters. delegates could also browse three exhibition halls filled with vendor demonstrations from 18 companies, and a display of 12 active dl initiatives in india. the conference provided ample opportunities for informal socializing, including many excellent meals as well as an entertaining cultural evening featuring traditional indian dances. the icadl conference series was established four years ago to encourage and support digital library development in asia.
this year's conference highlighted asian achievements and demonstrated the high level of interest in developing more dl initiatives throughout the asian region to promote education, support research efforts in many disciplines, increase information accessibility for citizens, and preserve cultural treasures. the icadl conference focused on three overall themes. the first theme was centered on the technical issues associated with dl development and implementation. speakers discussed general technical issues associated with creating dls and introduced conference attendees to emerging technologies in the areas of information retrieval, architecture for interoperability, and user-centered systems. additionally, 9 vendors showcased new products in extended demonstrations. the second theme revolved around the role of digital libraries in addressing social issues. several sessions focused on specific topics such as medical applications of dls, e-governance and the promise of electronic theses and dissertations in promoting asian scholarship worldwide. other key areas identified were multi-lingual access and the use of multimedia capabilities to address the special needs of illiterate people, particularly to capture the oral traditions of indigenous cultures. the third theme centered on the asian digital library experience. these sessions directly addressed issues of particular interest to countries in the region, especially those with developing economies: creating content, improving infrastructure, increasing connectivity, protecting intellectual property rights, developing pricing models and encouraging an environment for a knowledge-sharing culture. presenters showed how active dl initiatives have begun to form answers to these special needs and how the dl's role may be extended through the utilization of other technologies. the first day of the conference began with six two-hour tutorials presented in two tracks, and although scheduling allowed delegates to attend only three sessions, they received a bound volume featuring materials from all six tutorials. the first track focused on taking the steps to establish a digital library. sessions discussed applications of dls (edward a. fox), how to build dls with open source software (ian witten), and user interfaces and information seeking (gary marchionini). the second track featured specialized dl topics such as employing a knowledge management system (hsinchun chen), personalization and filtering on the web (mike shepherd), and digital information services in enterprises (t.b. rajashekar). all the invited talks and paper presentations were scheduled for the second and third days of the conference and were organized in three concurrent tracks. the vendor exhibition and poster sessions also took place on those two days. each of these sessions provided ample time for questions and discussions, some of which were based on delegates' study of the full proceedings of papers. bangalore was an excellent host city and a fitting locale for the conference. historically known as india's "garden city," bangalore has more recently become recognized as "india's technology capital" because it is the center of the nation's thriving information technology industry. icadl 2001, for the first time, brought together large numbers of conference participants from india's library and information science community, representing a diverse range of information technology professionals.
as noted by the media who attended the conference, icadl 2001 provided an opportunity to launch a dl industry in the region. icadl 2001 was organized by the university of mysore and the indian institute of information technology in bangalore along with their partners: the government of india's nissat, dsir and council of scientific and industrial research; the government of karnataka's department of information technology; and unesco. the chair of the program organizing committee was shalini r. urs. icadl 2002 will be held in singapore with schubert foo serving as chair of the organizing committee. copyright 2002 suzie allard and edward a. fox doi: 10.1045/january2002-fox   d-lib magazine november/december 2015 volume 21, number 11/12   developing best practices in digital library assessment: year one update joyce chapman duke university libraries joyce.chapman@duke.edu jody deridder university of alabama libraries jody@jodyderidder.com santi thompson university of houston libraries sathompson3@uh.edu doi: 10.1045/november2015-chapman   abstract in the face of limited resources and increasing demand for online access to digital library content, we need to strategically focus our efforts and better understand users, impact, and associated costs. however, methods for assessment of digital libraries are not standardized. in an effort to address this crucial gap, the digital library federation assessment interest group has engaged the community over the past year in the development of best practices and guidelines. with this article, the authors provide an update on progress to date and solicit participation in an evolving effort to develop viable solutions.   1 introduction while research and cultural heritage institutions have had an increased focus on providing online access to special collections in the past decade, methods for assessing digital libraries have yet to be standardized. at the same time, assessment has become increasingly important due to limitations in resources and increasing demand from patrons for online access to materials. as discussed in our may 2015 d-lib in brief publication1, the majority of existing research findings in the field cannot be effectively generalized from one software system and institution to another and as a result, many staff in digital libraries are at a loss as to how to begin to assess areas such as costs, impact, use, and usability. in an effort to address this crucial gap and to strategically focus efforts, the digital library federation assessment interest group (dlf aig) has engaged the community in an effort to develop best practices and guidelines in digital library assessment over the past year. this article provides both background information and an update on progress made to date.   2 assessment needs and goals the dlf aig aims to actively develop documentation, tools, and suggested best practices around various areas of digital library assessment.
the goals of this endeavor are both to assist those digital libraries that are unsure of how to assess their assets, and to provide a baseline across institutions to aid in the collection of interoperable metrics for comparative purposes. the areas that have been chosen as foci this year are based on membership interest, and do not cover all areas of assessment. currently, working groups have formed around the key areas of analytics, cost, user studies, and citations. these groups are working on efforts as diverse as developing white papers that discuss gaps in assessment research, creating tools that calculate costs for digitization workflows, and outlining best practices for collecting google analytics data. the dlf aig is currently using matusiak's definition of a digital library as "the collections of digitized or digitally born items that are stored, managed, serviced, and preserved by libraries or cultural heritage institutions, excluding the digital content purchased from publishers."2 the aig began its work by considering two basic questions: what strategic information do we need to collect in order to make intelligent decisions? how can we best collect, analyze, and share that information effectively? the first question is more complex than it initially appears. in 2000, saracevic3 famously divided the context for evaluation into two camps: user-centered context and system-centered context. in his vision, the user-centered levels of criteria were the needs of the community ("social" level), the needs of the organization ("institutional"), the needs of the individual users or groups ("individual"), and the "interface." the system-centered levels were the "content," the software ("processing"), and the hardware, networks and underlying support ("engineering"). to date, however, a majority of the digital library evaluations have been largely focused on the interface, the software, and to some extent the needs of the users. research from the user and usability dlf aig working group has shown that when information professionals have published on the needs of users, they largely address user behavior (29% of articles reviewed), user perceptions (31% of articles reviewed), and the usability of digital library interfaces (32% of articles reviewed).4 while our field's growing dependence on networking and underlying support has become so commonplace as to be understandably overlooked, critical aspects such as content, organizational needs, and the needs of the community have received little attention. moreover, most of the criteria used during evaluation are merely borrowed from the domains of traditional library and information retrieval systems, and may not be effective or appropriate for digital libraries developed for largely unpublished materials. in 2010, ying zhang5 analyzed saracevic's levels, finding multiple aspects, and reviewed the literature to determine which of these aspects had not yet been incorporated into published studies of digital library evaluations. the aspects not yet covered in published studies included ease of use, reliability, integrity, usefulness, collaboration, managerial support, network effect, productivity, interoperability, security, and comprehensiveness. of these unexamined aspects, the first six were rated as top criteria by groups of developers, administrators, librarians, users and researchers zhang interviewed.
by following the methods of evaluation used for traditional library and information retrieval systems, aspects critical to evolving digital libraries and the evolution of user needs are being overlooked. yet digital libraries are no longer in their infancy, and as funding models have increasingly moved from one-time sources for unique projects to continuous funding for sustainable programs, effective assessment is critical to making informed choices with limited resources. in 2004, saracevic stated that: "...there are no more or less standardized criteria for digital library evaluation. several efforts that are devoted to developing digital library metrics have not produced, as yet, generalizable and accepted metrics, some of which may be used for evaluation. thus, evaluators have chosen their own evaluation criteria as they went along. as a result, criteria for digital library evaluation fluctuate widely from effort to effort."6 unfortunately, not much has changed in the past decade, particularly with regards to digitized primary source materials and institutional repositories. development of best practices and guidelines requires a concerted engagement of the community to whom the outcome matters most: those who develop and support digital libraries. with this article, the authors hope to share what progress we have made to date, as well as to increase awareness of this issue and solicit participation in an evolving effort to develop viable solutions.   3 assessment interest group the dlf aig's effort began at the dlf forum in fall of 2013.7 a working session at the forum entitled "hunting for best practices in library assessment"8 was so successful that over 50 participants volunteered to continue the discussion after the conference. the collaborative google document9 created for taking notes during this session was 16 pages long, filled with ideas for how to move forward in three topical areas: demonstrating impact, meeting user needs, and assessing costs and benefits. a second working session the following day on altmetrics10 also drew a crowd. the following spring, dlf hosted a conference call with the presenters of both sessions, and together they established the new dlf aig.11 to facilitate asynchronous discussion, a digital library assessment google group12 was established to provide a space for colleagues to discuss and organize ongoing assessment efforts, which at the time of this writing has over 150 members. over the next few months, the authors developed a digital library assessment framework13, which groups the focus of efforts into three major categories: meeting user needs, assessing benefits, and assessing costs. this framework would be used to guide discussions and working groups in the ensuing months. in the fall of 2014, the co-presenters of the 2013 altmetrics presentation shared a white paper14 on their work and asked for comments and feedback. a few days later at the 2014 dlf forum, a panel of presentations15 by representatives of niso, duke university, university of california at san diego and the university of alabama highlighted the new niso initiative to develop standards for altmetrics16, a new web-based cost estimation tool for digitization17, and both qualitative18 and quantitative19 results from digital library user studies. these presentations were followed by community engagement to further the development of best practices and guidelines for assessment. 
participants self-selected for small group discussions on one of three topics: altmetrics, cost assessment, and user studies. two questions were posed to each small group: what are the critical aspects that we need to address? what are the next steps we can take? these discussions were collaboratively documented online20 and in the wake of this continued interest, four working groups were formed in november 2014 within the dlf aig to work on the development of best practices. working groups are currently centered on analytics, cost assessment, user studies, and citations. these topics were chosen purely based on community interest and not because we believe they encompass all aspects of assessment; the dlf aig welcomes the formation of more working groups on other topics of interest. dlf has established a wiki site21 that is being used by the aig working groups to document resources, best practices, and guidelines as they develop.   4 working groups the primary purpose of the working groups is to develop best practices and guidelines that can be used by the larger community to assess digital libraries in each area; the initial goal for each group was to have viable progress to report at the dlf forum22 in october 2015. below we report on the progress made by the four working groups.   4.1 analytics the analytics working group23 is coordinated by molly bragg and joyce chapman. the goal of the group is to develop best practice guidelines around analytics — primarily google analytics — for assessment in digital libraries; they chose to limit the scope in year one to google analytics because many libraries use this tool, and because the task needed to be scoped in order to be attainable.24 after distributing two drafts of a white paper to the larger dlf aig for feedback and comments in july and august 2015, the working group released a paper entitled "best practices for google analytics in digital libraries"25 in september 2015, which recommends 15 core metrics for baseline collection in a digital library program. the white paper also includes a literature review, theoretical and structural methods for approaching analytics data gathering, examples of platform specific implementation considerations, google analytics set-up tips and terminology, as well as recommended resources for learning more about web analytics. in the paper, each metric includes a definition and explanation of importance, as well as library-centric examples for how to work with the metric in google analytics. future directions for the analytics working group could include standardizing methods for sharing metrics across institutions, clear decision-making around allowing or disallowing web crawler traffic from access counts, reaching further consensus on definitions of access and use, and widening the scope beyond google analytics to include other recommended tools and methods.   4.2 user studies to date, the user studies working group26, coordinated by santi thompson, has compiled resources and drafted literature to assist those who are interested in evaluating users of digital repositories and their needs. the group's work began during the assessment breakout session at the 2014 dlf forum. feedback from the session identified three core areas for the group to focus on over the course of 2015: making usability studies more accessible to librarians; tracking the return on investment for digital libraries; and understanding the reuse of digital library materials. 
the group's first goal was to produce a white paper on the current state of research regarding who is using digital library content and why they seek these materials, organized into the three areas cited above. they began by compiling a bibliography of sources27 that highlight research around usability, return on investment, and reuse. next, they analyzed and synthesized these works to address gaps and assess future needs, developed research terms and "tagged" each article in the bibliography with one or more of these terms, and wrote brief summaries for each article to catch important areas not covered by the tagging process. once the tagging and summarizing concluded, the group analyzed results and used them to identify the strengths and gaps in the current literature in each of the three defined areas and made recommendations for next steps toward the development of best practices. the first draft of the white paper28 was released for comments in october 2015, and a final version will be released in december 2015.   4.3 citations the citations working group consists of a single member, elizabeth joan kelly (loyola university new orleans, monroe library). kelly focused on the task of producing a white paper entitled "guidelines for citing library-hosted, unique digital assets."29 kelly consulted the style manuals for three major citation styles — apa, chicago, and mla — to assess whether they provide direction for citing digitized special collections and institutional repository items. existing citation formats, such as those for archival materials and for digitized web files, were analyzed along with recent developments in data citation standards in order to create recommended citation styles for digital library objects. kelly proposes that the use of uniform citation formats for unique digital assets will "lead to better tracking of use of these assets by hosting libraries" and goes on to state that "in recommending these formats ... it is intended that both traditional citation metrics and altmetrics will better track the use of these digitized special collection and cultural heritage materials and institutional repository content."30 kelly gathered feedback on drafts of the citation guidelines in march and april 2015; a final draft was completed in may and circulated in june 2015. final edits were completed on the document in october 2015 in preparation for dlf.   4.4 cost assessment coordinated by joyce chapman, the cost assessment working group's31 tasks are slightly different from those of the other working groups. the group seeks to aggregate and make freely available a large set of data on the time it takes to perform various tasks involved in the digitization process, in order to assist organizations in digitization project planning and benchmarking. they are also building a digitization cost calculator32 that takes advantage of the contributed dataset to provide digitization cost estimates based on input parameters.
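the arithmetic behind the calculator is straightforward (see note 32): contributed per-item times for a given process are averaged across institutions and then scaled by the number of items in the planned project. the following is a minimal sketch of that averaging approach, not the calculator's actual code; the field names and figures are hypothetical.

# a minimal sketch of the averaging approach described in note 32.
# not the actual digitization cost calculator code; field names and figures are hypothetical.
from statistics import mean

# seconds per item contributed by each institution, keyed by process ("data field")
contributed_times = {
    "image capture (overhead scanner)": [95.0, 110.0, 88.0],   # three contributors
    "descriptive metadata creation (level 2)": [240.0],        # a single contributor
}

def estimated_seconds(process: str, item_count: int) -> float:
    """average the available contributions for a process and scale by item count."""
    return mean(contributed_times[process]) * item_count

# e.g., a planned project of 500 items captured on an overhead scanner
print(estimated_seconds("image capture (overhead scanner)", 500) / 3600, "hours")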
the group began by determining the scope of processes for which time data would be defined and reported via the calculator. they performed a review of existing literature around relevant areas33, including collection of time and cost data for digitization and existing best practices in quality control and metadata creation. they then authored a set of guidelines34 to guide the collection of time data for 20 digitization processes, including eight processes in the original materials preparation phase (e.g., fastener removal, rights review), six processes in the post-processing phase (e.g., cropping images, color correction and tonal adjustment), three processes in the post-preparation phase (e.g., re-binding) and the additional three processes of image capture, descriptive metadata creation, and quality control. three levels were defined for both metadata creation and quality control, based on a review of the literature and existing resources.35 the guidelines were released to the community for comments, and finalized in july 2015. while the original digitization cost calculator was built by chapman as a proof of concept and presented at the 2014 dlf forum, the data definitions authored by the cost assessment working group will inform a modified structure of the calculator with expanded capabilities.36 the group put out a call for data submissions37 in august 2015 via the new data submission form38, and released wireframes39 for the envisioned redesigned calculator. before the new calculator can be built, however, at least one set of data must be submitted for each of the 20 data fields, as well as for each type of image capture device and each level of quality control and metadata creation. the working group encourages each of you to get involved and submit data from your institution!   5 going forward further updates on the dlf aig's progress will be presented at both the dlf annual meeting in vancouver, canada, in october 2015 and at the southeastern library assessment conference in atlanta, georgia, in november 2015. colleagues who are interested in digital library assessment or would like to participate in the continuing best practice development work of the dlf aig are urged to contact the authors or join the digital library assessment google group40 and express their interests. the dlf aig hopes that the work it is undertaking will help the community establish best practices for digital library assessment, which in turn will lead to increased sustainability and effectiveness of digital libraries in the future.   notes 1 "a community effort to develop best practices in digital library assessment". 2 matusiak, k. (2012). perceptions of usability and usefulness of digital libraries. international journal of humanities and arts computing, 6(1-2), 133-147. http://dx.doi.org/10.3366/ijhac.2012.0044 3 tefko saracevic, "digital library evaluation: toward an evolution of concepts," library trends 49, no. 2 (2000): 350-369. 4 see the use and usability working group's topical tagging of 147 articles on digital library assessment (94 articles once irrelevant articles were deleted) published in the past five years here. 5 ying zhang, "developing a holistic model for digital library evaluation," journal of the american society for information science and technology 61, no. 1 (2010): 88-110. 6 tefko saracevic. "how were digital libraries evaluated?" presentation at the delos wp7 workshop on the evaluation of digital libraries in padova, italy, october 2004: 6. 7 digital library federation. "2013 dlf forum: austin, texas." 8 jody deridder, sherri berger, joyce chapman, cristela garcia-spitz, and lauren menges. "hunting for best practices in library assessment" presentation at the digital library federation forum in austin, tx, 4 november 2013.
9 "hunting for best practices in library assessment," a collaborative google document generated during the presentation by the same name, at the digital library federation forum in austin, tx, 4 november 2013. 10 david scherer, stacy konkiel, and michelle dalmau. "determining assessment strategies for digital libraries and institutional repositories using usage statistics and altmetrics." presentation at the digital library federation forum in austin, tx, 5 november 2013. 11 joyce chapman, "introducing the new dlf assessment interest group." blog post on the digital library federation blog, 12 may 2014. 12 "digital library assessment," google group. 13 "digital library assessment framework," 2014. 14 stacy konkiel, michelle dalmau, and dave sherer. "determining assessment strategies for digital libraries and institutional repositories using usage statistics and altmetrics" (white paper). october 2014. http://dx.doi.org/10.6084/m9.figshare.1392140 15 jody deridder, joyce chapman, nettie lagace, and ho jung yoo. "moving forward with digital library assessment." presentation at the digital library federation forum in atlanta, ga, 29 october 2014. 16 national information standards organization. "niso alternative metrics (altmetrics) initiative." 17 joyce chapman. "library digitization cost calculator." 2014. 18 jody deridder, "did we get the cart before the horse? (faculty researcher feedback)," presentation at the digital library federation forum in atlanta, tx, 29 october 2014. 19 ho jung yoo and suhui ho, "do-it-yourself usability testing: a case study from the uc san diego digital collections." presentation at the digital library federation forum in atlanta, ga, 29 october 2014. 20 "moving forward with digital library assessment," a collaborative google document generated during the session of the same name at the digital library federation forum, 29 october 2014. 21 dlf, assessment. 22 digital library federation. "2015 dlf forum: vancouver." 23 members include molly bragg (co-coordinator of working group, duke university), joyce chapman (co-coordinator of working group, duke university), jody deridder (university of alabama), martha kyrilidou (association of research libraries), rita johnston (university of north carolina at charlotte), ranti junus (michigan state university), eric stedfeld (new york university). 24 over 60% of all websites use google analytics: see "piwik, privacy." 25 the white paper can be viewed or downloaded from the dlf aig analytics working group's wiki page here. 26 members include: santi thompson (coordinator of working group, university of houston), joyce chapman (duke university), jody deridder (university of alabama), elizabeth john kelly (university of loyola new orleans), martha kyrillidou (association of research libraries), caroline muglia (university of southern california), genya o'gara (the virtual library of virginia), ayla stein (university of illinois at urbana-champaign), rachel trent (state library of north carolina), sarah witte (columbia university), liz woolcott (utah state university), tao zhang (purdue university). 27 dlf user studies in digital libraries bibliography. 28 the white paper can be viewed or downloaded from the dlf aig user studies working group's wiki page here. 29 the white paper can be viewed or downloaded from the dlf aig citations working group's wiki page here. 30 elizabeth joan kelly. "guidelines for citing library-hosted, unique digital assets," (2015): 9. 
31 members of the working group include joyce chapman (coordinator of the working group, duke university libraries), kinza masood (university of utah), chrissy reissmeyer (university of california at santa barbara), dan zellner (northwestern university). 32 see the beta calculator here. the calculator works by combining and averaging the available data from each contributing institution for a given data field. for example, if three institutions have contributed time data for image capture using an overhead scanner, the calculator will average the three numbers and use that in calculations provided to the user. if only one institution has provided data, that institution's data will be used in calculations for the user. each contributing institution's data is made available in tabular format on the "notes on data" tab to support transparency and ease of use. 33 see the bibliography produced by the working group here. 34 the guidelines and definitions can be viewed or downloaded from the dlf aig cost assessment working group's wiki page here. 35 these levels and definitions can be found in the larger guidelines and definitions document linked from the dlf aig cost assessment working group's wiki page here. 36 see wireframes for the new input and output of the calculator here. 37 see call for data submissions: digitization cost calculator. 38 the form for data submission can be found here. 39 the wireframes can be viewed and downloaded from the dlf aig cost assessment working group's wiki page here, or via a google drive folder. 40 "digital library assessment," google group.   works cited [1] sherri berger, joyce chapman, jody deridder, cristela garcia-spitz, and lauren menges. "hunting for best practices in library assessment." presentation at the digital library federation forum in austin, tx, 4 november 2013. [2] joyce chapman, jody deridder, nettie lagace, and ho jung yoo. "moving forward with digital library assessment." presentation at the digital library federation forum in atlanta, ga, 29 october 2014. [3] joyce chapman, "introducing the new dlf assessment interest group." blog post on the digital library federation blog, 12 may 2014. [4] michelle dalmau, stacy konkiel, and david scherer. "determining assessment strategies for digital libraries and institutional repositories using usage statistics and altmetrics" (white paper). october 2014. [5] michelle dalmau, stacy konkiel, and david scherer. "determining assessment strategies for digital libraries and institutional repositories using usage statistics and altmetrics." presentation at the digital library federation forum in austin, tx, 5 november 2013. [6] jody deridder, "did we get the cart before the horse? (faculty researcher feedback)." presentation at the digital library federation forum in atlanta, ga, 29 october 2014. [7] tefko saracevic, "digital library evaluation: toward an evolution of concepts," library trends 49, no. 2 (2000): 350-369. [8] tefko saracevic. "how were digital libraries evaluated?" presentation at the delos wp7 workshop on the evaluation of digital libraries in padova, italy, october 2004: 6. [9] ying zhang, "developing a holistic model for digital library evaluation." journal of the american society for information science and technology 61, no. 1 (2010): 88-110. [10] ho jung yoo and suhui ho, "do-it-yourself usability testing: a case study from the uc san diego digital collections." presentation at the digital library federation forum in atlanta, ga, 29 october 2014.
about the authors joyce chapman is the assessment coordinator at duke university libraries and co-founder and co-leader of the dlf assessment interest group. she holds an msis from the university of north carolina at chapel hill.   jody deridder is the head of metadata & digital services at the university of alabama libraries and a co-founder of the dlf assessment interest group. she holds an msis and an ms in computer science from the university of tennessee.   santi thompson is head of digital repository services at the university of houston libraries and a co-leader of the dlf assessment interest group. he holds an mlis and ma in public history from the university of south carolina.   copyright © 2015 joyce chapman, jody deridder and santi thompson d-lib magazine march/april 2014 volume 20, number 3/4   bittorrent and libraries: cooperative data publishing, management and discovery chris markman, clark university cmarkman@gmail constantine zavras constantine.zavras@gmail.com doi:10.1045/march2014-markman   (this opinion piece presents the opinions of the authors. it does not necessarily reflect the views of d-lib magazine, its publisher, the corporation for national research initiatives, or the d-lib alliance.)   abstract the evolution of open data depends on the use of new technologies that not only allow equal access to information, but equal access to the distribution and redistribution of public knowledge. an open api offers only the illusion of transparency—for data to truly be free, librarians must look towards their audience as digital collaborators, rather than simply end users. thankfully, the tools to create a global, decentralized, peer-to-peer information network for massive amounts of data have been hiding under our noses the entire time. in this opinion piece we explore the opportunities afforded by the bittorrent protocol. we also discuss what happens when libraries adopt a distributed, grassroots approach to data management that saves money and lays the groundwork for online community.   1. introduction the evolution of open data depends on the use of new technologies that not only allow equal access to information, but also to the distribution and redistribution of public knowledge. an open api offers only the illusion of transparency and does not replace the need for direct access to data by humans acting as digital collaborators rather than end users. in this opinion piece we will explore why we believe this is true, as well as several opportunities afforded by the bittorrent protocol your library could be implementing today—not tomorrow. to fully understand how this technology could benefit libraries, the terminology and concepts must be clearly understood. at the basic level bittorrent is a communication protocol that allows for the transfer of files from multiple sources and is known as "peer-to-peer" (p2p). this type of file sharing protocol allows any computer, known as the "peer," to act as a client or server for other computers, via the internet. what this means is that files can be shared directly between computers without the need for a central server. all that is required to join one of these systems is a connection to the internet and peer-to-peer software, a very low bar for implementing the technology.
[detail image of infographic designed by e.j. fox, visual.ly 1] bittorrent is one of the fastest and most efficient of the peer-to-peer systems. it differs from many other p2p networks in that it relies on a central server, called a tracker, to coordinate part of the transfer. the tracker does not hold the file itself; it holds a small metadata record describing the file and the peers that are sharing it. anyone looking for a file using bittorrent merely has to click on a torrent link in a browser. the bittorrent software communicates with the tracker in order to find other computers that are running bittorrent and that have either the complete file or a portion of it. the computers with complete files are known as "seed" computers, and those with a portion of the file are typically other peers in the process of downloading it. [detail image of infographic designed by e.j. fox, visual.ly 2] the tracker identifies all the computers that have pieces of the file; together these computers are called the swarm. pieces are traded among the connected computers, allowing you to receive multiple pieces of the file simultaneously. by downloading the file in segments from multiple systems, the overall speed is greatly improved, and the more computers that are involved, the faster the file is downloaded. if many libraries, or many computers within a library, were running bittorrent, moving data among them would become much easier. with such an effective system for transferring information, why haven't libraries looked into it as a way to manage data? a major reason has been the recent rise of negative publicity in social media and technology circles. this has become particularly relevant due to the spotlight on pirate bay and new enforcement initiatives such as the "six strikes system" and other similar measures.3   2. the p2p controversy with the roll-out of the "six strikes" initiative that went live in mid-february of 2013, isps are cracking down on illegal file-sharing more than ever, primarily when it utilizes bittorrent. programs such as this are problematic in that bittorrent has many legitimate uses for the dissemination of information. many legal trackers exist, which we will discuss later, but such initiatives stifle the use of this technology no matter what the purpose. pirate bay is the most recognized purveyor of bittorrent trackers, legal and illegal. many of their founders are currently embroiled in legal issues on a global scale4. the sensationalism that has been traveling the news circuit has given rise to a stigma associated with conversations about bittorrent. while pirate bay may be the face of bittorrent for the general public, the technology behind it is much older and has many applications beyond the sharing of illegal files. separating the technology from the misinformation that has come to surround it is important when assessing the uses bittorrent can have in libraries. beyond the technical capabilities of the bittorrent protocol, which prove to be robust, there is another dimension to p2p exchange that has yet to be explored. that is, the fundamental shift which takes place between library and patron in this data exchange—the "invisible" transaction taking place between the end user's computer hardware, monthly internet and electric bill, and your collection's total bandwidth capability. introducing bittorrent into your library's information ecosystem is not only a potential cost saver, but the first step toward building an online data community as well.
3. content and discovery the web-based bittorrent tracker interface has much in common with a typical online library catalog. be aware that the adoption of bittorrent should not be viewed as a replacement for a catalog but rather as an inherently multiplatform, tcp/ip based data service that operates at the application layer5. it does not change anything you know and love about digital libraries and online collections—it just delivers content faster and more efficiently by bypassing network bottlenecks through the utilization of "client" systems6. adapting bittorrent technology to suit the needs of libraries provides several opportunities outside this context (discussed in the next section of this article) but the important point to consider in all cases is bittorrent's ability to suit the needs of all types of data in a network-enabled environment. not only that, but as the amount of digital content grows, bittorrent is a scalable way of pushing out content. also consider the concept of "small data" as opposed to "big data" in this context7. while it may seem that bittorrent is more conducive to sharing extremely large data sets in the hundreds of gigabytes, it can be easily used to pass small data around quickly to millions of people. while it makes sense for the initial set up and ongoing maintenance of your bittorrent tracker to fall under the responsibilities of your neighborhood systems librarian, there are some benefits in joining a preexisting public tracker because an online community is most likely already there. this is important to consider because high-traffic public trackers often have faster download speeds (exposing your content to a larger pool of potential seeds and peers = more bandwidth). there are, of course, many reasons to go the opposite route and deploy a dedicated bittorrent tracker for your library as well8. an often overlooked feature embedded within a tracker allows it to easily double as a discovery system for librarians and patrons alike. that is, bittorrent "swarm" statistics, which measure individual user upload/download ratios and/or the "popularity" of a given torrent file, are often captured automatically by the tracker software—functioning as a sort of "google analytics" for your digital library collection9. with the recent addition of "magnet links" in the bittorrent toolset, an enterprising systems librarian wouldn't even need to set up a tracker at all10. such an implementation could easily be automated to insert magnet links into a given set of catalog records. this is not to say you should immediately begin uploading files to the pirate bay, even though it is arguably the epicenter of legal and illegal bittorrent activity online. there are in fact many bittorrent trackers that deal exclusively with legal content. one popular example, etree.org11, deals exclusively with live concert recordings (when the artist or musical group permits it). another, clearbits.net12, trades only "high quality, open-licensed (creative commons) digital media, datasets, and artwork for content creators"13. [the most notable example of bittorrent and libraries currently exists in the digital public library of america's index of material from the internet archive.]
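to make the magnet-link suggestion above concrete, here is a minimal sketch (not a production implementation) that builds magnet uris and attaches them to a handful of catalog records; the record structure, the info hashes and the tracker url are all hypothetical placeholders.

# a minimal sketch of attaching magnet links to catalog records, as suggested above.
# not a production implementation; record fields, info hashes and the tracker url are hypothetical.
from urllib.parse import quote

TRACKER = "http://tracker.example-library.org:6969/announce"  # hypothetical tracker

def magnet_uri(info_hash: str, display_name: str, tracker: str = TRACKER) -> str:
    """build a standard magnet uri from a torrent's hex-encoded sha-1 info hash."""
    return (f"magnet:?xt=urn:btih:{info_hash}"
            f"&dn={quote(display_name)}"
            f"&tr={quote(tracker, safe='')}")

# hypothetical catalog records; the info hash would come from the .torrent file
# created when the digital object was packaged for distribution.
records = [
    {"id": "dl:0001", "title": "field recordings, 1968",
     "info_hash": "c12fe1c06bba254a9dc9f519b335aa7c1367a88a"},
]

for record in records:
    record["magnet"] = magnet_uri(record["info_hash"], record["title"])
    print(record["id"], record["magnet"])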
4. opportunities for libraries while the benefits of bittorrent have long been known to its users (they are, after all, both ps in p2p), there has been little discussion in the it world about the possible benefits of utilizing the unique characteristics of the bittorrent protocol in heterogeneous information ecosystems. the fact that bittorrent software is both multi-platform and open source creates stability for long-term sustainability. as a technology with over a decade of development behind it, bittorrent has reached a maturity level ripe for specialization in library systems, and when viewed through the lens of a librarian, bittorrent also offers many unique opportunities14.   virtual teams online collaborations, especially those taking place between members of different institutions, can be difficult to efficiently maintain in terms of data management. as your files become larger, free online services like dropbox.com or google drive that can handle them become harder to find. alternatively, with bittorrent your entire team could be sharing the same data in a few clicks, while the software actively utilizes your collective computing power and bandwidth. there are currently two bittorrent side projects in development under the "bittorrent labs" banner at bittorrent.com that aim to do precisely this. soshare streamlines the bittorrent sharing process to the point where users can instantly share files with a single click from their desktop through the use of a browser plug-in—very useful for one-off file sharing needs, and they claim to support files as large as a terabyte15. similarly, "bittorrent sync", which was recently launched as a public beta, can "automatically sync files between computers via secure, distributed technology"16. the major difference between bittorrent sync and traditional bittorrent sharing is the use of shared "secrets", similar to pgp encryption17.   software development and deployment milliseconds matter a great deal in the world of twitter. that's why murder, their bittorrent-based deployment system, was created18. essentially, murder utilizes bittorrent by treating code as data19. although the speed of murder is a notable advantage, that's not why libraries and archives might find this bittorrent use interesting. murder is not only completely customizable and open source20 but it's essentially designed to make the job of systems maintenance easier and is optimized, by design, for local area network architecture. [screenshot of slide image from http://vimeo.com/11280885.]   diy lockss the "lots of copies keep stuff safe" (lockss) mantra is another way of saying that too many backups are better than too few21. with bittorrent any institution can grow its own lockss system with minimal up-front cost through crowd-sourcing or the strategic purchase of vpn servers (on multiple continents no less). coincidentally, the same systems that have enabled media piracy to evade copyright law could also enable cultural institutions to implement data disaster planning for a fraction of the cost. bittorrent tracker software also offers the opportunity for libraries to reward their more "dedicated" seeds/users. this can be done by tracking download/upload ratios, creating the potential for online community building, and marketing—a decentralized, organic, "virtual timeshare" system that runs entirely on donated bandwidth and hardware from users.
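the diy lockss idea above ultimately rests on being able to verify that the scattered copies are still identical. a minimal fixity-check sketch, assuming hypothetical locally mounted mirror paths, might look like the following; a real deployment would compare against checksums recorded at ingest rather than trusting any one live copy.

# a minimal sketch of a lockss-style fixity check across multiple copies.
# the directory paths and file name are hypothetical.
import hashlib
from pathlib import Path

COPIES = [Path("/mnt/mirror-a/collection"), Path("/mnt/mirror-b/collection")]

def sha256_of(path: Path) -> str:
    """stream a file through sha-256 so large objects do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_copies(relative_name: str) -> bool:
    """return true if every copy of the named file yields the same checksum."""
    checksums = {sha256_of(copy / relative_name) for copy in COPIES}
    return len(checksums) == 1

print(check_copies("plankton_0001.tif"))  # hypothetical file name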
streaming live video bittorrent live, a new feature as of 2013, enables any seeder to broadcast video streams across the world in a way that avoids the pitfalls of many competing online streaming systems. where other systems might buckle under a heavy user load, bittorrent live scales instantly because each new peer is also a seed. bittorrent live is still very new, but it will undoubtedly play a critical role in conference events and library programming in the future. [screenshot of live traffic cam from http://live.bittorrent.com/.]   streaming old video although not legal under us copyright law, libraries in other parts of the world could easily create a netflix-like online streaming video service using bittorrent live. this is great news for public and academic libraries that have already invested in large dvd collections, but do not want to negotiate new streaming licenses with video distributors. this is unlike the previous example of self-publishing because the number of "live users" could be limited by software to never exceed the total number of dvd copies of a particular title owned by a given library—similar to other online reservation systems for digital objects like overdrive. it's important to note that while section 108 currently grants libraries and archives permission to break dvd encryption under special circumstances, this does not cover the process of format shifting an entire work for online consumption22.   something completely different the greatest opportunity bittorrent presents for libraries may in fact be a unique combination of all the features and functionality previously discussed: lockss, self-publishing, and live video being one example. these features could easily be combined to create a self-sustaining video archive that scales automatically to create a global broadcast hub—all in a way that takes full advantage of the inherently decentralized infrastructure of the web. [image from http://labs.bittorrent.com/index.html.]     5. conclusion in 2012 the internet archive began serving up over a million files from its collection via bittorrent. internet archive founder brewster kahle notes in the initial blog post that the "distributed nature of bittorrent swarms and their ability to retrieve torrents from local peers may be of particular value to patrons with slower access to the archive", especially for "those outside the united states or inside institutions with slow connections"23. this opinion piece has explored a number of ways that bittorrent could be useful to libraries. the integration of bittorrent, both internally and externally, could be a huge boon. many libraries currently belong to networks such as college circuits, public library consortiums, and other insular groups. the infrastructure and content are already in place, and bittorrent could help bridge the informational and geographical gap between various libraries without extra cost to the institutions. many of the controversies that surround bittorrent result from its portrayal in the media and its illegal use, combined with the public's lack of understanding of the protocol. the "six strikes" initiative, and others, have treated the bittorrent protocol as a single application used solely for piracy in the way napster was, and disregard the fact that there are many legal uses that are being curtailed. the truth is that the technology itself is sound and has myriad legal applications in the library world, as has been discussed above. bittorrent is useful in areas of low and high technology.
it allows the transfer of information slowly in places with limited resources, and the transfer of large quantities of data in high tech places that need quick and efficient movement of information. even in developing countries and parts of the us that do not have fast internet connections or large resources for libraries, bittorrent can be used to send information effectively. the us is rich in repositories of information, as are many other nations. using bittorrent in partnership with foreign institutions could help disseminate information to parts of the world with less access to cutting edge computer technology. the patrons' computers could build their own virtual networks without the need for pre-existing setups. the global ramifications of this technology are staggering and it could begin a new era of information literacy for places with no national libraries or information sharing framework. bittorrent could provide libraries with a way to quickly serve information to the public as well as making their infrastructure and operations more efficient. the ability to roll out updates and transfer information would prove invaluable, as would the backing up of materials. all that is needed is for the library community to embrace the technology and all the good that could come with it. the bittorrent protocol is over a decade old and yet the internet archive is the only high profile library currently utilizing this distribution technology. libraries that adopt bittorrent can not only improve download speeds for patrons but in doing so cut bandwidth costs and enrich the online community at large. only when more institutions with vast digital collections and busy it staff become willing to take the time to transform this once stigmatized protocol will we be able to comprehend its full potential. the latest developments from bittorrent labs both confirm this and point towards several new applications that fully leverage the potential of decentralized network computing power.   6. notes 1 a history of bittorrent, e.j. fox, designer, visual.ly, may 2011. 2 ibid. 3 the popular bittorrent news blog, torrentfreak, is an excellent source for news on the us's "six strikes" policy and in june 2013 published the "copyright alert" materials sent to isp customers. 4 the pirate bay's legal battle(s) are the subject of a crowd-funded 85 minute documentary available to view for free online. 5 see osi model, wikipedia, the free encyclopedia, 8 february 2014. 6 this is of course a misnomer because client systems in the bittorrent information ecosystem are also servers—the basis of any "p2p" system. 7 dr. rufus pollock of the centre for intellectual property and information law and the university of cambridge talks about the meaning of "small data" in this post on the open knowledge foundation blog. 8 the wide range of bittorrent tracker software and features can be viewed here. 9 to read more about google analytics and the software's capabilities see introduction to google analytics. 10 magnet links are basically a link containing the hash value of a torrent file, originally used as a way to subvert online anti-piracy measures which often first target illegal bittorrent tracker websites (when a tracker is shut down, its .torrent files no longer work). for a basic overview see lifehacker, what are magnet links, and how do i use them to download torrents?. 11 etree's hosting is provided by ibiblio, a familiar name in the world of libraries.
ibiblio also manages another bittorrent tracker called terasaur in collaboration with the school of information science, the school of journalism and mass communication, and information technology services at the university of north carolina at chapel hill. 12 formerly known as legaltorrents.com—there is an interesting article on their blog that explains why they made the name change. 13 as of january 23, 2014, while this article was in preparation for publication, clearbits.net shut down permanently. 14 many of the following examples are echoes of network security professional joe stewart's paper titled 'bittorrent and the legitimate use of p2p', which was presented in a panel discussion held by the forum on technology & innovation in washington, dc. stewart offers similar findings in a more general context. a version of the presentation is available on his personal website. 15 soshare is no longer being actively developed, and many of its features have been replaced by bittorrent sync and another project, paddle over. 16 see bittorrent sync. 17 what is pgp? see wikipedia's pretty good privacy description, and the bittorrent sync faq for more information. it's also worth noting this is different from the traffic encryption already supported by most bittorrent client software. 18 "murder" being an uncommon term for a flock of crows. 19 for the gory it details on the project look no further than twitter infrastructure engineer larry gadea's presentation in 2010 at the canadian university software conference. 20 the code and documentation are freely available on their github project page. 21 for those looking for a more official definition, see the society of american archivists website. 22 us copyright office. "revising section 108: copyright exceptions for libraries and archives". february 2013. 23 see internet archive blogs. "over 1,000,000 torrents of downloadable books, music, and movies". august 2012.   7. references [1] "comparison of bittorrent tracker software." wikipedia, the free encyclopedia. web, accessed march 2014. [2] fox, e.j. "a history of bittorrent." visual.ly. web, accessed march 2014. [3] kahle, brewster. "over 1,000,000 torrents of downloadable books, music, and movies." internet archive blogs, 7 august 2012. [4] klose, simon. tpb afk: the pirate bay away from keyboard. film, 2013. [5] "osi model." wikipedia, the free encyclopedia. web, accessed march 2014. [6] pollock, rufus. "what do we mean by small data." open knowledge foundation. 23 april 2013. [7] stewart, joe. "bittorrent and the legitimate use of p2p." washington, d.c., 2004. web, accessed march 2014. [8] twitter — murder bittorrent deploy system. film, 2010. [9] "what are magnet links, and how do i use them to download torrents?" lifehacker. web, accessed march 2014.   about the authors chris markman is the resource library coordinator for the visual & performing arts department at clark university where he manages a multimedia research collection. his interests include 3d printing, intellectual property law, experimental video, and digital curation.   constantine zavras is a freelance technical writer, editor, and data specialist. he worked on data aggregation and cataloging projects at ita software as a data engineer and domain team lead. his interests include electronics, open source advocacy, library digitization, and information literacy.
copyright © 2014 chris markman and constantine zavras   d-lib magazine september 2000 volume 6 number 9 issn 1082-9873 directories of libraries and related organizations   judith pearce national library of australia [email protected] summary this paper describes data standards work currently in progress to support the deployment of directories of libraries and related organizations in a networked information services environment. several interconnected directory-related projects are described that are contributing to development of these standards through the implementation of services that address gaps in the australian resource sharing infrastructure. the relationship between this work and other data modeling initiatives is reviewed and some interesting conclusions drawn. introduction a quick glance under "directories" in the digital librarian [1] reveals that numerous directories of libraries and related organizations are available on the internet, ranging from lists of web sites and web-accessible catalogues to detailed entries that may have been collected originally for a print publication. their purpose is to enable the discovery and location of the organizations themselves, as well as information about their collections and services. directory information like this has an important role to play in a networked information services environment. however, for this role to be fully realized the information will need to be made accessible through standard protocols that can extract data relevant to a given process. moreover, the information itself will need to be shareable to minimize the need for the same updates to be reported by an organization to multiple directory services. currently, a library may need to register with and provide similar details to multiple online directories as well as to applications such as bibliographic utilities, union catalogues, interlibrary resource sharing systems, conspectus databases and directories of z39.50 targets. definitions in this paper, the term directory is used to mean a collection of information about parties. the term party is taken from the indecs data modeling initiative [2]. it is used here to mean a person or organization that may perform a range of roles over time as an agent in the making and use of resources. the primary parties of interest are libraries and other information service providers, museums, archives, commercial document suppliers, resource sharing consortia and interlending utilities. the primary role of interest is that of resource provider. the term resource is used to mean any item that is a manifestation of an expression of a work as modeled in ifla’s functional requirements for bibliographic records (frbr) [3]. the term collection is used in the sense defined by heaney [4] to refer both to aggregations of resources and to aggregations of resource descriptions. the national library of australia’s manuscript collection is a collection of resources. the register of australian archives and manuscripts (raam) is a collection of resource descriptions. the term service is used in the sense defined by the pride project [5] to refer to a system (manual or automated) that provides access to a collection. the web gateway to the register of australian archives and manuscripts [6] is an example of an online service accessible through the http protocol.
the raam z39.50 target currently under development will be another online service providing access to the raam collection through the z39.50 protocol. interlibrary loan / document delivery services providing access to resources described in raam are examples of offline services. union catalogues as directories raam is often thought of as a directory because it is a union catalogue. a union catalogue describes resources held by more than one organization, by linking resource descriptions to the holding organization through one or more holdings statements. in the z39.50 holdings schema [7] this is done through a sitelocation entity which is used to report identity and location information about the holding organization or site. in addition, data elements in the schema have been defined that allow collection use and access policy information (including service and copyright fees) to be included as part of the holdings statement where the data is available or appropriate. at the simplest level of reporting, the holdings statement need only be a link from the resource description to an organization identifier such as a union catalogue symbol. higher levels of reporting or display may require a client or server to be directory-enabled. by this is meant the ability to query collections of directory information that may or may not be maintained on the same system, in order to obtain information needed to support a given process. as well as populating holdings statements, there are other use cases needing to be supported in the union catalogue environment that depend on searching directory information. these include the need to limit a search of a union catalogue (or the display of holdings statements or a list of potential resource providers) to holding organizations in a given region or of a given type or belonging to a given consortium or supporting a given service level. australian directory projects in australia there is a well developed resource sharing infrastructure based on contribution of holdings information to a national bibliographic database and conformance with an australian interlending code. key players in the australian library community had been aware of the need to address the role of directory services in this infrastructure for some time. it emerged as a priority task following completion of several projects that have significantly enhanced the technical infrastructure. these include: implementation of an australian libraries gateway with an online update capability enabling libraries to take responsibility for maintaining their own entries. migration of the australian national bibliographic database to a z39.50 server (amicus). migration of the national interlibrary lending utility (kinetica document delivery) to a system supporting exchange of requests via the iso ill protocol (fretwell-downing's olib vdx). customisation for deployment within the australian university sector of a local interlending and document delivery administration system (liddas) based on the iso ill protocol and fretwell-downing's olib vdx. for more on these projects, see mclean [8]. work is also proceeding to address the access management aspects of information service delivery in the academic and research environment. the pride project has been testing a directory-enabled systems architecture that will support the development of local information service management systems providing unmediated access where possible to a range of local and external information services. 
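the kind of directory-enabled behaviour described above — limiting the display of holdings statements or a list of potential resource providers to organizations in a given region, of a given type, belonging to a given consortium or supporting a given service level — can be sketched in a few lines of python. the entries and attribute names below (state, org_type, consortia, ill_service_level) are invented for illustration and do not reproduce the actual australian libraries gateway or pride schemas.

# illustrative directory entries for holding organizations (attribute names are hypothetical)
directory = [
    {"symbol": "ANL", "name": "national library of australia", "state": "act",
     "org_type": "national", "consortia": ["national network"], "ill_service_level": "rush"},
    {"symbol": "XU", "name": "example university library", "state": "vic",
     "org_type": "academic", "consortia": ["caul"], "ill_service_level": "standard"},
    {"symbol": "XP", "name": "example public library", "state": "qld",
     "org_type": "public", "consortia": [], "ill_service_level": "standard"},
]

def filter_providers(entries, state=None, org_type=None, consortium=None, service_level=None):
    """return the holding organizations that satisfy every criterion that was supplied."""
    matches = []
    for entry in entries:
        if state and entry["state"] != state:
            continue
        if org_type and entry["org_type"] != org_type:
            continue
        if consortium and consortium not in entry["consortia"]:
            continue
        if service_level and entry["ill_service_level"] != service_level:
            continue
        matches.append(entry)
    return matches

# e.g. restrict a list of potential resource providers to academic libraries offering a standard service level
for entry in filter_providers(directory, org_type="academic", service_level="standard"):
    print(entry["symbol"], entry["name"])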
kerry blinco [9], macquarie university's representative on the pride project, has developed an information services model which illustrates the role of directory systems in this architecture [figure 1]. macquarie university has successfully used this model as a tool for identifying and addressing gaps in the technical infrastructure in projects such as the jedds [10] and liddas projects [11]. to further the goals of the pride project, they are collaborating with the national library of australia to set up an x.500 test bed server containing directory entries for the libraries participating in the australian national resource sharing network. the information in this directory will be derived from entries maintained online by the organizations themselves through the australian libraries gateway [12]. the australian libraries gateway is being extended as part of this process to incorporate data currently stored in a separate interlibrary resource sharing (ilrs) directory [13] under the same framework. the australian libraries gateway will not itself be x.500-enabled at this time but the project will test the extent to which migration of the service to the x.500 suite of standards is feasible and the issues that would be involved. the pride project has developed a pride synchronization agent that will enable libraries with liddas systems to download directory information from the x.500 test bed server for use in local interlibrary lending and document delivery transactions. a similar synchronisation of directory data is planned between the x.500 test bed server and the national interlibrary lending utility (kinetica document delivery). the intention is to enable this information to be maintained once only through the australian libraries gateway and to be shared between applications engaged in resource sharing activities. the data modeling stage of this exercise has been completed and work has now begun on development of the new service. data modeling outcomes a key outcome of the data modeling stage has been development of a set of object classes within the pride x.500 directory schema that address the specific requirements of resource sharing directory systems to share data and to automate processes such as filtering of resource providers. in addition the data model for the australian libraries gateway has been re-engineered to provide an overarching directories framework that the national library of australia is planning to deploy for all of its directory services, both internal and external. a core set of entities are defined that enable organizations to be registered and described. this information is shareable between directory services. there is a linking entity that allows information relevant to a specific organizational role such as service administrator to be assigned to a directory collection. collections currently supported are the australian libraries gateway database and the national directory of australian oral history collections [14]. the new ilrs directory will form a third collection. important issues addressed during the data modeling stage included: the need to revise the existing data model for the australian libraries gateway to support a hierarchical directory structure. (business rules for parent-child relationships had to be strictly enforced to enable organizations to be identified uniquely by their distinguished name. this required detailed analysis of the other kinds of relationships between organizations that needed to be supported in an unambiguous way.) 
the need to separate information ubiquitous to an organization from information specific to a given role. (in the model, identity, relationship and type are considered to be properties of an organization. location is treated as a property of an organizational role, enabling different contact details to be recorded for each role.) the need to enable the ilrs directory to operate as a directory of national union catalogue symbols. (identity encompasses the distinguished name of an organization and a range of organizational aliases, including union catalogue symbols. union catalogue symbols are inherited by collections, locations and services through the organizational role.) the need to resolve ambiguities between collection use policies and access policies. (it was clarified that collection use policies are implicit properties of a collection. they specify whether resources of a given type are lent or copied and under what terms and conditions. access policies specify classes of user to whom services are provided and under what terms and conditions.) the need to express the complex service parameters and charges informing interlibrary lending and document delivery transactions in a machine-readable form that would support changes in business and service models over time. (a service parameter object class was defined that will enable conditions and charges to be recorded for a flexible range of service parameters, including request, copy, delivery, billing and payment methods, service level, requester location and copyright conditions.) the draft interlibrary loan directory developed for the north american interlibrary loan and document delivery (nailld) project [15] provided a valuable cross reference when developing the service parameter object class. iso 2146 project the modeling work for the pride / ilrs directory projects also informed the first working draft of the new version of iso 2146, a standard first published in 1972 to assist in compiling and publishing directories of organizations operating in the library and information domain. this standard is currently undergoing review by iso tc46/sc4 wg7 [16]. it is planned to issue a committee draft for comment by the end of the year. the iso 2146 review has its origins in the standards based modeling approach being practiced by macquarie university, ukoln and others. clearly articulated service models such as the one illustrated in figure 1 enable definition of the standards and applications to be used across boundaries. where gaps are identified, duplication of effort and national investment can be minimized and common tools provided by working to develop or enhance standards to address these gaps. in the case of australian resource sharing directory services, iso 2146 sounded like an appropriate standard on which to base application development, but in its current state, it was not suitable. it is tied to a specific syntax. there is no underlying data model. data element groups are repeated for organizational subunits. issues such as the difference between collection and service remain unresolved. in addition, substantial revisions are needed to the data elements themselves to take into account the new use cases needing to be supported in a networked information services environment. 
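a minimal sketch of the data modelling outcomes listed above: organizations identified by a distinguished name built from parent-child relationships, contact details attached to an organizational role rather than to the organization, union catalogue symbols inherited through the role, and a simple service parameter record. the class and attribute names are hypothetical and are not taken from the pride x.500 schema or the australian libraries gateway data model.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Organization:
    name: str
    org_type: str                       # identity, relationship and type are properties of the organization
    symbol: Optional[str] = None        # union catalogue symbol (one kind of organizational alias)
    parent: Optional["Organization"] = None

    def distinguished_name(self) -> str:
        # strictly enforced parent-child relationships give each organization a unique distinguished name
        if self.parent is None:
            return f"o={self.name}"
        return f"ou={self.name}," + self.parent.distinguished_name()

@dataclass
class OrganizationalRole:
    organization: Organization
    role: str                           # e.g. "resource provider", "service administrator"
    location: str                       # location and contact details belong to the role, not the organization

    @property
    def symbol(self) -> Optional[str]:
        # union catalogue symbols are inherited through the organizational role (nearest ancestor that has one)
        org = self.organization
        while org is not None and org.symbol is None:
            org = org.parent
        return org.symbol if org is not None else None

@dataclass
class ServiceParameter:
    parameter: str                      # e.g. "delivery method", "service level", "copyright condition"
    value: str
    charge: float = 0.0
    conditions: str = ""

nla = Organization("national library of australia", org_type="national library", symbol="ANL")
supply_unit = Organization("document supply", org_type="organizational unit", parent=nla)
provider = OrganizationalRole(supply_unit, role="resource provider", location="canberra, act")
rush_copy = ServiceParameter("service level", "rush", charge=25.0, conditions="copies only")

print(provider.symbol, "|", supply_unit.distinguished_name(), "|", rush_copy.value)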
the scope and field of application of the existing standard is as follows: this international standard is intended to assist in compiling directories of libraries, archives, information and documentation centres, including a description of their database services. the rules of this standard concern international directories, national directories published in bi- or multilingual countries and national directories intended for international use and published in one or more foreign languages and national directories. the standard should be used for the collection of the appropriate data and for the publishing of directories as well as for the exchange of data. the proposed new scope and field of application for the standard is: this international standard is intended to assist in compiling directories of libraries and related organizations, including a description of their collections (including databases), services and activities. the rules of this standard concern directories operating in an international distributed network environment. the standard should be used for: collection of the appropriate data. publishing of international directories, national directories published in bi- or multilingual countries and national directories intended for international use. exchange of directory data. provision of access to directory data as part of resource discovery, access management or document delivery processes. archiving of directory data when the data exists only in electronic form. related projects resource sharing directory services are just one of the applications to which the iso 2146 standard will apply. the aim is to build on the directory entities defined to support resource sharing applications by testing the model against a range of other standards and projects. three such projects are described here. iso 8459 the iso 2146 revision is being conducted in partnership with a project to consolidate the iso 8459 series of standards into an open-ended object model based on the ifla frbr. it is planned to express the semantic and logical relationships between data elements as rdf classes and properties and their syntax and structure as xml structures. iso 8459 is a 5-part standard developed by iso tc46/sc4 wg7 to provide a comprehensive data element directory for library applications. it includes the data elements required to support interloan, acquisitions, information retrieval and circulation applications as well as data elements for the exchange of cataloguing and metadata between systems. earlier parts of the standard will be reviewed as part of the consolidation process. consolidation of the standard should give iso 8459 greater visibility as a conceptual framework for development of library applications. it is not yet clear whether the iso 2146 object classes will be an integral component of the iso 8459 model or a modular extension. ipig directory schema the ill protocol implementors' group (ipig) is currently engaged in revising the nailld draft interlibrary loan directory in the form of an xml schema. an important driver for this project is the need for a directory of ill protocol implementors. the ipig work should contribute significantly to development of the iso 2146 model in this area. there are similar needs emerging in relation to the data requirements for directories of z39.50 targets.
in both cases, information supporting identification of the organizations themselves should be shareable with other directory services allowing modeling effort to be focused on data requirements unique to the protocol directory service. simple collection description heaney's analytical model of collections and their catalogues cited above [4] has made an important contribution to the conceptual framework within which iso 2146 is being revised by treating collections of resources and collections of resource descriptions as sub-types of collection. this was one of the early outputs of the rslp collection description project [17], which builds on earlier ukoln work to address the need for machine-readable descriptions of searchable resources. the rslp collection description schema bears an interesting similarity to the z39.50 holdings schema, for the good reason that it is addressing the same need to link resource descriptions to directory information. the schema defines a set of attributes "used to describe a collection, its associated location(s) and any associated agent(s)." the data elements used for agents are derived from heaney's model and parallel work in progress to develop an agent metadata set by the dublin core metadata initiative [18]. location is defined as "the physical or online (digital) place where a collection is held". a locator attribute enables direct access to digital locations. opportunities are provided in free text data elements to record any access restrictions placed on the collection and on the location. in the draft report from the elib collection description working group which contains an early draft of the schema [19], data elements are divided into those that describe a collection and those that describe a service providing access to the collection. this more closely reflects the pride project's distinction between collection and service. there is also more granularity in the data elements provided to record service access conditions. for the rslp collection description project a decision must have been made to dumb down the service aspects of the simple collection description. complex data structures would need to be supported to make this information sufficiently granular to address the pride project's goal of unmediated access. as "a schema for making relatively simple collection descriptions in a wide variety of contexts -a 'dublin core' for collection description" [20] it is not intended to be a replacement for richer directory schemas where they are needed; for example, as part of a database hosting or aggregation service. conclusion this paper has attempted to describe the directory data standards work currently in progress through the pride, ilrs directory and iso 2146 projects and to place this work within a broader context. the need for directory data standards has not been well understood outside the resource sharing community. a driver within that community has been the need to develop sound business models and performance standards for interlibrary loan / document delivery services. there is also a need for resource discovery services that enable the discovery of expressions of works regardless of format and that do not limit the user's options to digital manifestations. this depends on the development of gateway and provider systems that invoke appropriate underlying software and protocols to enable access to resources that are not immediately available online. 
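the 'dumbing down' just described can be pictured as a one-way mapping from a richer directory record to a simple collection description, in which granular, machine-actionable service parameters are flattened into a free-text access conditions statement. the field names below are loosely modelled on the attributes mentioned above (collection, location, locator, agent) but are not the actual rslp schema, and the figures and url are invented.

# a hypothetical rich directory record with granular, machine-actionable service data
rich_record = {
    "collection": "register of australian archives and manuscripts",
    "agent": "national library of australia",
    "location": "canberra",
    "locator": "http://example.org/raam/",           # placeholder online access point
    "services": [
        {"type": "copy", "fee": 16.50, "currency": "AUD", "turnaround_days": 5},
        {"type": "loan", "fee": 0.00, "currency": "AUD", "conditions": "member libraries only"},
    ],
}

def to_simple_collection_description(record):
    """flatten granular service parameters into a free-text access statement."""
    notes = []
    for s in record["services"]:
        fee = "free" if s["fee"] == 0 else f"{s['fee']:.2f} {s['currency']}"
        extra = s.get("conditions") or (f"within {s['turnaround_days']} days" if "turnaround_days" in s else "")
        notes.append(f"{s['type']}: {fee}" + (f" ({extra})" if extra else ""))
    return {
        "title": record["collection"],
        "agent": record["agent"],
        "location": record["location"],
        "locator": record["locator"],
        "access_conditions": "; ".join(notes),        # free text: no longer machine-actionable
    }

print(to_simple_collection_description(rich_record)["access_conditions"])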
the australian libraries gateway experience shows that libraries and related organizations are keen to take responsibility for the maintenance of information about their collections and services when there are efficient processes in place to do so. over 4,500 libraries now have entries in the australian libraries gateway database. within this framework it is relatively easy to add new data elements or object classes or to define new directory collections when there is a need to gather new information. it is still uncertain which protocols will prevail in the delivery of directory services. the x.500 suite of standards offers many advantages, particularly in the context of an organization's own directory service needs. directory searches can also be supported using z39.50 or whois++ protocols. xml query, once it is developed, may prove to be the protocol of choice for many applications because of its ability to treat directory entries as documents and to cope with complex document architectures. for this reason a key aim of the iso 2146 revision is to express the data structures in forms that are implementation-neutral and that can be easily mapped to other data modeling initiatives. it is clear that the data structures already developed through the pride and ilrs directory projects will need to be refined and extended as new functional and mapping requirements are identified. however, linking the iso 2146 revision to these two projects has provided a strong base for further development to meet the objectives defined in the proposed new scope and field of application for this standard. bibliographical references [1] "directories libraries". in digital librarian: a librarian’s choice of the best of the web. online. available: . 2000-09-13. [2] indecs home page. online. available: . 2000-09-13. [3] ifla. functional requirements for bibliographic records. final report. 1998. online. available: . 2000-09-13. [4] heaney, michael. an analytical model of collections and their catalogues. third issue revised. oxford. 14 january 2000. online (pdf). available: . 2000-09-13. [5] pride: people and resource identification for distributed environments. online. available: . 2000-09-13. the pride definition of service may be found in the pride service data preparation guidelines. online (pdf). available: reports . 2000-09-13. [6] national library of australia. register of australian archives and manuscripts (raam). online. available: . 2000-09-13. [7] z39.50 holdings schema. version 1.1. july 2000. online. available: . 2000-09-13. [8] mclean, neil. "the evolution of information resource sharing infrastructure: an australian perspective". library hi tech. vol 17(3). 1999. pp. 256-264. [9] blinco, kerry. "access management for interlending and document supply". australia/new zealand technical committee it/19 seminar. linking access to electronic library and information services. melbourne. 21 february 2000 / wellington. 21 june 2000. [10] jedds (joint electronic document delivery software). online. available: . 2000-09-13. [11] liddas (local interlending and document delivery administration system). online. available: . 2000-09-13. [12] australian libraries gateway. online. available: . 2000-09-13. [13] australian interlibrary resource sharing directory. online. available: . 2000-09-13. [14] australia's oral history collections: a national directory. online. available: . 2000-09-13. [15] north american interlibrary loan and document delivery (nailld) project. interlibrary loan directory. draft. 
december 1997. online. available: . 2000-09-13. [16] iso 2146 project. online. available: . 2000-09-13. [17] rslp collection description. collection description schema. v 1.11 2000/05/23. online. ukoln. available: . 2000-09-13. [18] dublin core metadata initiative. agents working group. online. available: . 2000-09-13. [19] simple collection description: a draft report from the elib collection description working group. online. ukoln. available: . 2000-09-13. [20] rslp collection description. relationship to isad(g) and the archives hub. online. available: . 2000-09-13. copyright © 2000 judith pearce doi: 10.1045/september2000-pearce   d-lib magazine january/february 2011 volume 17, number 1/2 acquiring high quality research data andreas hense department of computer science, bonn-rhine-sieg university oas, sankt augustin, germany andreas.hense@h-brs.de florian quadt department of computer science, bonn-rhine-sieg university oas, sankt augustin, germany florian.quadt@h-brs.de doi:10.1045/january2011-hense   abstract at present, data publication is one of the most dynamic topics in e-research. while the fundamental problems of electronic text publication have been solved in the past decade, standards for the external and internal organisation of data repositories are advanced in some research disciplines but underdeveloped in others. we discuss the differences between an electronic text publication and a data publication and the challenges that result from these differences for the data publication process. we place the data publication process in the context of the human knowledge spiral and discuss key factors for the successful acquisition of research data from the point of view of a data repository. for the relevant activities of the publication process, we list some of the measures and best practices of successful data repositories.   1 introduction in the research community the culture of conveying knowledge by publishing papers has a long tradition, and in the last decade digital text publication has been fully established. in all research endeavours where data play a central role the expectations of verifiability of experiments have grown [1] and the need for reusing and recombining existing data sets for further investigations is growing, too [9]. the publication of data is still under development, and the state of the art differs a lot in the research communities and disciplines. in this paper we will take a look at data publication in the context of the knowledge spiral of research. looking at the different stages and activities of this spiral, we establish a list of key factors that are important for the acquisition of high quality research data. an overall study on the state of the art in data publication would be beyond the scope of this paper. the authors' experience comes from a dfg-funded project on the publication of meteorological data and an e-research study tour in australia.   2 challenges of data publication in comparison to traditional text publications, data publications present some new challenges. this section contains a selection of prominent aspects.
whenever we refer to the term paper we think of all kinds of publications in text form like journal articles, presentations, books, etc. file formats for text publications, there are several widely accepted file formats such as plain text, html, rich text format, microsoft word, openoffice documents, adobe's portable document format (pdf), latex code, etc. all these formats are used by researchers worldwide in all disciplines. for research data, a variety of file formats exist, such as xml, spreadsheet files, database schemas, and many binary formats, which are optimised for the needs of a certain domain. individual formats cannot be opened with standard tools but need the installation of specialised software for analysis and visualisation. contents it goes without saying that data files serve a different purpose than research texts. a paper is written and consumed by a human being and thus necessarily has a subjective character. since its purpose is to transfer knowledge, the author develops a logical and didactical structure, highlights important aspects, interprets data, and uses tables, figures, and pictures to illustrate specific statements. on the other hand, data files are often directly or indirectly produced by a computer or an instrument. in general, the file formats used here have a strict structure and are optimised for being imported and processed by a computer. thus, data files are objective and plain, and not always adequate for direct human inspection. sqa scientific quality assurance (sqa) — especially in the form of a peer-review — has a long tradition. since sqa is a substantial process, the reviewer needs to have a deep understanding of the topic. in the case of a paper, the auditor reviews the propositions, their reasoning and references to other publications. these checks must be performed by a human expert and are time-consuming. since papers are optimised for human reception and hopefully have a restricted volume of pages, sqa is feasible in the majority of cases. by contrast, data can be huge and stored in formats that are not optimised for human reception. therefore, in the majority of cases, sqa on data cannot be done exhaustively and relies on the help of computers. it is still not clear in all cases how to systematically perform sqa on data. moreover, the procedure is heavily dependent on the kind of data and the domain. besides the primary data itself, its metadata needs to be checked. all in all, the reviewer must have the appropriate competence and software tools to perform scientific quality control. as a consequence of these difficulties, the quality level of data is not "peer-reviewed" but "approved-by-author". fqa and tqa in contrast to sqa, formal quality assurance (fqa) deals with aspects like word count, typesetting, and structure. in general, the fqa reviewer does not need to have a deeper understanding of the contents, and fqa is usually done in a fraction of the time needed for sqa. in the context of data files, a comparable approach to fqa is technical quality assurance (tqa). in the course of this procedure, reviewers check whether the data are complete and do not show any syntactical abnormalities (e.g. validation of an xml file against an xml schema). the comparison of checksums asserts an error-free transfer from source to target location.
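the two technical quality assurance checks just mentioned — validating an xml data file against an xml schema and comparing checksums after transfer — are easy to automate. a minimal sketch in python, assuming the third-party lxml package is available for schema validation; the file names and the expected checksum are placeholders.

import hashlib
from lxml import etree   # third-party package, used here only for xml schema validation

def sha256_of(path, chunk_size=1 << 20):
    """compute a checksum so the copy at the target location can be compared with the source."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def validate_against_schema(xml_path, xsd_path):
    """return (is_valid, messages) for an xml data file checked against an xml schema."""
    schema = etree.XMLSchema(etree.parse(xsd_path))
    document = etree.parse(xml_path)
    ok = schema.validate(document)
    return ok, [str(e) for e in schema.error_log]

if __name__ == "__main__":
    # placeholders: replace with the submitted data file, its schema, and the checksum reported by the source
    ok, errors = validate_against_schema("submission.xml", "domain-format.xsd")
    transfer_ok = sha256_of("submission.xml") == "<checksum reported by the source>"
    print("schema valid:", ok, "| transfer intact:", transfer_ok, "| errors:", errors)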
browsing and searching since research papers contain directly indexable words and are often accompanied by explicit metadata, the classification and indexing process is relatively straightforward. in practice, modern repositories support all of the formats for text files mentioned before. if research data are saved in file formats that are not recognised by the search engine or the data themselves are too large, the indexing must be restricted to the given metadata. therefore the metadata of data files are crucial for browsing and searching. storage site even a very comprehensive text publication can be saved as a single file and be stored in a repository. thus, the repository can serve as a long-term archive and can directly answer requests for access to a paper. when dealing with data files, we are easily confronted with volumes of gigabytes and terabytes. files of this size pose problems with respect to storage space, bandwidth, backup mechanisms, and costs. usually, text repositories do not meet the requirements for managing huge data files. instead, they allow for linking to external resources, which brings problems like dead links and synchronisation. data management and curation become an issue here [10, 7].   3 the data publication process 3.1 knowledge acquisition the acquisition of knowledge is characteristic of all human beings and especially of researchers. according to nonaka and takeuchi [8] knowledge exists in an implicit and an explicit form and transits between these forms by socialisation, internalisation, externalisation, and combination. the two relevant transitions in the context of publications are internalisation and externalisation. internalisation describes the transition from explicit to implicit knowledge, for example by reading documents or manuals. externalisation is the process of expressing implicit knowledge as explicit concepts. if we apply these two transition types to a researcher, we can regard internalisation as reading publications and watching presentations, thus receiving information of all kinds and learning from it. externalisation is the dissemination of personal knowledge like giving presentations at conferences and publishing papers. from a community point of view, we can observe that there are interactions between the internalisation state of one researcher and the externalisation states of others. this happens when the information demand of one researcher meets the information offerings of another. figure 1: the knowledge spiral for a researcher. a common metaphor for the acquisition of knowledge is a spiral which evolves continuously [8]. in figure 1 we show such a knowledge spiral in a research context. we suppose that in each winding, the researcher first is in an internalisation state, in which he has to learn, e.g. by reading books or by listening to talks (corresponding to the externalisation output of other researchers). after that he moves into the externalisation state, in which he writes documents or gives presentations. the arrows on the left and right of the spiral represent the scientific input and output respectively. 3.2 the data publication cycle if we focus on one single winding of the spiral introduced in section 3.1, we get a simplified publication process (see figure 2). since this article is about the acquisition of high quality research data, we have put this process in the context of data publication. in the following, we will briefly describe the six phases of this data publication cycle. we start with a researcher looking for data, thus with the phase 'search and browse' in the lower left section of the cycle. figure 2: the data publication cycle.
search and browse new research usually starts with searching and browsing for relevant data. the starting points and tools depend on the research domain and on the individual. the most obvious criteria for using a specific data repository are the quality and quantity of the items in its catalogues. these factors depend directly on the quantity and the quality of other researchers' input. whether a researcher commits or does not commit his work to a specific repository depends on the attractiveness of the repository, which is influenced by obvious qualities like functionality, design, performance, stability, usability etc., but also by factors like usage parameters, acceptance, and reputation in the community. inspect and learn when data from the previous activity are available, the researcher inspects, visualises, and processes the data. this activity comprises all efforts of the researcher to acquaint himself with the data and learn from them. this is an essential procedure in the internalisation stage since we have a straight transition from explicit to implicit knowledge at this point. in the context of text publications, this activity would be the reading part, but since data usually cannot be read in a traditional sense, we consider the term 'inspect' more suitable. experiment during this activity, the researcher applies the acquired knowledge to new problems. this can be done by conducting novel experiments or processing the data from a different point of view. as a result the researcher may gain new data that can be contributed to the community. interpret and annotate data this activity deals with the preparation of a data publication, and thus is the first activity of the externalisation stage. as data files are plain, the researcher needs to interpret and annotate the data to attract other researchers. this can be done by writing a paper which is based on the data and which highlights the interesting parts. for later use and registration in repositories the data file must be augmented with metadata. perform quality assurance quality assurance exists in a vast variety of procedures. as mentioned in section 2 we can differentiate between scientific and formal/technical quality assurance. the simplest form is quality assurance by the author himself, but it is well-known that it is hard to find all of one's own mistakes. a more advanced form is peer-reviewing, in which one or more researchers review the paper and the data and can reject the submission if it does not meet the formal or scientific quality standards. this method is used by a lot of repositories and scientific journals. with respect to data files which are to be published it is important to extend the quality assurance on data and metadata. metadata are of equal importance because they are crucial for registering the data in search engines and repository catalogues. quality assurance should be done thoroughly, particularly with regard to the immutability after publication (see section 4). publish and expose once data files, metadata, and additional documents have passed quality assurance, the files are published and registered. the term 'publish' refers to storing the data files in a publicly shared, long-term available space and assigning persistent identifiers such as doi or urn. under 'expose' we subsume the registration at search engines and repository catalogues. 
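a minimal sketch of the 'publish and expose' activity just described: store the data file in a long-term space, assign a persistent identifier, and produce the registration record that would be handed to search engines and repository catalogues. the identifier scheme, directory layout and field names are invented for illustration; a real repository would mint a doi or urn through a registration agency.

import hashlib, json, shutil, uuid
from pathlib import Path

ARCHIVE = Path("archive")          # placeholder for the repository's long-term storage area

def publish(data_file, metadata):
    """copy the data file into the archive, assign an identifier, and return a registration record."""
    ARCHIVE.mkdir(exist_ok=True)
    identifier = "urn:example:" + uuid.uuid4().hex      # stand-in for a real doi or urn
    stored = ARCHIVE / f"{identifier.split(':')[-1]}_{Path(data_file).name}"
    shutil.copy2(data_file, stored)
    record = {
        "identifier": identifier,
        "checksum": hashlib.sha256(stored.read_bytes()).hexdigest(),
        "path": str(stored),
        "version": 1,
        "locked": True,             # published data are treated as immutable; corrections become new versions
        **metadata,
    }
    (ARCHIVE / f"{identifier.split(':')[-1]}.json").write_text(json.dumps(record, indent=2))
    return record                   # the 'expose' step would push this record to catalogues and search engines

# usage (hypothetical file and fields):
# publish("temperature_series.csv", {"title": "station x temperature series", "creator": "j. researcher"})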
while it is technically possible to change data after publication it should be common practice to lock the data (immutability) and publish corrected data as new versions. after this activity, a publication cycle of another researcher can be initiated and we start over with the activity 'search and browse'.   4 key factors for high quality how can a data repository acquire high quality research data? the answer is simple: the repository has to be attractive for researchers, both in the externalisation and the internalisation phase. only if a repository attracts enough high quality submissions is a thorough selection possible, and many researchers will use the data sets in a repository if quality and quantity are right. the attractiveness of a repository can be seen as a high-level key factor. attractiveness itself depends on the following three key factors: reputation. if a repository has the reputation of accepting only high quality research data and the visibility of its publications is high, a publication there is of great value for the researcher. reliability. if the organisation which is running the repository is known to be well-financed and has a long tradition, researchers will believe that their data are stored safely and for the long term. process. if the submission and quality assurance process is well documented and user-friendly it will engender trust and lower the workload for potential submitters. according to the data publication cycle of section 3.2, the acquisition of high quality data mainly happens in the externalisation stage. we will focus on the associated activities 'interpret and annotate data', 'perform quality assurance', and 'publish and expose', and present factors on optimising these activities in order to acquire high quality data. 4.1 interpret and annotate data the first factor to be taken into consideration is whether a researcher is motivated to publish his data. data publication still is in a very early stage, but we are convinced that in the future, researchers of data-centric disciplines will not only be evaluated by their text publications but also by data publications (and the corresponding citations). from conversations with other researchers we have learned that they often do not know where to store their data permanently and that data publication appears difficult and costly to them. we found a very promising approach in the e-research community of australia. the national initiative called australian research collaboration service (arcs) aims among other things at providing australian researchers with tools and services around data storage, transport, access and sharing. these data services contain a service called data fabric. this service provides researchers with free data storage, which can also be shared. the idea is to put data on data fabric as early as possible to optimise collaboration, reduce data redundancies, and avoid later transfer steps. other interesting features of the data fabric are operating system integration and professional data backup in other sites. for solar, geophysical and related environmental data there exists the world data center system (wdc). researchers of these disciplines can approach the data centers, which are specialised and distributed all over the world, and submit their data for long-term archiving. besides simplified data storage, the annotation of primary data with metadata should also be easy. to avoid errors later, the metadata should be captured as early as possible, e.g. 
by entering the metadata in the field [5]. this way the researcher can enter metadata right at the moment that the primary data are gathered. the costs of metadata creation can increase greatly depending on when it takes place: a project in the netherlands estimated that it costs approximately 333 euros to capture the metadata of a batch of 1,000 records at creation, whereas the costs may rise to 10,000 euros if appropriate metadata is not created until 10 years later [2]. 4.2 perform quality assurance we distinguished between scientific and technical quality assurance. technical quality assurance procedures on data are driven by technical specifications such as accepted file formats or specified xml schemas. this makes tqa a systematic procedure which can be partially automated. we will therefore focus on scientific quality assurance. as we learned from section 2, sqa on data needs some kind of computer support in order to deal with huge amounts of data. from our experience, quality assurance of the primary data is mostly done by the author himself. there are efforts to support the author in this process by providing software tools which inspect the data, visualise the data and hint at abnormalities. examples of this approach are the activities in the research project "publication of environmental data" funded by the german research foundation (dfg). in the course of this project, a standalone software package for reviewing meteorological data was developed. this software detects outliers and other deviations based on user parameters. after the analysis, the software writes out an xml report which can be annotated by the author to justify the findings. to make the results reproducible, the system stores the used parameters and version information in the report file as well. then this report together with the data can be submitted to the long-term archive to document the quality measures. checking the metadata is another essential part of the sqa process and needs similar attention. this issue is also dealt with in the research project mentioned above. a web-based application reads existing metadata from the relevant long-term archive and presents the metadata as a series of thematically differentiated forms which the user can traverse, similarly to a software installation wizard. the individual input forms are as user-friendly as possible, utilising search functions for selections from comprehensive lists, drop-down menus for selections from controlled vocabularies, and map views for previewing geographical references, all with help texts available at all times.
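a toy version of the reviewing software described above: flag values that deviate from the series mean by more than a user-supplied number of standard deviations, and write an xml report that records the findings together with the parameters and tool version so the check is reproducible and can be annotated by the author. this is a simplified stand-in, not the dfg project's actual package.

import statistics
import xml.etree.ElementTree as ET

TOOL_VERSION = "0.1-sketch"   # recorded in the report so results can be reproduced later

def review_series(values, threshold_sigmas=3.0):
    """flag outliers and return an xml quality report including the parameters used."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    report = ET.Element("quality_report", version=TOOL_VERSION)
    ET.SubElement(report, "parameters", threshold_sigmas=str(threshold_sigmas),
                  mean=f"{mean:.3f}", standard_deviation=f"{sd:.3f}")
    findings = ET.SubElement(report, "findings")
    for index, value in enumerate(values):
        if sd and abs(value - mean) > threshold_sigmas * sd:
            # the author can later annotate each finding to justify or correct the value
            ET.SubElement(findings, "outlier", index=str(index), value=str(value), annotation="")
    return ET.tostring(report, encoding="unicode")

# e.g. an air temperature series with one suspicious reading
print(review_series([11.2, 11.4, 10.9, 11.1, 42.0, 11.3], threshold_sigmas=2.0))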
4.3 publish and expose we have already defined the publishing process as storing and identifying the data, and the exposing process as registering at repositories and making data findable to search engines. in subsection 4.1 we presented some approaches for simplifying storing in a long-term archive. identification with persistent identifiers can be made easy as well: another australian national initiative called the australian national data service (ands) [4, 6] aims at helping researchers to publish, discover, access and use research data. some interesting services in this context are the publish my data service, which helps an individual researcher to publish a collection of research materials with basic metadata, the register my data service, which assists researchers and research organisations in publicising their research data collections as a whole, and finally the identify my data service, which provides persistent identifiers for researchers' data. science happens on an international level. this is why several national institutions joined forces in january 2010 and founded an international initiative called datacite [3]. datacite is a not-for-profit agency which enables organisations to register research datasets and assign persistent identifiers. some of the benefits will be reduced infrastructure costs, better integration of national infrastructures, and advanced search capabilities for improving researchers' awareness of available datasets.   5 conclusion we have placed the subject of data publication in the context of the knowledge spiral in research. we have identified the key factors for data repositories in achieving the necessary quantity and quality of data submissions. we have given best-practice examples of measures taken to support the key factors in the different activities of the data publication process. we hope that placing this topic in the context of the knowledge spiral sheds some light on the priorities to be taken in optimising the knowledge spirals of individuals and the worldwide research community.   acknowledgements we are indebted to anthony beitz from monash e-research centre for his advice and help with his vast network of contacts. we thank paul bonnington and the people of his monash e-research centre for their kind hospitality in november 2009.   notes 1. florian quadt is funded by the dfg (german research foundation) under grant he-1966.   references [1] steve androulakis, ashley m. buckle, ian atkinson, david groenewegen, nick nichols, andrew treloar, and anthony beitz. archer — e-research tools for research data management. 2009. url http://eprints.jcu.edu.au/9194/1/9194_androulakis_et_al_2009.pdf. (access restricted.) [2] neil beagrie, julia chruszcz, and brian lavoie. keeping research data safe (phase 1) — a study funded by jisc. technical report, charles beagrie, april 2008. url http://www.jisc.ac.uk/media/documents/publications/keepingresearchdatasafe0408.pdf. [3] jan brase. datacite — a global registration agency for research data (working paper 149/2010). german council for social and economic data (ratswd), july 2010. url http://www.ratswd.de/download/ratswd_wp_2010/ratswd_wp_149.pdf. [4] adrian burton and andrew treloar. publish my data: a composition of services from ands and arcs. in 2009 fifth ieee international conference on e-science, pages 164-170, oxford, united kingdom, 2009. doi:10.1109/e-science.2009.31. [5] andreas hense, florian quadt, and matthias römer. towards a mobile workbench for researchers. in proceedings of the 2009 fifth ieee international conference on e-science, pages 126-131. ieee computer society, 2009. isbn 978-0-7695-3877-8. doi:10.1109/e-science.2009.26. url http://portal.acm.org/citation.cfm?id=1724812. [6] stefanie kethers, xiaobin shen, andrew e. treloar, and ross g. wilkinson. discovering australia's research data. in proceedings of jcdl 2010, june 2010. url http://andrew.treloar.net/research/publications/jcdl2010/jcdl158-kethers.pdf. [7] jens klump.
anforderungen von e-science und grid-technologie an die archivierung wissenschaftlicher daten [requirements of e-science and grid technology for the archiving of scientific data], 2008. url http://edoc.hu-berlin.de/docviews/abstract.php?id=29641. [8] ikujiro nonaka and hirotaka takeuchi. the knowledge-creating company: how japanese companies create the dynamics of innovation. oxford university press, usa, may 1995. [9] b. plale, d. gannon, j. alameda, b. wilhelmson, s. hampton, a. rossi, and k. droegemeier. active management of scientific data. internet computing, ieee, 9(1):27-34, 2005. issn 1089-7801. doi:10.1109/mic.2005.4. [10] andrew treloar, david groenewegen, and cathrine harboe-ree. the data curation continuum. d-lib magazine, 13(9/10), 2007. issn 1082-9873. doi:10.1045/september2007-treloar.   about the authors professor andreas hense received his ph.d. (dr. rer. nat.) in computer science at universitaet des saarlandes in 1994. he has more than ten years' experience as a project manager and consultant in banking and the public sector. he has been professor of business information systems at bonn-rhine-sieg university oas since 2004, where he teaches in the area of development and management of it-systems. his research interests comprise workflow management systems, document management systems, and e-research. his interest in e-research was sparked in 2006 by the escidoc project. since then he has studied the combination of mobile clients, workflow management systems and digital object repositories. since 2009 he has worked in the domain of data publication. in november 2009 he visited australia on an e-research study tour.   florian quadt studied computer science at the bonn-rhein-sieg university of applied sciences and received a master's degree in 2006. during his studies he focused on information retrieval and implemented a cross-lingual search engine in his master thesis. he worked for an it consultancy for two years, specializing in code analysis, architecture evaluation and verification. in april 2009 he returned to bonn-rhein-sieg university as project manager of a research project to develop a publication system for meteorological experiment data.   copyright © 2011 andreas hense and florian quadt   d-lib magazine november 2006 volume 12 number 11 issn 1082-9873 beneath the metadata: some philosophical problems with folksonomy   elaine peterson associate professor / information resources specialist montana state university background people have been trying to classify and organize information for thousands of years. there are many examples of cataloged items in ancient repositories, including items in the library of alexandria in egypt. taxonomy arose as an attempt to organize information about plants and animals in the physical world, and aristotle is often considered the father of classification or taxonomy. in his categories, he names substances (nouns) and determines the nine distinctive things that can be said about a particular thing [1]. how we ultimately name something reflects the category to which we assign it. through the development of categories, one is trying to answer the question, "what is it?" taxonomic methodology has also become important in mathematical set theory through discussions of set, class, aggregate, and collection [2]. neo-aristotelian realists are as interested today in taxonomy as they are in ontology. accurate classification is important in most, if not all, disciplines.
in today's networked world of digital information, classification has become very important. one gathers, collects, and shares resources, making the organization of databases and websites crucial. items that are different or strange can become a barrier to networking [3]. therefore, with the advent of the internet, structure and consistency of classification or indexing schemes has taken on a new relevancy. traditional classification although they do not construct ontologies, catalogers and indexers are the inheritors of the aristotelian tradition of categorizing things. catalogers work with information (books or journal articles) instead of biological entities like animals, but there are similarities in the processes of classification they use. when a cataloger applies a subject term to a book or a keyword descriptor to a journal article, he or she attempts to apply specific, relevant terms to the work. the cataloger is naming the work and distinguishing it from other works, yet is also grouping the work with similar entities. keeping in mind contraries, particulars, and categories, a cataloger applies basic aristotelian principles. a book on horses would receive the specific subject heading horses even though many of the horses described in the book may be different from each other, e.g., arabian horses vs. thoroughbred horses, or white horses vs. black horses. moreover, if one assigned the subject heading "white horse" to a photograph, it would be incorrect to also assign the subject heading "black horse". aristotelian contraries do matter in traditional classification systems. perhaps the most important philosophical underpinning of traditional classification is the phrase, "a is not b". even if a cataloger did not hold an underlying metaphysical stance that there is a particular way things are, the necessity of classifying and grouping physical objects has placed catalogers into that framework. book a might be related to book b, but a choice has to be made between them when classifying them. classical cataloging is restrictive rather than expansive. it is irrelevant that digital items can reside in more than one place, since one is talking about a classification scheme, not about the items themselves. another foundation of classical cataloging is the priority of the author's intent. "the cataloger must envisage the needs of the reader, endeavoring in every way to make it a simple process for him to find books. he should, like the librarian, adopt a neutral stand between the reader and his books, giving emphasis to what the author intended to describe rather than to his own views" [4]. this quote is timely since it recognizes the needs of the information seeker, but settles on a preference for the author's intent. the acceptance and prioritization of the author's intent as the way the item should be understood and therefore classified have traditionally been part of the practice of cataloging. recognizing an author's intent can sometimes be difficult; nevertheless, the goal is to recognize the author's intent over others' interpretations. folksonomy in the digital information world, folksonomy has emerged as an alternative to traditional classification. an article in wikipedia states: "a folksonomy is an internet-based information retrieval methodology consisting of collaboratively generated, open-ended labels that categorize content such as web pages, online photographs, and web links" [5]. 
the labels are called "tags", and they can make a search engine more effective in finding content because the vocabulary used for tags is user-generated rather than author-generated." it seems appropriate to define "folksonomy" using wikipedia, since wikipedia itself is a good example of a social network of individuals contributing to a work. wikipedia allows any person on the internet to contribute articles to it without judgment from others. (hammond and others have preferred to call folksonomies "social bookmarking", thereby emphasizing the social networking often achieved by a collaborative effort of assigning subjects and tagging an online piece [6].) what can folksonomies or collaborative tagging achieve? "in contrast to professionally developed controlled vocabularies (also called taxonomies), folksonomies are unsystematic and, from an information scientist's point of view, unsophisticated; however, for internet users, they dramatically lower content categorization costs because there is no complicated, hierarchically organized nomenclature to learn. one simply creates and applies tags on the fly" [7]. gray and others recognize the overall importance of folksonomies, but also note some problems with them, including typographical errors or spelling variations [8]. the overall usefulness of folksonomies is not called into question; just how they can be refined without losing the openness that makes them so popular. weinberger and others claim that folksonomies provide more benefits than simply cost benefits. they view the social aspect of tagging to be the most important feature of folksonomies to users. weinberger points out that "readers, not just authors, get to tag objects. an author is an authority when it comes to what she intended her work to be about, but not about when [sic] it means to others. when it comes to searching, what a work means to the searcher is far more important than the author's intention" [9]. weinberger also mentions as benefits financial savings and elimination of bureaucracies of catalogers and indexers, but the salient feature is the relevance of the subject terms to the searcher rather than to the author. hence, since there are multitudes of internet searchers, a multitude of tags is possible and indeed welcome. this candidness reveals more about the philosophy behind folksonomies. philosophical relativism appears to be the underlying philosophy behind folksonomies. because of those underpinnings, it is possible to jettison the limitations of a traditional classification statement such as "a is not b". in a folksonomy system, "a is relative to b", because each item's index terms will depend on the individual user and the tags he or she decides to use. a philosophy of relativism allows folksonomy to draw on many users with various perceptions to classify a document instead of relying on one individual cataloger to set the index terms for that item. thus, classification terms become relative to each user. certainly all individuals' perceptions are influenced by their own experiences and cultures, whereas the professional cataloger, even if trying to be unbiased, has only one viewpoint. yet to include all viewpoints opens up a classification scheme to the inconsistency that allows a work to be both about a and not about a. there is no question that an individual might have a personal, valid interpretation of a text. that is not the issue. 
the issue is that adding enough of those individual interpretations through tags can lead to inconsistencies within the classification scheme itself. take, for example, a dissertation entitled chemical effects of biofilm colonization on steel parts of automobiles in united states cities. based on the author's intention, the keywords assigned to the dissertation by a traditional cataloger might be: biofilms; corrosion. however, an internet reader could interpret the same dissertation as a work on the destruction caused by rust in the rust belt and might then assign the tags destruction; detroit to the dissertation. the folksonomy tags destruction; detroit would persist on the web as access points to the dissertation. a search of the web reveals sites, including many university websites, where folksonomy tags are now being employed, and an increase in the use of folksonomies for classifying electronic theses and dissertations (etds). although folksonomy tags began with bookmarking personal web sites, or grouping digital images in flickr, some are now using such tagging to index academic journal articles and university dissertations. although folksonomy practitioners point out some problems with the practice, those problems typically center on language. guy mentions linguistic issues. weinberger admits that differing terms might be applied when employing folksonomies, but that those will be minimal and eventually sorted out. for example, he refers to one user assigning a heading of "san francisco", while another uses "frisco". some advocates of folksonomies have recognized that a democratic approach to web cataloging also contributes to the abundance of irrelevant or inaccurate information, usually referred to as "meta noise". meta noise can be inadvertent (spelling white horse as whit horse), inaccurate (tagging white horse when the image is of a white cat), or irrelevant (using an esoteric tag known to very few). overall, many will view folksonomic classification of the web, as weinberger does, as "messy and inelegant and inefficient, but it will be good enough" [10]. if weinberger means that it might be good for allowing individual users to supply their own tags, he might be correct. however, if he means that it will be good for the average user, his claim is questionable, since folksonomies will not produce an efficient index. some of the problems with folksonomies can be traced to problems inherent with relativism. the first is that folksonomy tags are not merely "messy"; they can be inaccurate. because they assume a non-aristotelian stance, the tags allow contraries to exist. if i tag an article with the subject "white horse" and you tag it "black horse", that is all right since both can coexist in a folksonomy classification scheme. the problem with relativism is the question: "relative to what?" each internet user is bringing to bear on the item a different linguistic and cultural background. although this is an inherent strength of folksonomies (since it recognizes many valuable individual perspectives), it can also lead to the existence of contraries. a folksonomy advocate might reply that this is not true since the tags are relative to each user. yet, within the database itself, tagging allows an inconsistency to exist. this situation is, perhaps, the strongest criticism one could make of folksonomies.
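the data-model point at stake here can be made concrete with a small sketch. the code below is not from the article and is purely illustrative; it assumes a hypothetical tag store keyed by item and user, which is the structure that lets the "white horse" and "black horse" contraries coexist, whereas a traditional catalog would hold a single authoritative subject field per item.

```python
# illustrative sketch only: a minimal folksonomy tag store.
# tags are kept per (item, user), so nothing prevents contrary labels
# such as "white horse" and "black horse" from coexisting on one item.
from collections import defaultdict

class FolksonomyStore:
    def __init__(self):
        # item -> user -> set of tags
        self.tags = defaultdict(lambda: defaultdict(set))

    def tag(self, item, user, label):
        self.tags[item][user].add(label.strip().lower())

    def all_tags(self, item):
        # the union of every user's tags becomes the item's access points
        if item not in self.tags:
            return set()
        return set().union(*self.tags[item].values())

store = FolksonomyStore()
store.tag("photo-42", "user_a", "white horse")
store.tag("photo-42", "user_b", "black horse")   # the contrary tag is simply accepted
print(store.all_tags("photo-42"))                # {'white horse', 'black horse'}
```

nothing in such a store enforces "a is not b"; the scheme accumulates viewpoints rather than adjudicating between them, which is exactly the inconsistency described above.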
a dissertation displayed on the internet could be assigned subject headings deemed true by some groups of readers, but those same headings could be deemed false by other readers. therefore, a folksonomy universe allows both true and false statements to coexist. because tags are relativized, personal, idiosyncratic views can coexist and thrive in the form of tags, in spite of their inconsistencies. readers of texts on the internet become individual interpreters, despite the document author's intent. related to this is the problem of hermeneutics when multiple interpretations abound. as eco once observed, while it may be a principle of hermeneutics that "there are no facts, only interpretations", this does not prevent us from asking whether there might not be "bad" interpretations; for to say that there are no facts, only interpretations, certainly means that what appears to us as fact is the effect of interpretations, but not that every possible interpretation produces something that, in light of subsequent interpretations, we are obliged to consider as fact [11]. as with the dissertation example about rust given above, personal interpretations and judgments might be wrong if one is considering the author's intent. yet a stated premise of advocates of folksonomy is that the searcher's interpretation of a document is actually more important than the intent of the document's author. even should all interpretations be of equal worth, if users can continuously add tags to articles, at some point it is likely that the whole system will become unusable. a folksonomic system threatens to undermine its own usefulness. a final criticism one could make of folksonomies as classification systems is that their advocates seem to assume everything on the internet needs to be organized and classified. anyone who has a home library knows that this is not necessarily true. every day, individuals make critical assessments of information bits they encounter. their first decision is whether or not to retain the information, and if so, how to organize it. folksonomy advocates seem not to recognize that critical, first decision about retention. the free labor available to create folksonomies is appealing only to those who have already agreed that the entire internet needs some organization and cataloging. however, rather than being retained and organized, many internet items could be eliminated, ignored, or allowed to die off. most people put into the wastebasket (physically or online) flyers, ads and newsletters, and would not bother to organize ephemera. concluding comments the choice to use folksonomy for organizing information on the internet is not a simple, straightforward decision, but one with important underlying philosophical issues. although folksonomy advocates are beginning to correct some linguistic and cultural variations when applying tags, inconsistencies within the folksonomic classification scheme will always persist. there are no right or wrong classification terms in a folksonomic world, and the system can break down when applied to databases of journal articles or dissertations. folksonomists are confusing cataloging structure with personal opinions and subsequent social bookmarking. these are not the same thing, and they need to be separated. a traditional classification scheme based on aristotelian categories yields search results that are more exact.
traditional cataloging can be more time-consuming, and is by definition more limiting, but it does result in consistency within its scheme. folksonomy allows for disparate opinions and the display of multicultural views; however, in the networked world of information retrieval, a display of all views can also lead to a breakdown of the system. one is reminded of the borges story about the chinese emperor who wanted an accurate map of china [12]. the resulting map was very accurate, but it was exactly the size of china. with its inclusiveness, it was of no help, and it finally disintegrated. most information seekers want the most relevant hits when keying in a search query. folksonomy is a scheme based on philosophical relativism, and therefore it will always include the failings of relativism. a traditional classification scheme will consistently provide better results to information seekers. references [1] jonathan barnes, ed. the complete works of aristotle. princeton: princeton university press, 1984, p. 4. [2] john r. gregg. the language of taxonomy: an application of symbolic logic to the study of classification schemes. new york: columbia university press, 1954, p. viii. [3] lorcan dempsey. the library and the network in the changing research and learning environment. montana state university academic libraries symposium, sept. 29, 2006. [4] margaret mann. introduction to cataloging and classification of books. chicago: american library association, 1930, p. 3. [5] "folksonomy." in wikipedia. retrieved november 10, 2006. [6] tony hammond, et al. "social bookmarking tools (i): a general review." d-lib magazine 11(4). retrieved november 10, 2006. [7] "folksonomy." in wikipedia. [8] marieke guy and emma tonkin. "folksonomies: tidying up tags?" d-lib magazine 12(1). retrieved november 10, 2006. [9] david weinberger. "tagging and why it matters." retrieved november 10, 2006. [10] ibid. [11] umberto eco. kant and the platypus: essays on language and cognition. new york: harcourt brace, 1997, p. 48. [12] jorge luis borges. everything and nothing. new york: new directions, 1988. copyright © 2006 elaine peterson doi:10.1045/november2006-peterson   d-lib magazine november/december 2012 volume 18, number 11/12   report on the 16th international conference on electronic publishing: social shaping of digital publishing tomasz neugebauer concordia university, montreal, canada tomasz.neugebauer@concordia.ca doi:10.1045/november2012-neugebauer   abstract elpub2012, "social shaping of digital publishing: exploring the interplay between culture and technology", the 16th annual conference on electronic publishing, took place 14-15 june 2012 at the university of minho in guimarães, portugal. this report summarizes some of the arguments and results presented, and offers some review and reflection on the contents.
introduction the elpub conference has featured research results in various aspects of electronic publishing for the last 16 years, involving a diverse international community of researchers in computer and information sciences, librarians, developers, publishers, entrepreneurs and managers. an analysis of elpub paper keywords carried out in 2006 showed that the most frequent subjects were "users", "web", "metadata" and "xml". over the course of the last decade, open access, intellectual property rights, and institutional repositories have also been favorite themes among elpub authors. elpub2012, "social shaping of digital publishing: exploring the interplay between culture and technology", was the 16th annual conference on electronic publishing. it took place 14-15 june 2012 at the university of minho in guimarães, portugal. the conference programme included a panel discussion, two keynotes and parallel sessions: future solutions & innovations; digital texts & reading; digital scholarship & publishing; repositories & libraries; and special archives. this conference report includes a review of the keynotes and the panel discussion as well as a review of a selection of two presentations from each of the parallel sessions. the purpose of this report is to summarize some of the arguments and results presented as well as offer some review and reflection on the contents.   kathleen fitzpatrick's keynote the conference opened with a keynote by the director of scholarly communication of the modern language association, kathleen fitzpatrick. the keynote was titled the same as her latest book, planned obsolescence: publishing, technology, and the future of the academy [1]. the budget slashing experienced by libraries and university presses during the dot-com bubble burst of 2001 was particularly devastating for university presses, whose budgets decreased dramatically, while libraries managed their cuts through consortia agreements and improved interlibrary loan services. the unfortunate consequence is that marketability of content began to take precedence over high-quality scholarly merit. kathleen fitzpatrick suggests that scholarly publishing has become obsolete and is in need of significant structural changes beyond a conversion from print to digital. while academic tenure and promotion ought to depend on peer review and scholarly merit, not on media format, online scholarship continues to pose evaluation difficulties. the in media res project and the media commons are examples of encouraging initiatives towards open peer review. although there are experiments in open peer review, such as the 2010 issue of shakespeare quarterly, the pace of change in academia remains 'glacial'. if reviewers can sometimes miss the point of a work, it may mean that the structures of peer review are broken and in need of critical examination. in the world of print media, there was a true scarcity of space that required strict gate keeping. with the internet and digital media, that scarcity can now only be recreated artificially. instead of gatekeepers, kathleen fitzpatrick suggests the need for progress in coping with abundance through filters that include post-publication review and the public community. a necessary component of the change is in the perception of online publishing by the academy: publishing online has to be acknowledged as both legitimate and sufficient. online publishing and the use of post-publication review are more collaborative in nature than the single authorship of a traditional monograph.
however, perhaps it is precisely the notions of creativity and originality that will need to change if the academy is to avoid institutional obsolescence. the challenge to develop more effective information filters as a way of dealing with abundance, rather than focusing on gate keeping, will hopefully inspire progress in information retrieval technologies, but the specifics of these new filtering methods remain a challenge. electronic publishing and the web have indeed grown at an astonishing pace due in part to the fact that the costs of sharing content online are calculated according to different criteria than in the world of print. however, the social systems that produce scholarly publications are currently dependent on the gate-keeping function of pre-publication peer review.   special archives jelle gerbrandy described the design of, and lessons learned from, the development of the biography portal of the netherlands, a project in collaboration with els kloek from the institute for netherlands history. the lessons learned call for addressing copyright issues early on in a project; in the case of the biography portal, unresolved license issues with the data providers pose a fundamental challenge. the biodes xml format for representing biographical information that was developed for this project would likely be more widely adopted by other organizations if it did not deviate, for the sake of simplicity, from the text encoding initiative (tei) standard. maria josé vicentini jorente (universidade estadual paulista júlio de mesquita filho) presented an analysis of the national archives experience digital vaults as an example of a novel paradigm of information design of archival collections. the interaction design of national archives experience digital vaults is based on the linking of documents indexed independently from physical spaces, institutions, chronologies and archival fonds. maria josé vicentini jorente alluded to a tension between the user needs of professional visitors and those of the general public. the presentation of archives as fonds serves the former more, whereas new technologies in information retrieval have helped to facilitate a new "post-custodial paradigm in which any individual is able to access, research in and rebuild virtual collections, creating unique paths to approach historical contents". the digital vaults interface is indeed innovative in its design, but confirming the effectiveness of this design seems to call for further empirical study.   digital scholarship & publishing pierre mounier, from the centre for open electronic publishing at l'école des hautes études en sciences sociales (ehess), presented openedition freemium as a new commercial model devised for libraries interested in open access humanities and social sciences content. the term "freemium" was popularized by the journalist chris anderson in his book free: the future of a radical price [2]. in this context, it means that the social sciences and humanities books and journal articles are available to all in html format, while subscribing libraries get access to premium services such as pdf and epub download, usage statistics, export in marc format, alerts, assistance, training, and more. openedition is a relatively new and promising platform with participation from 20 publishers and 32 subscribing libraries. the freemium economic model seems like an ideal solution for social sciences and humanities publishing. it results in basic open access to full text in html-formatted content.
furthermore, subscribing libraries retain their role as mediators, purchasers and promoters of content while publishers retain the revenue stream from subscriptions that they depend on. caren milloy (jisc) spoke about the oapen-uk project, which aims to gather information on the potential for open access scholarly monograph publishing in the humanities and social sciences. in collaboration with five publishers, oapen-uk has set up an interesting comparative study on the impact of open access on a monograph's sustainability and profitability by measuring and comparing the usage, sales, citation and discoverability data of 60 monographs. the publishers proposed 30 pairs of monographs that are as similar as possible. this allowed the oapen project to create two groups of monographs to compare: an experimental group that will be made openly available on the oapen library under a creative commons licence, while the control group will be available only as ebooks for sale under the publishers' normal licensing. in addition to this experiment, the evidence gathering takes the form of focus groups composed of institutional representatives, publishers, authors/readers, funders, learned societies, ebook aggregators, research managers and administrators. the results of the initial focus groups are already available on the oapen site, and the results of the comparative study will become a valuable source of evidence for the impact of open access on the sustainability of monograph publishing in the humanities and social sciences.   repositories & libraries the swedish study "accessibility and self-archiving of conference articles: a study on a selection of swedish institutional repositories", presented by peter linde (blekinge institute of technology) et al., confirmed the importance of open access subject and institutional repositories in providing access to and preservation of conference papers. a significant number of the articles in the study were found in some type of oa archive, confirming that repositories are currently used for this purpose. furthermore, a striking 15% of the conference papers were not available at all in any format or platform and thus represent potential candidates for inclusion in institutional repositories. one of the interesting recommendations by the authors of this study is the development of a copyright policy database for conferences, similar to sherpa/romeo for journal publishers. in her report "open access in developing countries: african open archives", lydia chalabi (university of algiers) points to the lack of research on the use of open archives and on the open access movement's impact on the scientific production of developing countries. using open access directories as a data source, lydia chalabi filters down to the 27 open archives used for scholarly communication in african developing countries. an analysis of these reveals that the open archives in african developing countries are limited in various ways. for example, more than half of the archives include content that requires a local login, only three offer usage statistics, and the existing content consists mostly of theses. the open access movement intends to improve access to scholarly communication by removing economic barriers faced by scholars.
researchers in african developing countries can benefit from improved access to the outputs of research from other parts of the world, but it seems equally important to have sufficient support for contributing to the open access content through deposit/publication in open access archives.   panel discussion on academic e-books the topic of the panel discussion, moderated by peter linde, was "academic e-books — technological hostage or cultural redeemer?". kathleen fitzpatrick (modern language association and pomona college), antónio câmara (universidade nova de lisboa and ydreams inc.), delfim ferreira leão (university of coimbra) and karin byström (uppsala university) discussed the positive and negative aspects of e-books. although e-books represent a sustainable opportunity for academic publishing, their readability and access/impact will need to continue to improve. academic libraries face difficulties in selecting and acquiring e-books: many monographs still don't have an e-book version, many have embargoes/delays, and e-book publishers often have business models that can prevent libraries from purchasing their e-books for ownership and allow only limited licensed access. kathleen fitzpatrick pointed out that e-books are currently in an early stage of what they will become through the addition of video and interactive components. antónio câmara inspired a lively discussion on the future of teaching by arguing that open access video courses represent a fundamental and transformative change for the future of teaching and the university. he argued that professors will become tutors who offer additional perspective and motivation to students who will increasingly choose to learn from the world's top professors through video lectures. antónio câmara predicts that books will continue to play an important role in teaching, while libraries will be involved in the development of sophisticated visualization tools.   antónio câmara's "publishing in 2021" keynote antónio câmara continued the "video will be king" theme in his keynote speech. he paints a picture of the future where technology allows publishers to "print" interactive digital displays on "anything", such as product packages. he described the electrochromic display technology developed by the ydreams spinoff company ynvisible. currently, printing electrochromic displays is expensive, but as the costs go down, we can expect to see the digital become an even more integral part of physical product experiences. printing technology today still uses dots, but with over 700 billion dollars in printed goods produced in 2011, the race to produce inexpensive electrochromic printing is ongoing. futurology is a speculative activity and only time will tell if the digital world will make its way onto the physical printing presses of the future. contemplating the feasibility of accessing digital information sources through a printed interactive display on a physical object such as a magazine, a coffee cup or a postcard does help to generate an image of one possible future of "publishing in 2021".   future solutions & innovations carlos henrique marcondes (universidade federal fluminense) presented a paper titled "knowledge network of scientific claims derived from a semantic publication system". the textual format currently used for scholarly publishing mirrors the 20th-century print text model and restricts computer programs from precise and meaningful semantic analysis of content.
carlos henrique marcondes presents a prototype of an enhanced bibliographic record and author deposit interface that allows for the encoding of the conclusions of articles with the use of linked data principles and the national library of medicine's unified medical language system (umls). authors are asked to enter a conclusion, and natural language processing libraries are used to represent this knowledge as antecedent-consequent relations between phenomena using structured umls. the accuracy and reliability of the semantic formalization of article conclusions remain dependent on authors' familiarity with umls, since they are asked to validate the extracted relations and the mapping to umls terms during the deposit process. a vision of the opportunities in the "future of digital magazine publishing" was presented by dora santos silva. she cites encouraging statistics published by the association of magazine media (mpa): 50% of subscribers renew their magazine subscriptions. furthermore, 51% of young consumers are reading magazines electronically, while 41% of those who have downloaded apps have paid for magazine content. almost every print magazine has an online presence, although most of these are merely digital pdf replicas of the print. dora santos silva outlines key features that define a magazine: it has a beginning, middle and end; it is edited and curated; it has an aesthetic treatment; it is date-stamped and periodic; its contents are permanent, suffering only minimal corrections. she illustrates the potential of digital magazines using the following examples: ifly magazine, zoo zoom magazine, viv magazine, and all out cricket magazine. three magazines were profiled and critiqued in detail: flypmedia, magnética magazine and the new yorker — ipad edition. although many of the usability problems outlined by jakob nielsen can be found in the ipad editions of digital magazines, and only pdf replicas of print magazines exist for many magazines, dora santos silva's paper presents advantages of the digital formats over traditional ones that represent an opportunity for publishers.   digital texts & reading celeste martin (emily carr university of art + design) spoke from the point of view of a designer in the presentation co-authored with jonathan aitken titled "evolving definitions and authorship in ebook design". the multimedia and social interaction potential inherent in ebooks challenges traditional notions of authorship resting with the creator of the text. the impact of the design of the user experience and user participation in ebooks elevates the role of the designer. realizing the full potential for new "enhanced" ebooks that are designed for tactile use on a tablet requires a collaborative effort between writers and designers. celeste martin describes the results of such collaboration in her classroom, where five authors agreed to work with groups of design students in repurposing their books into digital format. the results included traditional book elements such as pages and linear navigation systems, but they also included features that offer an intentionally different experience from the original text. the results also included reader participation through annotation upload, content sharing with other users, game-like interaction, and creative "vertical" and random navigation and exploration. it seems likely that implementing some of these e-book designs would bring to light usability challenges with readers who experience difficulties in learning how to use them.
the aesthetics of ebooks emerge from an interdependence of form and content. chrysoula gatsou (hellenic open university) reported the results of a usability study on the use of visual metaphors, "text vs visual metaphor in mobile interfaces for novice user interaction", co-authored with anastasios politis (technological educational institute of athens) and dimitrios zevgolis (hellenic open university). the study observed younger and older novice (i.e., inexperienced with computers) users as they interacted with an application interface with two types of interaction icons: visual metaphor and text. the intention of a designer to create a clear layout and comprehensible visual metaphors is insufficient to guarantee that users will perceive and appreciate them as intended. chrysoula gatsou advises that choosing visual metaphors for interface icons requires careful consideration of their comprehensibility to users. this study reports that the metaphor of a "home" as the navigation button to return to the main menu caused problems for older users. the older users in the study performed better when interacting with text buttons, whereas younger users performed better in their interaction with icons. ideally, the interface designer can find universally comprehensible visual metaphors for icons, but the challenge is very difficult due to differences in users' age, experience, and culture. the authors of the study do not mention this, but it seems that pairing a visual metaphor with text alongside each icon may be an effective compromise. the authors of this study state that "cultural differences may also be defined by age", but the results show only that age correlates with some interface design preferences, not that age is a dimension of culture.   conclusion the conference on electronic publishing has provided a valuable venue for the exchange of ideas between librarians, computer scientists, publishers and others since it was first organized in 1997. elpub's comprehensive programme includes an astonishing variety of research perspectives on electronic publishing. this year, the theme "social shaping of digital publishing: exploring the interplay between culture and technology" inspired fascinating discussions on the future of digital publishing and contributions of research results from a variety of perspectives including design, librarianship, archives, publishing and computer science. this year's presentation by carlos henrique marcondes on formalizing the conclusions of papers so that they can be published along with the article metadata seems to be indicative of the focus for elpub2013. next year, at elpub2013, the 17th international conference on electronic publishing, with the main theme "mining the digital information networks", we can expect a greater focus on text/data mining, machine processing and knowledge discovery. the ethics of text/data mining can be a particularly relevant aspect that will hopefully be addressed by some of the submissions. elpub2013 is scheduled to take place june 13-14, 2013, at blekinge institute of technology in karlskrona, sweden. in addition to the traditional themes of publishing and access, the main theme of extracting and processing data from digital publications as well as the use of this information in social contexts will be featured. all of the papers from this conference are available in the elpub digital library and in proceedings published by ios press.
in the spirit of antónio câmara's argument for the primacy of video, recordings for most of the presentations are also available at educast@fccn (https://educast.fccn.pt/vod/channels/r0i3amwrr and https://educast.fccn.pt/vod/channels/ks05u0khu).   notes 1 fitzpatrick, kathleen. planned obsolescence: publishing, technology, and the future of the academy. new york: nyu press, 2009. 2 anderson, chris. free: the future of a radical price. 2009. isbn 978-1-4013-2290-8.   acknowledgements thanks to peter linde for his helpful comments and suggestions.   about the author tomasz neugebauer is the digital projects & systems development librarian at concordia university libraries and editor of photographymedia.com. he holds a bachelor's degree in philosophy and computer science and a master's in library and information studies from mcgill university.   copyright © 2012 tomasz neugebauer   d-lib magazine november/december 2013 volume 19, number 11/12   an overview of the 17th international conference on theory and practice of digital libraries (tpdl 2013) vittore casarosa institute for information science and technologies (isti), italian national research council (cnr), pisa, italy casarosa@isti.cnr.it ana pervan intern at the european organization for nuclear research (cern), meyrin, switzerland ana.pervan@cern.ch doi:10.1045/november2013-casarosa   abstract the 17th international conference on theory and practice of digital libraries (tpdl) took place in valletta, malta, during september 22-26, 2013. a diverse community of participants and their different research approaches gave an international and interdisciplinary feel to this year's conference. the general conference theme was "sharing meaningful information". approximately 300 delegates from more than 40 countries presented and discussed challenges and opportunities in digital library architecture, interoperability and information integration, digital library interfaces, user behavior, data re-use and open access, linked data, data visualization, long-term preservation, semantic web in digital libraries and digital curation.   introduction the 17th international conference on theory and practice of digital libraries (tpdl) took place in valletta, malta, during september 22-26, 2013. the general chairs were milena dobreva from the university of malta (malta) and giannis tsakonas from the university of patras (greece). they, along with the program chairs trond aalberg from the norwegian university of science and technology (norway) and christos papatheodorou from the ionian university (greece), organized a very successful conference. sponsors and co-organizers of the conference were the university of malta, cost (european cooperation in science and technology) and the unesco national commission in malta, which for a few days transformed valletta from a tourist capital into a digital library-oriented capital. malta's government was also one of the active supporters of the conference. the minister for education and employment of malta, evarist bartolo, opened the conference by highlighting the importance of technology in a changing library world and confirming the attendees' interest in the development of the digital library disciplines.
by receiving the "green light" from a representative of malta's government, the tpdl 2013 conference was officially opened.   conference highlights a diverse community of participants and their different research approaches gave an international and interdisciplinary touch to this year's conference. academics, practitioners, developers, students and users gathered in order to share new ideas and discuss the current "hot topics" in the field of library and information science. the general conference theme "sharing meaningful information" was divided into four broad areas, namely: digital library infrastructure, foundation, content and services. about 300 delegates from over 40 countries presented and discussed challenges and opportunities in digital library architecture, interoperability and information integration, digital library interfaces, user behavior, data re-use and open access, linked data, data visualization, long-term preservation, semantic web in digital libraries and digital curation. two keynote speakers captured the main tpdl objective, which was developing an interdisciplinary approach to digital libraries. christine l. borgman opened the conference by giving a speech on "digital scholarship and digital libraries: past, present and future". in her talk, borgman stressed the importance of writing and publishing scientific research papers while also keeping in mind data re-use. the second keynote speaker, sören auer, closed this year's tpdl conference by delivering a speech on "what can linked data do for digital libraries?". in his talk, digital libraries were presented as knowledge hubs whose main purpose is to create knowledge through the sharing of content by the means of linked data. the presentation started with a very concise introduction to linked data, showing how linked data can provide a semantic web on top of the existing "hyperlink web", allowing a more meaningful navigation and discovery of interesting information. two panel sessions covered some of the issues about the present state and the future of digital libraries. the first one, entitled "cost actions and digital libraries: between sustaining best practices and unleashing further potential" was focused on showing how multi-national cooperation could bring benefits to on-going digital library research. cost is a program of the european union aimed at strengthening europe's scientific and technical research capacity by supporting cooperation and interaction between european researchers. the second panel "e-infrastructures for digital libraries...the future" was focused on the ways in which new research methods, based on intensive computing and "big data", enable new means and forms for scientific collaboration. research and collaboration will be supported by e-infrastructures, allowing researchers to access remote facilities and manage and exchange large amounts of digital data. in addition to the technical sessions, with presentations and discussions of scientific papers, there was also what is becoming the "usual" minute madness session, in which participants were allocated one minute each to present the 21 posters and the 8 demos that could be seen during the conference. posters and demos covered a wide range of topics, from data curation and preservation to advanced search and retrieval, from recommender systems to semantic web and linked data. the two main topics in the closing session were the announcement of the winners of the "tpdl best paper award" and the venue and dates of next year's conference. 
there were three categories for the best paper award. the winners were: best paper award: "an unsupervised machine learning approach to body text and table of contents extraction from digital scientific articles" by stefan klamp and roman kern; best student paper award: "who and what links to the internet archive" by yasmin alnoamany, ahmed alsum, michele c. weigle, and michael l. nelson; best poster/demo: "country-level effects on query reformulation patterns at the uk national archives" by steph jesper, paul clough and mark hall. for the next tpdl conference, the important news is that in 2014, the jcdl conference normally held in the us and the tpdl conference normally held in europe will join forces and organize one single event, which will be held on 8-12 september in london. so save the date and plan to attend what is expected to be an extraordinary event, where innovative ideas, interdisciplinary approaches and novel results will be presented and discussed.   satellite events as in previous tpdl gatherings, a number of events related to the themes of the conference were held immediately before or after the conference. a day-long doctoral consortium gathered doctoral candidates and gave them a chance to present their research, discuss, share ideas and get guidelines for improving their current work. six contributions were presented and discussed, showing the diversity and breadth of the library and information science field: development of a methodology for automatically positioning electronic publications into the "universal decimal classification" system; integration of life science resources; modeling of archival user needs; digital libraries exploration using automatic text summarization; folksonomy-based resource recommendations for databases and digital libraries; a knowledge organization approach to scientific trends exploration. the day before the conference was dedicated to tutorials, and this year there were six half-day tutorials, some of them offering hands-on experience. the topics offered were: linked data for digital libraries; from preserving data to preserving research: curation of process and context; state-of-the-art tools for text digitisation; mapping cross-domain metadata to the europeana data model (edm); resourcesync: the niso/oai resource synchronization framework; the role of xslt in digital libraries, editions, and cultural exhibits. as can be seen, long-term preservation and the digitisation and management of data in the humanities continue to be two very relevant topics. continuing a tradition, a number of workshops were held at the end of the conference, bringing together academics and practitioners to discuss challenges, issues and opportunities on topics of current interest. the workshops were: practical experiences with cidoc crm and its extensions; moving beyond technology: ischools and education in data curation. is data curator a new role?; the 2nd international workshop on supporting users exploration of digital libraries; the 3rd international workshop on semantic digital archives; networked knowledge organisation systems and services — 12th european networked knowledge organisation systems workshop; linking and contextualizing publications and datasets: paving the way towards modern scholarly communication. the novelty of this year's tpdl conference was the cooperation with the ischool community through a global workshop of ischools (g-wis).
the main aim of g-wis was to support the collaboration between different ischools and foster the development of this movement on a global scale. students of dill (digital library learning), an international master's programme started in the framework of the erasmus mundus program of the european union and presently carried on by three partner universities, were invited to actively participate and present their work during the ischool session. the covered topics were: tools for preserving digitized special collections, digital curator competences, data visualization and fostering learning in digital environments. the cooperation between tpdl 2013 and ischools was perceived as a good opportunity for students and young researchers to network, gaining new ideas and getting a broader view of the academics' and practitioners' research approaches. finally, the social events included an opening reception, which was held along with the poster and demo session. thanks to the very good weather the reception was held in the open air, on the massive ramparts surrounding the venue of the conference. of course, there was also a social dinner, held in a characteristic restaurant in the old town of mdina, malta's old capital. a number of tours to valletta (malta's capital) and to gozo, the nearby island, were also offered to the attendees' guests. proceedings of this year's tpdl conference were published by springer in the lncs series under the title "research and advanced technology for digital libraries" and are available online.   about the authors vittore casarosa graduated in electrical engineering at the university of pisa. after a few years spent as a researcher at cnr (the italian national research council), he spent many years in the r&d laboratories of ibm in italy, france and the u.s., conducting and managing research mostly in image processing and networking. since 1996, he has been a senior research associate of the italian national research council at isti, where he is associated with the activities of the multimedia laboratory in the field of digital libraries; from 2000 to 2007 he was deputy director of delos, the network of excellence on digital libraries. from 2008 to 2012 he collaborated with hatii at the university of glasgow on training activities on long-term preservation of digital objects. since 2007 he has taught courses on digital libraries at the open university of bolzano, at the university of parma and at the university of pisa.   ana pervan is enrolled in an international master's degree program, digital library learning, which is offered in cooperation between oslo and akershus university college of applied sciences (norway), tallinn university (estonia), and the university of parma (italy). she holds a master's degree in knowledge management (information science) from the faculty of humanities in osijek, croatia. her two main research interests are data curation and the creation, representation and re-use of scientific data. currently she works as an intern in the gs-sis department at cern.
copyright © 2013 vittore casarosa and ana pervan   d-lib magazine september/october 2015 volume 21, number 9/10   the value of flexibility on long-term value of grant funded projects lesley parilla and julia blase smithsonian institution {parillal, blasej}@si.edu doi: 10.1045/september2015-parilla   abstract the field book project is an initiative to increase accessibility to field book content that documents natural history — primary source documents that describe the events leading up to and including the collection of specimens or observations during field research. it is a partnership between smithsonian institution archives, national museum of natural history, and smithsonian libraries. the project began in 2010 with a grant from the council on library and information resources (clir) to identify, locate, and catalog field books across the smithsonian institution. since then, the project has cataloged more than 7,500 field books across 8 departments and divisions of the institution. field book catalog records were made available to the public for the first time in december 2012 on smithsonian's collections search center. the project is now digitizing the cataloged field books, which are available on the collections search center and the biodiversity heritage library.   1 project description from its inception, the field book project has been a small entity in a big institution. permanent, full-time staff members are few and funding is derived solely from grant proposals, though the project goals are considerable and are supported by institutional technological resources and permanent staff who are able to act in an advisory capacity. instead of looking at staff and funding challenges as limits, the project has seen them as opportunities. full-time project staff have developed and maintained a robust system of communications and workflows that enable the project to adapt and take advantage of current and future institutional efforts as they arise. the project has developed to be flexible, able to seize new opportunities as they appear. when the project started, staff anticipated what it might look like in five years, and the project's current iteration varies in important ways from the original expectations. yet in most cases, by adapting to new opportunities and aligning itself with institutional goals, the project has met or exceeded original expectations in terms of the public access to field book content and how records are utilized. five years ago, at inception, the field book project set out to solve the problem of discovering, cataloging, and preserving an unknown (but known to be large) quantity of archival items across the institution [1]. some of the identified challenges were: finding a balance between cataloging quantity and cataloging quality. providing essential metadata while acknowledging and planning for the time needed to catalog at identified levels (e.g., up to one hour to catalog one item). cataloging with multiple access points such as taxonomic coverage, collection numbers, vessel names and expedition names. developing a consistent and logical method for describing content, such as geographic location, that is not fully governed by an authority source. the project uses multiple authorities for several descriptive elements.
established authorities such as library of congress subject headings (lcsh), getty art & architecture thesaurus (aat), and getty thesaurus of geographic names (tgn) are not universal in their coverage for this type of content. such content is described by different authors at varying levels of detail, and by different names, spellings, and languages over time. developing new workflows with an awareness of other institutional needs. creating workflows for the smithsonian project that would also be as useful as possible for partner organizations both large and small. five years later, with insights from consulting colleagues, the project has developed a flexible approach to data structure and has leveraged that flexibility to gain a leadership role in initiatives across the institution. it has allowed the project to be utilized in ways that were not part of the original goals but were made possible because: project data has been gathered and stored in a robust and flexible manner. the collections described are of a manageable size and sufficiently uniform format to use as a test bed for new initiatives and approaches, such as digitization and transcription. project staff have been able to develop a broader understanding of institution-wide goals and directions through their work across unit divisions. this understanding contributes to their capacity to proactively adjust project language, activities, and goals so that they are always in line with the most recent institutional goals and more easily supported and funded. by collecting robust and flexible data, staying in tune with long-term institutional goals, and volunteering project data as a test bed for new initiatives, the project has been able to act as a leader in developing workflows and standards for successful experiments and building knowledge and capacity for its own goals even when the experiments were not continued. to date, we have produced: flexible, efficient, replicable workflows relating to cataloging and digitization. documentation of business processes now utilized by other smithsonian units as well as non-smithsonian partner institutions. flexible, robust data that is contributed via a variety of online platforms. the first contributions to smithsonian's new transcription center and, through that and similar initiatives, expanded input into discussions of new approaches and tools for digitization, transcription, and online publication of archival materials.   2 field book project workflows the workflows developed by the field book project were for cataloging, conservation, and digitization of archival materials at the item level. cataloging: project staff were able to develop a cataloging workflow in which archival items are described on average in an hour or less, a fraction of the time required for cataloging archival items in traditional marc format or encoded archival description (ead) format, increasing potential throughput from two items to eight items per day. items are described in what might seem like an overly complicated blend of metadata object description schema (mods), encoded archival context (eac-cpf), and natural collections description (ncd) schemas. however, the custom schema addresses all internal institutional needs and, should marc or ead records be desired, all records can be and are regularly converted into the desired formats using metadata maps and standard workflows developed by project staff for ease of integration into the workflows of other units.
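to make the conversion step concrete, here is a minimal illustrative sketch (not the project's actual code; the custom field names "title_proper", "collector" and "expedition" are invented for illustration) of how a metadata map can drive the export of a record from a custom blended schema into a simple mods-style xml serialization:

```python
# hypothetical sketch: a metadata map drives the conversion of a custom
# catalog record into a minimal MODS-style XML export. the custom field
# names below are invented for illustration, not the project's schema.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# custom field -> (MODS wrapper element or None, MODS leaf element)
METADATA_MAP = {
    "title_proper": ("titleInfo", "title"),
    "collector":    ("name",      "namePart"),
    "expedition":   ("subject",   "topic"),
    "abstract":     (None,        "abstract"),
}

def to_mods(record):
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    for field, (wrapper, leaf) in METADATA_MAP.items():
        value = record.get(field)
        if not value:
            continue  # skip fields the cataloger left empty
        parent = ET.SubElement(mods, f"{{{MODS_NS}}}{wrapper}") if wrapper else mods
        ET.SubElement(parent, f"{{{MODS_NS}}}{leaf}").text = value
    return mods

record = {
    "title_proper": "field notes, 1911 expedition",
    "collector": "hypothetical collector a.",
    "expedition": "example expedition",
    "abstract": "daily observations and specimen lists.",
}
print(ET.tostring(to_mods(record), encoding="unicode"))
```

in practice the project's records draw on mods, eac-cpf and ncd elements and are also mapped to marc and ead, but the same pattern applies: the map, rather than the database layout, defines what each target format receives.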
furthermore, the cataloging workflows include a place for the input of conservation and digitization information into the database as well, proactively supporting those dependent workflows. cataloging workflows have been used to train partner institutions on efficient approaches to cataloging similar materials in their own archives. conservation: the field book project cataloging process flows naturally into the conservation process. a separate database module is triggered and populates when an item is marked for conservation during cataloging. conservation staff may then access this database and use it to better organize their own work for maximum efficiency. the seven thousand cataloged items provide a test bed for communication within the archives unit and have resulted in better communication between processing and conservation staff when, for instance, processing staff identify unexpected condition issues in the course of their regular work. digitization: the project was formulating its digitization workflows at the same time the smithsonian institution archives was designing its overall digitization approach. because the project relied on a close partnership with the archives for its technical digitization capacity, the units developed digitization workflows in a close, iterative, and highly cooperative process, the archives staff providing technical knowledge and the field book project providing metadata, throughput, and testing of data capacity. the field book project digitization workflow included a mandate to digitize complete folders and, when possible (exceptions were made for items in need of conservation), complete boxes from a series identified for digitization both from a broad pick list and from occasional reference requests. project workflows also called for steady digitization throughput, providing complete item-level metadata and adding page-level technical and administrative metadata during digitization. at the time, the institution archives was doing item- or even page-level digitization for reference upon request, and item-level description was often based on information available from the requestor. archival description had been completed only at the series level, as is common in the "more product, less process" approach [2], which had complicated the process of providing expedient and complex metadata for materials as they were digitized at a more granular level. field book project records provided for the first time a consistent folder-level description that could be used as a model for the archives staff, enabling the archives to change their approach to digitization without extensive use of staff time and resources. the archives also followed the lead of the project in deciding that, when an item was requested for digitization, the entire folder or box would also be digitized and, when possible, made available online. project social media about field book content and online record availability translated to an increase in digitization requests for field notes. on top of the regular project digitization pick list, this increase meant a regular enough flow of digitization for the project and archives staff to test multiple workflows and baseline digitization activities, developing standard guidelines for how to digitize archival items.
these items offer a wide range of challenges such as materials that are over- and under-sized; materials in poor condition; materials with inserted photos, negatives, and specimens; materials that include previously unidentified contributing authors; and materials with potentially sensitive material (such as breeding ground or migration pattern data), all of which are addressed and expedited by the new workflows. the time required to digitize has markedly decreased even as requests have increased. the resulting guidelines and workflows are now being made available to other smithsonian departments and research institutions as they begin to take on the challenges of digitizing archival materials. figure 1: digitization workflow. figure 1 shows the digitization workflow developed with smithsonian institution archives, with additional workflows for contribution to the biodiversity heritage library (bhl) and the transcription center. smithsonian institution archives (sia) maintains images on the institution's digital asset management system (si dams), which are then made available to the public through the smithsonian collections search center (csc). in order to contribute to additional systems like internet archive (ia) and bhl, an additional workflow was developed to send images and metadata to bhl's metadata collection and workflow system (macaw). the workflow has been of use to other bhl institution partners looking to contribute primary source materials.   3 flexible data records as mentioned briefly in the overview of the cataloging workflows, the field book project cataloged items in a custom schema that combined mods, eac-cpf, and ncd. the project also developed a custom installation of a filemaker pro database. normally, using a custom metadata schema and database is problematic. however, the custom schema and database have allowed the project to create item-level records that are detailed yet flexible enough to be exported in a variety of formats for many archival and library systems. at the item level, the extensive descriptions in controlled subject fields, multiple access points, and metadata maps, which allow records to be exported in an anglo-american cataloguing rules (aacr2) marc-compliant format, mean that records can be published by the biodiversity heritage library (bhl). collection-level records provide the basis of collection-level description in archives online records for newly accessioned field book collections. item-level abstracts are utilized by si archives and now provide a template for description of non-field-book materials that are digitized and require description. full item, collection, and eac-cpf records with abstract and controlled subject fields can be, and are, made available through a simple xml export to the smithsonian's collections search center (csc) and, through there, on the digital public library of america (dpla) and europeana. field book project records can not only be easily exported for use by more traditional systems like the marc- and xml-based bhl, csc, and dpla, but are also flexible enough to contribute to new systems, as the project is currently a pilot partner with social networks and archival context (snac), developing new exports of eac-cpf records for snac's shared search database.   4 early contribution to systems as discussed earlier, the field book project was an early contributor to the development of the smithsonian institution archives' digitization processes.
4 early contribution to systems as discussed earlier, the field book project was an early contributor to the development of the smithsonian institution archives' digitization processes. the field book project has also been an early contributor to the smithsonian transcription center. the project's strong description, well-developed workflows, large selection of catalog data, and expert content knowledge enabled the project to serve both efforts. when the transcription center first came into being, staff could easily sift through field book project content to select those materials most likely to engage volunteers and offer those materials quickly. digitized field books could be easily loaded to the transcription center and helped the staff learn what appeals to their audience of "volunpeers," tailor what they make available and when, and offer guidelines and expertise to other units based on what they learned in their experiments with the field book content. because of this early availability and willingness to serve as a testing base, even today field books make up a substantial portion of the materials fully transcribed, despite the project's smaller size and limited funding. furthermore, the full text of completed transcriptions is searchable through the smithsonian's online catalog. more than 100 field books are now text searchable. this is a large enough group that both project and archives staff have been able to develop search methods to find content, which has in turn provided transcription center staff with solid feedback about the utilization of crowd-sourced transcription for accessibility. the result has been a positive feedback loop, where new users share their ideas and experiences with the collection based on the transcriptions, and the project staff can then use that input to manipulate their data in other ways and address concerns or create services to further increase discoverability and usability.   5 workflows offered to other institutions — unexpected results the project planned to develop all workflows with the idea that they could be used and re-used by other units and other institutions. originally, the project anticipated that smaller partner institutions might be more interested in our workflows relating to collection-level description. this has not been the case. the greatest expressed interest from both small and large partner institutions has been for the item-level workflows, especially as the project has demonstrated the capability to contribute to well-known and established systems like the biodiversity heritage library. bhl's strong online presence and wide consortium of libraries and research institutions have demonstrated that even though the project uses an innovative, hybrid cataloging structure, project methods and data can easily contribute to current systems. this contribution to bhl has been especially important to the long-term project goal of reconnecting all primary source field book documentation with the resulting publications. to date, the project has contributed more than 543 digitized field books, using commonly available software (excel) and file formats (csv, pdf, tiff), to bhl, where they can be searched for and discovered alongside their sibling publications.   6 social media — not to be underestimated many institutions have questioned the effect and value of social media; the field book project has benefited markedly from a consistent, strong output of online materials and interactions. the field book project maintains a website, blog, and twitter account. early in the project, catalogers began to write blog posts that included collection highlights and descriptions of unexpected finds in materials.
they have also written blogs that have been important for explaining project methods, providing examples of the type of work the project does, and providing informal access points to field book collections for researchers. as staffing numbers have changed, the project has expanded and maintained a consistent social media presence, which has meant significant numbers for blog visits (60 unique hits per day), twitter followers (750), and flickr and website traffic, despite the project's small size. field book content has been used for wikipediathons, flickr sets, blog posts, and transcription projects. field books have been a consistent source of new information about the history of women in science at si and citizen science contributions. project blog posts and other media are important tools for sharing stories and showing connections that often inspire new research interest and, on the transcription center, new "volunpeer" participation. the project has also begun to coordinate its social media output with that of other smithsonian units, increasing visibility while decreasing the time that full-time project staff spend creating social media content.   7 still to accomplish the field book project has come a long way, from hardly knowing what content might be held at the smithsonian to having over 7,300 items cataloged, 550 items digitized, 105 items in the smithsonian transcription center, catalog records published in the collections search center, dpla, and bhl, and robust and flexible workflows for accomplishing all of those tasks with fewer than three full-time staff. even so, the project is still investigating how to strengthen the relationship between the field notes, the specimens they describe, and related published literature. contributing field books to bhl answers part of this challenge. social media may also play a role. transcription and other crowdsourcing appear to be imperative for finding connections between specimens and publications, since there are simply no financial resources currently available for such a huge task. furthermore, while the project has methods to manage historic field books, it is also concerned about what the upcoming challenges might be for dealing with field books from current creators. the project has been opening dialogues with natural history departments and divisions not only about their closed collections of field notes, but also about how they have used and stored field documentation over the last five to ten years. this effort has already produced unique information on what the project is likely to encounter (such as digital "field notes" kept in word or google docs, on hard drives, and in other cloud sources) and also unique partnerships for describing and using historical field notes, as demonstrated by the february 2015 #fwtruelove transcription center challenge. finally, how does the project continue to accomplish its goals in a way that is consistent with current and future trends? the project has had to be flexible with field book data in online systems in the past. furthermore, even as project throughput and upload have increased, staff have been able to demonstrate enough value both to the project and to future users to ask those online systems to adapt to the project rather than the other way around. for instance, smithsonian's collections search center now utilizes image galleries in order to show high resolution images of field book pages.
bhl originally required marc records for upload and could not display transcribed content or item-level abstracts. seeing the value of the project's requests for other bhl users, however, bhl has adapted to project requirements: it is now able to accept csv files and transform them into marc, it has worked with project staff to find a place for transcriptions, and it is currently working on a display for item-level abstracts. project staff cannot predict the next system requirement, user service, or functional capacity the project will be asked to serve, nor can they anticipate where the changes will originate and what the project resources will be at that time. the field book project has been agile enough to stay on top of institutional developments for the first five years of its existence. it only hopes to be able to continue to adapt and remain relevant in the shifting digital archival landscape.   notes 1 this article is an update to nakasone, sonoe, carolyn sheffield (2013). "descriptive metadata for field books: methods and practices of the field book project," d-lib magazine, vol. 19. http://doi.org/10.1045/november2013-nakasone, which describes the project cataloging structure in detail. 2 greene, mark a., dennis meissner (2005). "more product, less process: revamping traditional archival processing". american archivist 68: 208-263.   about the authors lesley parilla is the database manager and principal cataloger for the field book project. she coordinates project outreach efforts in conjunction with smithsonian institution archives and nmnh staff as well as manages the project's social media content. she began as a contract cataloger with the project in 2011.   julia blase is the project manager for the smithsonian field books project. she manages day-to-day project operations and coordinates communications between project partners. julia blase comes to the biodiversity heritage library from the national digital stewardship residency, a fellowship program with the library of congress, where she spent the last year completing a digital asset management analysis, needs assessment, and strategic plan for the national security archive. prior to that she managed the denali '13 centennial exhibition project at the american alpine club library in golden, co.   copyright © 2015 lesley parilla and julia blase report on the workshop of learning object repositories as digital libraries: september 22, 2006, alicante, spain d-lib magazine october 2006 volume 12 number 10 issn 1082-9873   miguel r. artacho uned university, spain erik duval k.u. leuven, belgium the first ever learning object repositories as digital libraries workshop was held in alicante, spain, as part of the european conference on digital libraries (ecdl 2006). the workshop took place on the campus of universidad de alicante on friday, 22 september. miguel rodríguez-artacho, uned university, spain, and erik duval, k.u.leuven, belgium, served as the workshop chairs. current challenges for learning object repositories as digital libraries repositories of learning objects are a key component for supporting the authoring of educational material. the learning object repositories (lor) workshop addressed the increasing importance of the digital library (dl) paradigm as a reference framework for building learning object repositories.
current trends are focused on incorporating new interfaces and metadata based on learning semantics and social interactions. lor workshop papers covered interesting open issues on the above-mentioned topics, from the application of semantic technologies to reusability in the context of dls. in accessing a learning object repository through a semantic layer, i. mayorga and co-authors addressed the need to create semantic frameworks on top of a lor to enhance metadata annotation and facilitate meaningful retrieval. the paper states that browsing by metadata fields does not fit learners' or authors' needs, and relies on a semantic layer for automatic metadata creation. in related work, the construction of a spanish educational repository: advances and future directions of the doteine project, from m. a. marzal, addresses the building of a semantic reference framework for a specific community (in this case spain). marzal's paper provides an example of how strongly metadata structures are related to social and educational context. kaufmann and figueira, in disaggregation and reutilization of sharable content objects, focus on a specific reusability issue: automated disassembling and reassembling of scorm (sharable content object reference model) packages. their paper details the classification process and problems caused by the loss of metadata and learning context of the disaggregated sub-packages and shared learning objects. in their paper, bridging the semantic gap in standards-based learning object repositories, e. rodríguez et al. show how, in practice, current interfaces to access digital repositories of learning objects (los) may need a semantic layer. in this case, the ims digital repositories specification has been enhanced in order to deal with the reference in metadata to ontologies and the use of semantic query languages. finally, m. harris and j. thom show in their paper, challenges facing the retrieval and reuse of learning objects, a broader view of reusing los, which faces social, technical, and cultural challenges. a statistical study is presented with some interesting conclusions on the behaviour of academic staff when they are reusing educational material, concluding that metadata labelling can create a gap between creators and users of los, making the retrieval process tricky rather than facilitating search in lors. the acceptance rate for papers submitted to lodl 2006 was 35.7%. keynote talk: lors as dls or dls as lors? the lor workshop keynote speaker was edward fox, from virginia polytechnic institute and state university (virginia tech), who used the videoconference service provided by our partners from the open university uk (http://www.flashmeeting.com). his in-depth, rich talk, titled "a 5s perspective on learning objects repositories as digital libraries", focused on how dls can be the cornerstone for e-learning, facilitating collaboration, distribution and reuse. full workshop papers, presentations, and the recorded keynote talk can be accessed from the workshop web site. acknowledgment the lor workshop chairs would like to thank the kaleidoscope and prolearn networks of excellence for their support of this workshop, and rafael carrasco (u. alicante) and felisa verdejo (uned university) for their excellent work chairing the ecdl 2006 conference. copyright © 2006 miguel r.
artacho and erik duval doi:10.1045/october2006-artacho   a policy checklist for enabling persistence of identifiers d-lib magazine january/february 2009 volume 15 number 1/2 issn 1082-9873   nick nicholas link affiliates nigel ward link affiliates kerry blinco link affiliates abstract one of the main tasks of the persistent identifier linking infrastructure (pilin) project was to establish a policy framework for managing identifiers and identifier providers. a major finding from the project was that policy is far more important in guaranteeing persistence of identifiers than technology. key policy questions for guaranteeing identifier persistence include: what entities should be assigned persistent identifiers; how those identifiers should be exposed to services; and what guarantees the provider makes about how long various facets of the identifiers will persist. to make an informed decision about what to identify, information modelling of the domain is critical. identifier managers need to know what can be identified discretely (including not only concrete artefacts like files, but also abstractions such as works, versions, presentations, and aggregations), and which of those objects it is a priority for users and managers to keep track of. without working out what actually needs to be identified, the commitment to keep identifiers persistent becomes meaningless. to make sure persistent identifiers meet these requirements, the pilin project has formulated a six-point checklist for integrating identifiers into information management, which we present here. introduction organisations and individuals depend on resources made available online. increasingly, this includes the requirement that well-managed resources remain available and accessible over the long term. to guarantee long-term access, as well as availability, maintaining persistent identifiers for resources is critical and needs to be included in any planning with that goal in mind. the pilin project (persistent identifier linking infrastructure) was funded from 2006 through 2008 to strengthen australia's ability to use global persistent identifier infrastructure, particularly in the repository domain. there is a clear technological component to this work, and the project includes software and service specifications among its outputs. [1] however, it is well known that technology is not sufficient to guarantee persistence: if anything, depending on particular technologies ends up getting in the way of persistence. guaranteeing persistence of identifiers depends much more on cogent policies to make sure that someone keeps assets online and links up to date, over a well-defined period. it is just as important to be seen to have cogent policies as to have them: the proper goal of persistence is to establish user trust in using the identifiers. with that in mind, a major focus of the pilin project was to establish a policy framework for managing identifiers and identifier providers. the policy framework includes guidelines and considerations in formulating and using persistent identifiers; instances of policy documents as exemplars; and documentation of community requirements for identifiers.
to make such a framework independent of specific technologies (and able to survive changes in technology), it was also necessary to model identifiers and identifier services, independently of technology. the pilin modelling includes an ontology, a glossary, and a service usage model. the technology-independent modelling means that pilin is agnostic as to the choice of identifier scheme: as a result, pilin used a number of identifier technologies for implementation work, and the handle system® in particular to develop identifier services. pilin also produced documentation on how to use http uris persistently. [2] the pilin project was cross-community: it was intended to address the needs of the research, library, and learning domains. as the project gathered requirements from various stakeholders for how persistent identifiers should be managed and used, several questions kept coming up on how to create and manage identifiers for persistence. addressing these questions, we came to realise, leads to a policy checklist against which identifier solutions can be benchmarked. the recurring questions about persistent identifiers were: q1. what do you mean by "persistent"? q2. how long should identifiers persist? q3. what should i identify persistently? q4. what steps should i take to guarantee persistence of identifiers? q5. what technologies should i use to guarantee persistence of identifiers? of these questions, the first forced us to do some detailed modelling of identifiers and their qualities, so that persistence could be spelled out properly. the modelling exercise has been extensive, but we summarise our thinking about persistence in particular below. the last question does not have one right answer, and depends on how the other questions are answered and how the identifiers will interact with other computer systems. the middle three questions, however, go to the heart of planning for persistence, and to address them the project came up with a "six-step information and service modelling program", which we describe here. [3] the six-step program is defined at a high level, and needs to be filled out with domain-specific details. it can be thought of as a checklist for thinking to ensure that q2, q3 and q4 are addressed for the given enterprise. q3 depends on engagement with the resource provider and q4 on engagement with the resource manager; q2 engages with the resource user, clarifying what undertaking is being made to them. we start by addressing q3 and q4; answering q2 depends on our answer to q1, and we consider both together at the end. figure 1: recurring questions about persistent identifiers what should i identify persistently? the persistence of identifiers, which is what we discuss here, is a separate issue from the persistence of the resources they identify. in discussing persistent identifiers, we are assuming in what follows that the resources they identify are prioritised to remain available; the persistent identifiers ensure that they remain accessible as well. plans to ensure that resources remain available (such as data management plans discussed below) need to factor in persistent identifier considerations. persistent identifiers, however, do not "just happen". persistence involves a guarantee to the user that the identifiers will be kept up to date, and this requires an ongoing commitment of resources. 
for that guarantee to be meaningful, identifier managers cannot undertake to identify everything in their domain: they need to decide on the resources for which they will provide persistent identifiers. our recommendation is to prioritise resources that will be accessed, cited, described, managed, or otherwise engaged with by the user community, over the timespan of persistence. this priority cannot be set by the identifier manager in isolation: it needs the involvement of the user community (often represented by the resource provider), who will have a good idea about how the resource will be used or referred to. it also needs the involvement of the resource manager, who will keep the resource accessible and in good order; they will determine for how long the guarantee is realistic, and what the technical constraints are on maintaining the resource (at least while it is online). prioritising what to identify, in turn, critically depends on information modelling of the domain. persistent identifiers can be associated not only with concrete objects such as digital files, but also with more abstract concepts that can be used for discovery and management of resources. such abstract concepts include families of resources (such as the works, expressions and manifestations modelled under frbr); [4] aggregations; disaggregations; and transformations of resources. modelling is needed to define and manage those abstractions. resource providers and resource managers need to be informed of what abstractions are possible, and how they should be translated to concrete representations. so the question "what should i identify persistently?" maps to two strategies, which depend on engaging the resource provider: 1. work out what things in the domain can be identified by an identifier (what abstractly and concretely exists in the domain), and how they are related.   2. work out what things in the domain will be identified by a persistent identifier (what matters in the domain). these strategies correspond to the first two steps of our policy checklist: i. have an information model for your domain.   ii. incorporate persistent identifiers into the information model. an explicit information model allows the distinction to be made between concrete and abstract entities, as well as representing the relations between them – e.g., version, copy, transformation, choice of granularity. what steps should i take to guarantee persistence? the community of users expects that persistent identifiers will provide persistent discovery and access to resources. but beyond that, there is also an information management concern: the resources being identified should be managed in a way that will reduce the impact on identifier persistence. for example, if a resource's digital location is moved in order to guarantee the resource's persistence, the identifier should be updated correspondingly, to keep the identifier persistent. but it would be better if the digital location of the resource were moved less frequently – and if not every update of the resource had to force an update to the identifier. a clear way to prevent disruption is to ensure that identifier management is integrated into information management. if information management uses one ("local") identifier to interact with the resource, but identifier management uses a different identifier for persistence, then information management and identifier management are decoupled.
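a minimal sketch of that decoupling may help (the class, method names, and identifiers below are invented for illustration; this is not pilin software, nor a real handle or purl client): the persistent name and its association with the thing are fixed when the identifier is minted, while the local access key behind it can be rebound as the resource moves.

```python
# Generic illustration of decoupling persistent identifiers from local access keys.
# Not PILIN software or a real Handle/PURL client; all names here are invented.

class PidRegistry:
    """Maps persistent identifiers to a (mutable) local location plus a metadata record."""

    def __init__(self):
        self._records = {}  # pid -> {"location": str | None, "metadata": dict}

    def mint(self, pid, metadata, location=None):
        # The persistent name and its association with the thing are fixed at mint time;
        # the location (local identifier / access key) may start empty and change later.
        self._records[pid] = {"location": location, "metadata": dict(metadata)}

    def resolve(self, pid):
        """Resolution: return information about what the identifier names,
        even if the resource is no longer retrievable."""
        return self._records[pid]["metadata"]

    def retrieve(self, pid):
        """Retrieval: return the current local access key, if any."""
        return self._records[pid]["location"]

    def rebind(self, pid, new_location):
        """Called by information-management workflows whenever the local
        identifier changes, so the indirection never goes stale."""
        self._records[pid]["location"] = new_location


registry = PidRegistry()
registry.mint("pid:example/42", {"title": "Survey dataset, 2008"})          # no location yet
registry.rebind("pid:example/42", "https://repo.example.org/obj/9f3c")      # bind on deposit
registry.rebind("pid:example/42", "https://archive.example.org/item/9f3c")  # survive a migration
assert registry.resolve("pid:example/42")["title"] == "Survey dataset, 2008"
```

the rebind step is exactly the maintenance dependency discussed next: if the workflow that moves a resource never calls it, the indirection goes stale and the persistent identifier breaks.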
this dns-like indirection strategy is commonplace: urls or record keys are typically used to manipulate resources directly, as local identifiers. these are strictly speaking access keys (locators) for obtaining resources on local systems. they can be used as identifiers, because they still associate a name (the access key) with a resource; but they are only as persistent as the storage arrangements they reflect. on the other hand, persistent identifiers (including handles, purls, and "cool" uris) [5] are used to refer to resources more abstractly, and provide indirection to the local identifiers. this decoupling forces a maintenance dependency between identifier management and information management: if the local identifier changes, the indirection must be updated for the persistent identifier. the ability to perform such updates makes persistent identification possible, and independent of technologies. failing to actually perform the updates compromises persistent identification, and defeats the purpose of indirection. as already noted, guaranteeing updates is to a large extent a policy issue. but the workflows for managing resources, as determined by those policies, should maintain a loose coupling between local and persistent identifiers: any workflow leading to a change of local identifier must force an immediate update to the indirection of the persistent identifier. where possible, that update should be automated, rather than manual. moreover, identifier workflows need to be timed so that persistent identifiers are created when it is least disruptive. if a resolvable persistent identifier is branded onto an object (for example, if a web page has a uri identifying it embedded in the web page), then the identifier has to be created before the object is – but the identifier can only resolve to the object once the object is created, complete with the branded identifier! the typical workaround in this case is to create the identifier beforehand with a dummy resolution or no resolution, and then to update the identifier once the object is updated and stored digitally – but before either the identifier or the object are published. (by "published", we mean that the resource is made available externally, beyond a "curation boundary" – i.e., outside of the group of people working on creating and preparing that resource.) [6] the processes for managing resources should themselves be persistent where possible, and not be bound to particular technologies or local identifiers. the more information management relies on persistent identifiers, the less risk there is that information management will be affected by changes in the local identifier. there is also less risk of the local identifier being used in preference to the persistent identifier – a practice that can ultimately leak outside the "curation boundary" and undermine the persistent identifier framework. to give an example: if version management uses local identifiers instead of persistent identifiers, the local identifier may be circulated to external users, to give them access to a specific version. if the local identifier is changed (perhaps because a new version with a different identifier has been created), the user now has a broken link. if the persistent identifier is used internally for version management to begin with (possibly in conjunction with a version parameter), there is much less risk of two competing identifiers being released externally. so the question "what steps should i take to guarantee persistence?" 
maps to three strategies, which depend on engaging the resource manager: 3. ensure that identifiers persist, despite changes to how the resources they identify are managed.   4. ensure there are mechanisms to keep the identifiers up to date.   5. exploit identifiers to manage and access the resources identified robustly, despite changes to how they are managed. these strategies correspond to the next three steps in our policy checklist: iii. decouple identifier management from information management.   iv. define the workflows for managing identifiers, integrating them into information management. (this includes automated processes for updating identifiers when resources change, as well as deciding when to assign identifiers in the data life cycle.)   v. build information management services that leverage identifiers (both for internal information management and external access). avoiding a tight coupling between resources and identifiers allows the flexibility of indirection, and makes persistence possible despite changes in local identifiers. well-integrated identifier management restores the looser coupling between resources and identifiers, so that the indirection does not end up broken when the local identifier does change. and leveraging identifiers in information management minimises the dependence on non-persistent identifiers. persistent identifiers become an added information layer, through which resources are accessed and cited with reliability and accountability. what technologies should i use to guarantee persistence? relying on a particular technology to deliver persistence in identifiers misses the point of persistence, in a sense: the identifiers should persist through changes in identifier technology, just like any other technological change. and as has often been argued, it is good policy – rather than particular technologies – that establishes persistence. the requirement placed on identifier technologies is merely that they can help realise the strategies just mentioned. for this to happen, it is critical that the identifier technology used is interoperable with the services already used in the enterprise. if that integration does not happen, the persistent identifiers will be kept in a silo separate from the day-to-day management of resources, meaning that they cannot be leveraged for improved information management. without integration, we are back where we started: the identifiers will not be kept in sync with how the objects are managed, so they will fail to persist. while persistent identifiers need to become an added information layer, that layer is useless unless it interacts effectively with existing information management systems and processes. this leads to a new strategy: 6. ensure that the persistent identifiers are integrated into existing services. however, this strategy is already addressed in our policy checklist item iv from above: iv. define the workflows for managing identifiers, integrating them into information management. the workflows do not just integrate identifiers in future information management; they need to integrate them into current processes. that means the existing services used to access resources – that will quite likely remain in use for some time. this also means that, if the persistent identifiers are to support a separate information layer, there must be enough identifier services in place to deal with all information management requirements. 
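as a rough, standard-library-only illustration of such identifier services (the paths, port, and in-memory registry below are assumptions for the sketch, not pilin service specifications), a persistent identifier might be exposed over http so that a plain get redirects to the current location while a metadata request returns a record describing what the identifier names:

```python
# Rough sketch of exposing identifier services over HTTP with the standard library;
# paths, port, and the in-memory registry are illustrative assumptions only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy registry in the spirit of the earlier sketch: pid -> metadata + current location.
REGISTRY = {
    "example/42": {
        "metadata": {"title": "Survey dataset, 2008"},
        "location": "https://archive.example.org/item/9f3c",
    }
}

class PidHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # /pid/<name>        -> retrieval: redirect to the current location, if any
        # /pid/<name>?meta   -> resolution: return a metadata record instead
        path, _, query = self.path.partition("?")
        name = path.removeprefix("/pid/")
        record = REGISTRY.get(name)
        if record is None:
            self.send_error(404, "unknown identifier")
        elif query == "meta" or record["location"] is None:
            body = json.dumps(record["metadata"]).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(303)
            self.send_header("Location", record["location"])
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), PidHandler).serve_forever()
```

creation, update, and deletion of resources could be layered onto the same uris as further http operations, which is the uri-driven, rest-style approach noted below.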
key information management services that should use persistent identifiers include retrieval: accessing the resource to which the identifier refers. resolution: working out what the identifier refers to (not necessarily the same thing as retrieving it). resource creation, update and deletion. http-rest is a good example of a uri-driven approach to providing such services: resource creation, update, and deletion, as well as reading/accessing resources, are all mediated through http operations on uris for the resources. [7] how long should identifiers persist? once persistent identifiers are released to end users, there is a clear undertaking that the identifier will in fact persist for some time. the question we were asked most often by our stakeholders was, "how long should that undertaking be for?" while we initially answered the question glibly ("say 25 years"), it became apparent that any enterprise would have its own preferred timespan, which would depend on both business-specific and external factors. so coming up with one preferred timespan, like "25 years", would be meaningless. instead, the concern over persistence is that the identifier is guaranteed to perform as expected, over however long a period makes sense for the enterprise. we define persistence accordingly as the guarantee that the identifier will be maintained over a defined and discoverable timespan – be it 25 years or 25 days. the end user can find out how long the guarantee is expected to last, and can rely on the identifier to remain operational over that timespan; depending on who they are and what they are doing, the timespan does not have to be of archival length, but it must be uninterrupted. because persistence involves an undertaking from the provider to the end user, the timespan of that undertaking should be made explicit. consumers of persistent identifiers need to plan for a time when the identifier will no longer perform as expected. identifier providers may well intend to persist the identifier indefinitely, but when others depend on the identifier persistence, hard limits are preferable to good intentions. to the end user, persistence is an expectation of trustworthiness; identifier providers must establish – and circumscribe – that trust. this area has been substantially explored in the work of john kunze, and particularly the ?? operator in the ark persistent identifier scheme [8]. what do you mean by "persistent"? if persistence involves trust, it is a trust that something will happen or that something will not happen. just declaring that an identifier will persist does not help us work out what exactly is expected to happen. several things can go wrong with an identifier over time, and these things need to be considered separately. to better understand persistence, we model it as follows: an identifier, in its purest sense, is the association of a name with a thing. an identifier does not have to be online, hyperlinked, or a url to count as an identifier. persistence is always the persistence of some property of the identifier. the properties of an identifier which can be persistent include: the name of the identifier. if the string in a url does not change over time, then its name clearly persists, even if it fails to resolve. the association of the identifier with the thing identified. if the identifier name is recorded as being associated with some thing, that association persists, even if the identifier no longer can be used to retrieve an online copy of the thing. 
example 1: .cs is known historically to have been the top level domain for czechoslovakia, even though .cs has now been deleted from the iana registry. that is, the association between .cs and the top level domain for czechoslovakia persists. example 2: a 1999 essay by leo felkin at http://www.progress.org/leon01.htm cites a flyer which "used to be online at http://www.libertymatters.org/flyers.htm". the url http://www.libertymatters.org/flyers.htm given still identifies the 1998 liberty matters flyer, because felkin's essay records that association. the association is recorded, even though the liberty matters flyer is no longer online. the applicability of a service on an identifier. the uri http://www.libertymatters.org/flyers.htm is no longer resolvable, so the retrieval service for the identifier has not persisted, even though its association has. the applicability of services can be broken down by service instance. for example, the default retrieval service associated with the uri http://www.libertymatters.org/flyers.htm (an http get on the uri itself) has failed. but (unlike in 1999) the wayback machine service now allows the flyer to be accessed through its own retrieval service, acting (indirectly) on the uri: http://web.archive.org/web/19981206012626/http://www.libertymatters.org/flyers.htm. so the url as an identifier is still retrievable – but not from its default service instance, as a locator. the applicability of services can also be broken down by service type. an identifier may be resolvable – leading to information that uniquely describes what is being referred to – without also allowing retrieval. for example, the paper identified by http://arxiv.org/abs/gr-qc/0609101 has been withdrawn from arxiv as plagiarised; but the title, authors, abstract, and publication details are still available. what we get when we click on the hyperlink is no longer retrieval, but it is resolution. the accountability of an identifier. archivists often want to know who is responsible for updating an identifier and who has created it. the common understanding of persistence is persistence of retrievability. this has resulted from the longstanding problem of uris breaking (no longer allowing retrieval), when the resource pointed to is moved or no longer maintained. but identifiers can be maintained after the resources they pointed to are no longer online; in fact this is essential for archival purposes. in that case, we would expect resolution of the identifier to a metadata record, preserving the association of the identifier with the resource (the arxiv withdrawn paper described above is an illustration of this possibility.) in terms of the long-term lifecycle of identifiers, persistence of resolution is more critical than persistence of retrievability; maintaining such metadata makes for more robust information management overall. so any guarantees of persistence need to be tied down, not only as to the period of the guarantee, but also what aspect of the identifier is being guaranteed. in particular, the longstanding conflation between resolving an identifier and retrieving a resource needs to be separated if the guarantee is to have any meaning after the resource is no longer accessible. so the question "how long should identifiers persist?" and its associated question "what do you mean by 'persistent'?" map to the following strategies: 7. establish persistence of the various aspects of identifiers as well-defined guarantees from the provider to end users. 
concretely, this corresponds to the final step in our policy checklist: vi. decide how long to commit to identifying resources, and how long to provide access to identifier services (what persistence is expected by users, and to what the identifier provider can commit). communicate these persistence decisions (make your persistence policies available). the emergence recently of data management plans, especially in research, provides a formal, contractual framework for the engagement between data managers and data users. we anticipate that identifier persistence will be an integral part of that engagement [9]. summary & checklist we have developed a six-point policy checklist for integrating persistent identifiers into information management. it is summarised as follows: with resource provider: i. have an information model for your domain.   ii. incorporate persistent identifiers into the information model. with resource manager: iii. decouple identifier management from information management.   iv. define the workflows for managing identifiers, integrating them into information management. (this includes automated processes for updating identifiers when resources change, as well as deciding when to assign identifiers in the data life cycle.)   v. build information management services that leverage identifiers (both for internal information management and external access). with resource user: vi. decide how long to commit to identifying resources, and how long to provide access to identifier services (what persistence is expected by users, and what the identifier provider can commit to). communicate these persistence decisions (make your persistence policies available). these policies are based on strategies, which in turn are based on key questions about identifiers. a relationship between the policies, strategies and key questions is laid out below: figure 2. questions, strategies, and policy checklist for persistent identifiers. acknowledgements this article reports on work done under the pilin project and the pilin ands transition project. pilin was funded by the australian commonwealth department of education, science and training (dest) under the systemic infrastructure initiative (sii) as part of the commonwealth government's backing australia's ability – an innovation action plan for the future (baa). the pilin ands transition project was funded by the australian government as part of the national collaborative research infrastructure strategy (ncris), as part of the transition to the australian national data service (ands). the authors wish to acknowledge the support and feedback of the rest of the pilin team. notes & references [1] pilin project: . pilin software outputs: . [2] on the handle system®, refer to . the pilin documentation on using http uris persistently is pilin project 2007. using urls as persistent identifiers. . all pilin project outputs are available under . [3] the background thinking to the plan is given in more detail in pilin project 2007. persistence of identifiers guidelines . [4] ifla study group. 1998. functional requirements for bibliographic records, final report. , §3.2: entities. [5] purl (persistent uniform resource locator): . "cool uris": berners-lee, tim. 1998. cool uris don't change. . [6] on the "curation boundary", see treloar, andrew, groenewegen, david & harboe-lee, cathrine. 2007. the data curation continuum: managing data objects in institutional repositories. d-lib magazine 13: 9/10. . 
[7] for the original definition of rest, see chapter 5 of fielding, roy t. 2000. architectural styles and the design of network-based software architectures. ph.d. dissertation, university of california, irvine. [8] kunze, john a. & rogers, richard p.c. 2008 [2001]. the ark identifier scheme, and kunze, john a. 2003. towards electronic persistence using ark identifiers. proceedings of the 3rd ecdl workshop on web archives, august 2003. [9] on data management plans, see e.g., chapter 5 of: fitzgerald, anne, pappalardo, kylie & austin, anthony. 2008. practical data management: a legal and policy guide. legal framework for e-research project & open access to knowledge law project, queensland university of technology. (on 19 january 2009, the digital object identifier shown at the bottom of the article web page was corrected to read: doi:january2009-nicholas.) copyright © 2009 university of southern queensland doi:10.1045/january2009-nicholas   managing digital collections survey results d-lib magazine march/april 2015 volume 21, number 3/4   liz bishoff the bishoff group liz.bishoff@gmail.com carissa smith duraspace csmith@duraspace.org doi: 10.1045/march2015-bishoff   abstract in spring 2014, duraspace commissioned the bishoff group to conduct a survey of the digital collection management practices performed by academic libraries that are not members of the association of research libraries (arl) in order to gain a better understanding of the status of their current digital content creation, management, and preservation activities. respondents included 145 institutions across all types of academic libraries, comprising two- and four-year colleges, master's, and doctorate-granting universities. the survey addressed institutional content, repositories, services, future needs and plans, staff support, and the libraries' overall commitment to digital preservation. this article reports on the results of the survey.   introduction beginning in december 2013, the not-for-profit organization duraspace began collaborating with the bishoff group in order to gain a better understanding of the status of digital content creation, management, and preservation activities underway in the non-association of research libraries (arl) academic library community. as part of the work, duraspace commissioned the bishoff group to conduct a managing digital collections survey of non-arl academic libraries. the survey was conducted march 25 - april 18, 2014.
the purpose of the survey was to collect information that: determined if non-arl academic libraries are involved in the management of institutional content (including faculty and student created content, and library created/acquired content) and if the repository is locally managed or hosted; identified responders' perceptions about the benefits of an institutional repository/digital repository (ir/dr); identified current services used by responders to support their ir/dr initiatives; identified future needs and plans for ir/dr services and software; defined information technology staffing used to support ir/dr; explored responders' commitment to digital preservation; identified the preservation service(s) respondents' use and/or are investigating and what prevents them from implementing a preservation program.   survey details the bishoff group developed and conducted the survey with input from duraspace staff. the target audience for this survey was non-arl academic libraries due to the fact that earlier studies, including the census of institutional repositories in the us: miracle project research findings [1], found that few non-arl institutions had implemented ir/drs. the survey was sent to the association of college and research libraries (acrl) listservs serving the community colleges, four-year colleges, and universities. in addition, library deans and directors from the 80 oberlin group libraries and the 22 university libraries group members received personalized emails. as a result, 49 of 80 oberlin schools responded, a 61% response rate, and 9 of 22 members of the university libraries group responded, a 41% response rate. survey responses from individuals affiliated with arl institutions were eliminated and are not included in this report. the survey asked each respondent to choose a carnegie classification for their institution. these classifications include associate colleges (community/junior colleges); baccalaureate colleges (four-year colleges); master's colleges and universities; and doctorate-granting universities. further analysis will use the different classifications. of the 145 organizations that responded to the survey, 61 (42%) were from baccalaureate colleges; 34 (23%) respondents each from master's colleges and universities and doctorate-granting universities; and 12 (8%) responses from associate colleges. deans and directors were the major respondents at 83 (57%), while 22 (15%) were assistant directors. further, 20 (14%) respondents indicated that they held other positions, including chief information officers or chief information officers/library deans.   results   digital content a significantly large number, 117 (81%) of the respondents reported that they are currently creating and/or acquiring digital content, such as the output of faculty and student research, institutional records, dissertations and theses, and digital library collections. of the 27 (19%) respondents who indicated that they were not currently creating or collecting digital content, 14 (52%) said that they planned to in the next one to three years, 6 (22%) indicated no, while 7 (26%) did not know. these 13 respondents who answered either no or did not know exited from the survey.   institutional repository/digital repository prevalence of the organizations that currently have digital content, 94 (72%) have an ir/dr that is either locally managed or hosted, 26 (20%) do not, while 11 (8%) did not know. 
across all types of organizations, there is a high level of implementation of ir/dr, with 51 baccalaureate institutions reporting use of an ir/dr. further, 73 of 102 (72%) respondents indicated that their ir/dr is a hosted service, while 29 (28%) identified that their organization is leveraging a locally managed service.

does your organization have an ir/dr, either a locally managed repository or a repository managed by a third party that hosts your collections?
                 response percent   response count
  yes            72%                94
  no             20%                26
  don't know     8%                 11
  total          100%               131 (14 skipped)

institutional repository/digital repository benefits respondents were asked to rate the importance of a list of benefits of an ir/dr on a scale of 1 to 5, where 1 was not important and 5 was extremely important. the benefits are based on the list used in the 2007 census of institutional repositories of the united states. the following seventeen choices were included in the survey: capture the intellectual capital of my institution (faculty and student research); expose my institution's intellectual output to researchers around the world who would not otherwise have access; increase the library's role as a viable partner in the research enterprise; provide long-term preservation of my institution's digital output; enable offering better service to my institution's learning community; increase the access to knowledge assets including audio, video, data sets, etc.; boost my institution's prestige; maintain control over my institution's intellectual property; enable the offering of new services to learning communities beyond my institution; reduce the amount of time between discovery and dissemination of research findings; increase citation counts to my institution's intellectual output; reduce user dependence on my library's print collections; provide a means of promoting undergraduate research activity; maintain strong connections with alumni; provide a platform for my library's e-publishing program; support my institution's open access mandate; and support my institution's data management policy and plan. figure 1 below shows the four most frequently selected benefits. figure 1: most frequently selected benefits ("very important" responses appear in red and "extremely important" responses appear in blue.) benefits selected as "somewhat important" and "important" are shown in figure 2 below. of the benefits listed as not very important, respondents selected reducing user dependence on the library's print collections most often (42%), followed by supporting the institution's open access mandate (32%) and providing a platform for the library's e-publishing program (24%). figure 2: other benefits chosen ("somewhat important" responses appear in red and "important" responses appear in blue.)   digital preservation policies respondents indicated that there is an overall commitment to digital content preservation at their organizations, with 66 positive responses (66%). however, 19 (19%) individuals indicated that there is not a commitment to digital preservation at their organization and 15 (15%) did not know. of 100 respondents, exactly half, 50 (50%), indicated that they are actively preserving content, while 35 (35%) noted they were planning on implementing a digital preservation program in the future. ten (10%) and 5 (5%) responded that they are not preserving content or didn't know, respectively. when asked what was preventing the implementation of a digital preservation program, the following reasons were identified.
what is preventing your library from implementing a digital preservation program? (select all that apply)
                                   response percent   response count
  lack of funding                  73%                38
  other priorities                 56%                29
  lack of technical expertise      23%                12
  other                            23%                12
  lack of administrative support   21%                11
  we don't know where to start     8%                 4
  don't know                       2%                 1
  total                            100%               52 (93 skipped)

digital preservation practices when asked which digital preservation strategies respondents' libraries have implemented, 66 individuals selected data backup, followed by 32 choosing outsourcing to an externally managed preservation repository. quite a few respondents opted to describe other alternatives by adding a written comment to the survey. these optional responses included: cloud space for limited digital assets, duracloud, aptrust, and resourcespace. in regard to funding digital preservation activities, many respondents either chose funding through another line in the regular budget (37 responses) or don't know (32 responses). further responses to this inquiry were quite dispersed, suggesting that direct funding for preservation activities is not standardized in the non-arl academic library community.   key findings creation/acquisition of digital content one hundred seventeen of the 145 (81%) respondents report that they are currently creating and/or acquiring digital content, such as the output of faculty and student research, institutional records, dissertations and theses, and digital library collections. furthermore, 14 of the 27 (52%) respondents who indicated that they were not currently creating or collecting digital content noted that they planned to create or acquire digital content in the next one to three years. implementation of an ir/dr ninety-four of the 145 (65%) respondents have a locally managed or hosted ir/dr. of those who have not implemented an ir/dr, associate and baccalaureate colleges comprised the largest group, with 8 and 10 respondents, respectively. hosted versus locally managed ir/dr the majority of ir/drs are hosted services: 73 respondents use hosted services, while 29 are locally managed. key features for ir/dr survey respondents indicated important features for future ir/dr enhancements, including greater capacity for handling digital preservation, friendlier digital content submission, (greater) availability of hosted services, and reduced local it support. digital preservation digital preservation is a priority, as respondents indicated that they have a commitment to digital preservation through mission/strategic plans. further, respondents indicated that they have executed digital preservation programs largely through participation in collaborative efforts such as portico and lockss. however, when asked about barriers to implementation of digital preservation programs, they identified: lack of funding, other priorities, lack of expertise, lack of administrative support, and not knowing where to start.   conclusions this survey of the non-arl academic library community was undertaken to better understand the state of digital content management activities, including digital preservation policies and practices. based on the key findings the following conclusions were drawn: non-arl libraries and digital collections: since the 2007 miracle project study, there has been widespread implementation of digital content management programs and ir/drs in non-arl academic libraries.
as a result, the survey identified the trend of libraries now being able to focus their concerns and efforts on growing their collections, expanding the use of the collections, as well as addressing the issues around long term access to the collections through digital preservation. access versus preservation: digital preservation is a high priority in the non-arl academic library community, in spite of the fact that few of the respondents had campus-wide open access mandates or indicated a need to support research data management. as the target of the survey was baccalaureate and master's degree granting institutions this particular finding in the survey is a logical conclusion. hosted versus locally managed services: hosted services are being rapidly adopted by the non-arl academic community.   future work based on the conclusions drawn from the survey, several themes became apparent that further research and resource development will help to better address. specifically, the following categories emerged: comparison of hosted service options: academic libraries see hosted services as a viable technology strategy and, therefore, there has been rapid adoption of hosted services for ir/drs. the features and functionality of hosted services vary widely, particularly as relates to their digital preservation service capabilities. a side-by-side comparison of the functionality and available options would advance the use of hosted services, and assure that academic libraries can realize their goal of long-term access to and use of their collections. opportunity offered through hosted services: today's hosted services offer benefits not available in most locally managed ir/dr software. the hosted solutions available currently address many of the future features identified in the survey such as the need to integrate digital preservation with the ir/dr service, the need to reduce dependence on local it, and the need for easier loading of collections. a further analysis of the available services could be useful to the non-arl and broader academic library community. barriers to implementation of a digital preservation program: at the same time that hosted services address the future features detailed above, they inherently solve the expressed barriers — the lack of local expertise and other priorities — that have prevented implementation of digital preservation programs in the past. by offering a full end-to-end hosted repository service that incorporates digital preservation functionality, the academic library has a single decision making process that can be incorporated in the work already underway to implement an ir/dr, such as provisioning funding, getting administrative support, and determining where to start. resources that assist organizations in making decisions, evaluating content and collections, and selecting the tools and services that best meet their needs and budget could be of great use. since this survey was conducted, the digital powrr (preserving objects with restricted resources) research project published from theory to action: good enough digital preservation for under-resourced cultural heritage institutions [3] that addresses some of the suggestions mentioned here. we look forward to more work taking place in these important areas.   references [1] markey, karen, et al. census of institutional repositories in the united states: miracle project research findings. washington, dc, clir, february, 2007. [2] nykanen, melissa. 
(2011). institutional repositories at small institutions in america: some current trends. journal of electronic resources librarianship, 23(1), 1-19. http://doi.org/10.1080/1941126x.2011.551089

[3] schumacher, jaime, et al. (2014). from theory to action: good enough digital preservation for under-resourced cultural heritage institutions. huskie commons, northern illinois university.

about the authors

liz bishoff is the owner of the bishoff group, a library and cultural heritage consulting organization. ms. bishoff works with archives, libraries and museums on various aspects of their digital initiatives, including the development of digital plans, the adoption of standards and best practices, and preservation programs. since 2005 she has conducted more than 45 digital preservation readiness assessments and has served as faculty on several imls- and neh-funded workshops. she has held senior management positions at oclc and bcr. she is a member of ala, holds an mls from rosary college, and has done post-graduate work in public administration at roosevelt university.

carissa smith is the product manager for the duracloud and dspacedirect services from the duraspace not-for-profit organization. ms. smith works with a wide range of organizations, including university libraries and archives, state and government agencies, and cultural heritage institutions, assisting them with their digital preservation, open access, and repository needs. she has presented at various conferences, including the library of congress digital preservation meeting, the open repositories conference, the digital library federation forum, and ala. she holds a bs from syracuse university in information management and technology.

copyright © 2015 liz bishoff and carissa smith

a metadata registry for the semantic web

d-lib magazine may 2002, volume 8 number 5, issn 1082-9873

rachel heery, research & development, ukoln <[email protected]>
harry wagner, office of research, oclc / dcmi <[email protected]>

abstract

the semantic web activity is a w3c project whose goal is to enable a 'cooperative' web where machines and humans can exchange electronic content that has clear-cut, unambiguous meaning. this vision is based on the automated sharing of metadata terms across web applications. the declaration of schemas in metadata registries advances this vision by providing a common approach for the discovery, understanding, and exchange of semantics. however, many of the issues regarding registries are not clear, and ideas vary regarding their scope and purpose. additionally, registry issues are often difficult to describe and comprehend without a working example. this article explores the role of metadata registries and describes three prototypes, written by the dublin core metadata initiative. the article outlines how the prototypes are being used to demonstrate and evaluate application scope, functional requirements, and technology solutions for metadata registries.

introduction

establishing a common approach for the exchange and re-use of data across the web would be a major step towards achieving the vision of the semantic web. the semantic web activity statement articulates this vision as '…having data on the web defined and linked in a way that it can be used for more effective discovery, automation, integration, and reuse across various applications.
the web can reach its full potential if it becomes a place where data can be shared and processed by automated tools as well as by people.' [i] in order to move towards this vision, we need to consider the tools on which such 'data sharing' will be based. in parallel with the growth of content on the web, there have been increases in the amount and variety of metadata to manipulate this content. an inordinate amount of standards-making activity focuses on metadata schemas (also referred to as vocabularies or data element sets), and yet significant differences in schemas remain. different domains typically require differentiation in the complexity and semantics of the schemas they use. indeed, individual implementations often specify local usage, thereby introducing local terms to metadata schemas specified by standards-making bodies. such differentiation undermines interoperability between systems. certainly unnecessary variations in schemas should be avoided wherever possible, but it is impossible, and undesirable, to try to impose complete uniformity. innovation emerges from such differentiation. this situation highlights a growing need for access by users to in-depth information about metadata schemas and particular extensions or variations to schemas. currently, these 'users' are human — people requesting information. increasingly, such 'users' will be automated — 'agents' as part of applications that need to navigate or query schemas. it would be helpful to make available easy access to schemas already in use to provide both humans and software with comprehensive, accurate and authoritative information. the w3c resource description framework (rdf) [ii] has provided the basis for a common approach to declaring schemas in use. at present the rdf schema (rdfs) specification [iii] offers the basis for a simple declaration of schema. it provides a common data model and simple declarative language. additional work is underway in the context of the w3c's rdfcore working group [iv] and the web ontology group [v] to add 'richness' and flexibility to the rdf schema language, to incorporate the features of the darpa agent markup language (daml) [1] and the ontology interface layer (oil) [2] ontology language [vi], and to bring this work to recommendation status. even as it stands, an increasing number of initiatives are using rdfs to 'publish' their schemas. metadata schema registries are, in effect, databases of schemas that can trace an historical line back to shared data dictionaries and the registration process encouraged by the iso/iec 11179 [vii] community. new impetus for the development of registries has come with the development activities surrounding creation of the semantic web. the motivation for establishing registries arises from domain and standardization communities, and from the knowledge management community. examples of current registry activity include: agencies maintaining directories of data elements in a domain area in accordance with iso/iec 11179 (this standard specifies good practice for data element definition as well as the registration process. 
example implementations are the national health information knowledgebase hosted by the australian institute of health and welfare [viii] and the environmental data registry hosted by the us environmental protection agency [ix].); the xml.org directory of extensible markup language (xml) document specifications, facilitating re-use of document type definitions (dtds), hosted by the organization for the advancement of structured information standards (oasis) [x]; the metaform database of dublin core usage and mappings maintained at the state and university library in goettingen [xi]; the semantic web agreement group dictionary, a database of terms for the semantic web that can be referred to by humans and software agents [xii]; lexml, a multi-lingual and multi-jurisdictional rdf dictionary for the legal world [xiii]; the schemas registry maintained by the european commission-funded schemas project, which indexes several metadata element sets as well as a large number of activity reports describing metadata-related activities and initiatives [xiv].

metadata registries essentially provide an index of terms. given the distributed nature of the web, there are a number of ways this can be accomplished. for example, the registry could link to terms and definitions in schemas published by implementers and stored locally by the schema maintainer. alternatively, the registry might harvest various metadata schemas from their maintainers. registries provide 'added value' to users by indexing schemas relevant to a particular 'domain' or 'community of use', and by simplifying the navigation of terms by enabling multiple schemas to be accessed from one view. an important benefit of this approach is an increase in the reuse of existing terms, rather than users having to reinvent them. merging schemas into one view leads to harmonization between applications and helps avoid duplication of effort. additionally, the establishment of registries to index terms actively being used in local implementations facilitates the metadata standards activity by providing implementation experience transferable to the standards-making process.

scope and functionality

the dublin core metadata initiative (dcmi) has defined a relatively small set of data elements (referred to within the dcmi as the dcmi vocabulary or dcmi terms) for use in describing internet resources, as well as to provide a base-line element set for interoperability between richer vocabularies. the dcmi has long recognized the need to provide users with enhanced access to information about these terms in the form of an added-value 'information service'. this service should include information about the dcmi vocabulary over and above that provided by the rdf schema, or any explication of the vocabulary on the dcmi web site. this service is intended to assist humans and applications to obtain reliable and trusted information about the dcmi. the interests of a variety of dcmi members have converged around this idea, including: those interested in building a generic schema registry, those interested in expressing the rich structure of ontologies in a schema language, those interested in providing a user-friendly search interface to the dcmi vocabulary, and those interested in effectively managing the evolution of the dcmi vocabulary. it was with these goals, and this level of interest, that the dcmi registry working group [xv] was chartered.
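to make the harvesting and indexing role described above concrete, here is a minimal sketch of how a registry-style application might fetch a published rdf schema and list its declared properties with their labels. it is written against the current apache jena api rather than the eor/stanford toolkit used in the prototypes discussed below, and the schema url is purely illustrative (it assumes the server returns rdf for the request).

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RDFS;

public class SchemaHarvester {
    public static void main(String[] args) {
        // url of a published rdf schema; used purely as an illustration
        String schemaUrl = "http://purl.org/dc/terms/";

        // fetch and parse the schema into an in-memory model
        Model model = ModelFactory.createDefaultModel();
        model.read(schemaUrl);

        // index every declared property together with its human-readable label(s)
        ResIterator properties = model.listSubjectsWithProperty(RDF.type, RDF.Property);
        while (properties.hasNext()) {
            Resource term = properties.next();
            StmtIterator labels = term.listProperties(RDFS.label);
            while (labels.hasNext()) {
                Literal label = labels.next().getLiteral();
                System.out.println(term.getURI() + "\t" + label.getLexicalForm()
                        + " (" + label.getLanguage() + ")");
            }
        }
    }
}
```

a production registry would persist the harvested terms and attach administrative metadata (title, publisher, description, and so on) to each registered schema, which is exactly the kind of functionality the prototypes described below experiment with.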
the original charter for the dcmi registry working group was to establish a metadata registry to support the activity of the dcmi. the aim was to enable the registration, discovery, and navigation of semantics defined by the dcmi, in order to provide an authoritative source of information regarding the dcmi vocabulary. emphasis was placed on promoting the use of the dublin core and supporting the management of change and evolution of the dcmi vocabulary. the overriding goal has been the development of a generic registry tool useful for registry applications in general, not just useful for the dcmi. the design objectives have been to provide a tool suitable for use for the dcmi registry while also ensuring the registry was sufficiently extensible to include other, non-dcmi schemas. in addition, the dcmi registry working group has been committed to using open standards and developing the software from open-source distributions [3]. discussions within the dcmi registry working group (held primarily on the group's mailing list [4]) have produced draft documents regarding application scope and functionality. these discussions and draft documents have been the basis for the development of registry prototypes and continue to play a central role in the iterative process of prototyping and feedback. application scope and functional requirements have evolved through an iterative process of prototyping, discussion and evaluation. many aspects of functionality have been identified using this process, including: selecting a user interface that is both meaningful and user-friendly to the largest possible set of users (for example, metadata specialists, rdf experts, and automated agents each require different interfaces [xvi]); automating the identification of relationships between terms in vocabularies, and linking those terms in a way that facilitates their discovery and navigation; implementing a metadata registry that is multilingual, both from a user interface and a data perspective; identifying appropriate administrative metadata to describe the registered schemas themselves (i.e., title, publisher, description, etc.); associating encoding schemes (such as date formats and identifier schemes) with data elements, to enable the identification and navigation of encoding schemes belonging to different resource communities in a way that is both flexible and scalable; evaluating various methods of persistent storage of metadata terms for a solution that is both practical and scalable. an important focus of the prototyping effort has been to provide a dcmi oriented view of the data model that exists in the underlying rdf schema. this enables users to easily visualize the 'grammar' that structures the dcmi semantics (elements, element qualifiers and encoding schemes). in this article, the 'classification' of dcmi terms is referred to as a taxonomy. a user interface based around this taxonomy is felt to be particularly important for the novice user. another significant issue, and one not easily solved, has been the requirement to manage the evolution of the dcmi vocabulary. this has involved balancing our internal vocabulary management requirements (audit trail, versioning, status of proposed terms) with the provision of a clear and authoritative view of the current vocabulary. developing with prototypes many of the issues regarding metadata registries are unclear and ideas regarding their scope and purpose vary. 
additionally, registry issues are often difficult to describe and comprehend without a working example. the dcmi makes use of rapid prototyping to help solve these problems. prototyping is a process of quickly developing sample applications that can then be used to demonstrate and evaluate functionality and technology. the following sections describe three metadata registry prototypes that were written by the dcmi [5]. they serve two purposes: to facilitate the generation of scope and functional requirements, and to assist in the identification of practical technology solutions for registry applications. while each of the prototypes provides a different solution, they have several features in common. for example, the prototypes are all open-source solutions, built entirely from open-source distributions [6]. all three prototypes rely on rdf schemas for their input data format, and all three are multilingual [7] java servlet applications. from there the prototypes diverge, each providing a different solution and each building on experience gained from the previous prototype(s). prototype 1 prototype 1 is a database solution, based on the extensible open rdf toolkit (eor) [xvii]. eor is an open-source rdf toolkit providing a collection of java classes and services designed to facilitate the rapid development of rdf applications. eor is based on the stanford rdf api [8], which provides classes and methods necessary for processing rdf data (i.e., parsing, serializing, etc.). prototype 1 is essentially a search and navigation service for rdf schemas. it extends the eor search service by providing a compound search function, which is required for several of the search types, such as "find x in relation to y" and "find classes containing the following term". the user interface (ui) for prototype 1 is forms-based (see figure 1) and is a combination of java server pages (jsp) and extensible stylesheet language transformation (xslt). the ui provides both standard and rdf interfaces, intended to serve two different types of users: metadata specialists and rdf experts. the differences between the two interfaces are the types of searches supported and the labels used for the search result set. for example, the standard interface provides queries for "find refinements for term x". the corresponding rdf interface query is "find terms that are a subproperty (or subclass) of x". additionally, the standard interface supports the query "find registered encoding schemes for term x". this is a good example of functionality that supports local terminology relevant to its target audience (in this case, those interested in the dcmi). one can envisage other 'community specific' registries orientating their ui to specialized audiences (e.g., the ims community [9]). the registry also provides an 'rdf interface' that takes account of the underlying dcmi taxonomy of terms, but uses rdf terminology for the user interface, rather than the more familiar dcmi terminology. figure 1. standard user interface an additional difference between the standard and rdf interface is in the labels generated for query results. standard interface labels are natural-language oriented and translated into each of the supported languages. the rdf interface uses the fully qualified predicate (namespace-uri and local-name) for its labels. one feature the standard and rdf interfaces have in common is that every resolvable resource in the result-set is an html link either to that resource or to a symbol used to link and navigate terms. 
query results often include both. for example, clicking on any of the labels or resources generated by the query in figure 2 would resolve to that resource. clicking on the "show refinements" symbol () would generate a new search for all refinements (subclassof or subpropertyof) of that term. the "new search" symbol () generates a new query for that particular item. this provides a simple means to navigate terms and illustrates the relationship between terms. prototype 1 uses a postgresql relational database [10] as a persistent data store. postgresql was chosen because it is one of the few open-source database management systems (dbms) that support unicode. separate databases are maintained for each language. this is due to the limited support within eor and the stanford rdf api for the xml:lang attribute. this distribution of translations works fine for displaying results in a particular language, but it limits the application's ability to simultaneously display all the translations for a particular term (i.e., display all translations for the term "title"). figure 2. hyperlinked query results multilingual support for the user interface is provided using java resource bundles [11]. resource bundles are java classes that are essentially tables of locale-specific translations. a separate bundle was created for each supported language. these bundles are loaded into the client session whenever a language selection is made. the registry uses the resource bundles to display the user interface in the selected language. ontology relationships (i.e., complementof, equivalentto, inverseof, etc.), as are provided with daml+oil, are supported by the stanford rdf api via the daml_o class. however, this layer of support is not currently part of the eor toolkit and is not supported by this prototype. solutions built on rdf toolkits such as eor can be implemented quickly due to the extensive classes and services provided by the toolkit (only a small number of changes to eor were required to produce prototype 1). however, this is a complex and resource-intensive approach, which did not perform as well as other solutions. prototype 2 prototype 2 is an extremely lightweight, in-memory solution. it is a small java servlet application and differs significantly from the other prototypes in that it does not use an rdf api. data is parsed using an xml parser, and queries are resolved on the client-side using xslt stylesheets. all input data is in the form of rdf schemas, which are stored locally as sequential ascii files [12]. the schemas are loaded by the xslt stylesheets and rely on the search servlet to identify which schemas to load. this information is passed from the servlet to the stylesheets as parameters. all non-ascii characters in the schemas are required to be escape-sequenced. the server-side processing for this prototype is very simple and is the key to its flexibility. the search servlet is the primary component and consists of less than 200 lines of code. this servlet evaluates the input data that was posted and sets parameters identifying the specific request and the current language selection. these are then passed to the appropriate stylesheet for processing. figure 3. point-and-click style user interface the user interface has an intuitive point-and-click style, designed to facilitate navigation rather than searching [13] (see figure 3). the ui of the search service is composed entirely of xslt stylesheets. the stylesheets are segregated by function (banner, navigation bar, footer, etc.) 
and language. the multilingual aspects of the ui are managed by isolating all language-dependent text to individual stylesheets, and including those stylesheets as needed. in prototype 2, the translations were isolated to the intro, navbar and labels stylesheets. non-ascii characters in these stylesheets are encoded as escape-sequenced strings. prototype 2 provides one interface style. this ui is designed to provide native-language labels for terms. however, due to the modular approach of the xslt stylesheets, additional interfaces (i.e., the rdf interface provided in prototype 1) could be added. all resources in the result-set are displayed as html links that resolve either to the listed resource, or that initiate a new query for the listed resource. this includes both labels (predicates) and property values (objects) and allows users to easily navigate the metadata terms and explore the relationship between terms. due to the simplified data model of prototype 2, refinements for terms (subclassof or subpropertyof) are much easier to discover and navigate, and can be displayed in a more intuitive manner. all refinements for each selected term are automatically included in query results (see figure 4). support for ontologies is limited. complex relationships (such as those provided with daml+oil) between terms cannot be easily automated without an rdf parser, and would be difficult to implement with this type of lightweight solution. the modular approach of the stylesheets does offer, to a greater degree, more flexibility regarding local dcmi vocabulary taxonomy (i.e., elements, qualifiers, encoding schemes, controlled vocabulary terms). however, a method to automate the discovery and navigation of this taxonomy has not yet been discovered. figure 4. discovery and navigation of refinements lightweight, in-memory solutions, such as prototype 2, can be implemented quickly, and are flexible and fairly simple to maintain. however, they are not expected to scale well due to the limitations of in-memory processing. this may be perfectly acceptable for applications that are able to limit the number of schemas they plan to register. prototype 3 the third prototype, like prototype 1, is a database solution. it is based on the jena semantic web toolkit [14], the arp rdf parser [15] and the berkeleydb [16] database management system. jena is a functionally rich java api for processing rdf data. it supports both in-memory and persistent storage rdf models, rdf data query language (rdql), integrated rdf parsers (i.e., arp and sirpac) and a rich set of resource-centric and model-centric methods for manipulating rdf data. jena and the arp parser are open-source and were developed by the hp labs semantic web activity. prototype 3 uses berkeleydb for persistent data storage. berkeleydb is an open-source database management system that provides a comprehensive set of features, including data concurrency, recovery and transaction support. it differs from most database management systems in that it is neither relational nor object-oriented. databases are composed of simple key-value pair records. berkeleydb performs better than relational database systems for a couple of reasons. first, it uses simple function calls to access data, thus eliminating the overhead of a query language, such as sql. second, it is an "embedded" database, running in the same address space as the application that uses it. this eliminates the need for inter-process communication. 
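as a rough illustration of the 'simple function calls' point above, the following sketch stores and retrieves a single key-value record with the berkeley db java edition api (com.sleepycat.je). the directory, database name and key/value strings are illustrative, and the exact berkeleydb binding used by prototype 3 in combination with jena may differ from this one.

```java
import com.sleepycat.je.*;
import java.io.File;
import java.nio.charset.StandardCharsets;

public class KeyValueDemo {
    public static void main(String[] args) {
        // open (or create) an embedded environment in a local directory
        File dir = new File("registry-store");
        dir.mkdirs();
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(dir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "terms", dbConfig);

        // a record is just a key-value pair of byte arrays: no sql, no query parsing
        DatabaseEntry key = new DatabaseEntry(
                "http://purl.org/dc/elements/1.1/title".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry("title".getBytes(StandardCharsets.UTF_8));
        db.put(null, key, value);

        DatabaseEntry found = new DatabaseEntry();
        if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(new String(found.getData(), StandardCharsets.UTF_8));
        }

        db.close();
        env.close();
    }
}
```

the absence of any query language in this round trip is precisely the overhead saving referred to above.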
berkeleydb appears to be a perfect solution for a persistent rdf data store, but the interface between jena and the database is not completely stable and can result in unexpected system failures. prototype 3 is also a java servlet application. it builds on the previous prototypes and is composed of a number of services, including search/navigate, registration, login, and import. the user interface is written entirely in xslt. each of the services produces an xml document that is then parsed using xslt. the xslt stylesheets produce html, which is delivered to the client. one ui style is currently provided, which produces native-language style labels (i.e., dcmi data element identifiers such as title, description, publisher, etc.). the search/navigate service includes features from both of the previous prototypes. the user interface has a more advanced search engine than prototype 1 (see figure 5) and incorporates the point and click navigation style of prototype 2. the search form enables case-sensitive searching and supports specific types of searches (i.e., match all terms, match any term, etc.). this function can be easily expanded to further refine searches to a specific schema or taxonomy. figure 5. search service like prototype 2, this application uses an intuitive point-and-click style of navigation. as with the previous prototypes, all resources are displayed as html links that resolve to the listed resource or initiate a new query for that resource. additionally, "canned" queries are provided for displaying registered schemas, properties, classes, namespaces, or a summary of all registered items. the register, login, and import services provide a means for authorizing users and for allowing those users to register schemas. additionally, the import service provides two very important functions: an alternative means for users to specify the schema language, and two methods for specifying the administrative metadata associated with a schema. metadata describing the schema itself is required for each registered schema. this includes title, description, publisher, url of related documents, and date. this information can be included as part of the schema, as is the case with the dcmi schemas, or can be provided when the schema is imported using the fields provided on the import screen (see figure 6). figure 6. import service the schema language must also be specified when the schema is registered. this can be accomplished in two ways: by coding the xml:lang attribute on all literals within the schema [xviii] (see figure 7) or by specifying the language at registration time using the field provided on the import screen. identifying the schema language during registration enables the discovery and navigation of terms and schemas in multiple languages. users can specify their language preference, using the preferences service, by selecting from a list of supported ui languages and result-set languages. the search service selects the appropriate query results based on the language preferences selected and the language encoding used by the schemas. figure 7. language identification using xml:lang the multilingual user interface is accomplished via an xslt "translate" stylesheet. this stylesheet is passed the term requiring translation and the requested language. it uses language-specific xml documents, similar to java resource bundles, to perform the translations. translations are done both for the user interface and the result-set labels. 
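the language handling described above can be sketched as follows, again using the current apache jena api (the 2002-era hp labs jena used by prototype 3 had different package names but equivalent calls). the xml:lang tags on rdfs:label literals are used to pick a label that matches the user's language preference; the term and schema urls are illustrative.

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDFS;

public class LabelLookup {
    // return the rdfs:label of a term in the requested language, falling back to any label
    static String labelFor(Model model, String termUri, String lang) {
        Resource term = model.getResource(termUri);
        String fallback = null;
        StmtIterator labels = term.listProperties(RDFS.label);
        while (labels.hasNext()) {
            Literal label = labels.next().getLiteral();
            if (lang.equals(label.getLanguage())) {
                return label.getLexicalForm();      // exact xml:lang match, e.g. "fr"
            }
            if (fallback == null) {
                fallback = label.getLexicalForm();  // remember the first label seen
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("http://purl.org/dc/terms/");    // illustrative schema url
        System.out.println(labelFor(model, "http://purl.org/dc/terms/title", "en"));
    }
}
```

in prototype 3 the selected result-set language works like this filter, while the interface labels are translated separately by the xslt "translate" stylesheet.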
prototype 3 appears to be the most promising of the three solutions. the extensive functionality for manipulating rdf data provided by the jena api is a major advantage over the other prototypes. this is especially significant regarding support for daml ontologies.

conclusions

software and performance issues

java proved an ideal solution for multilingual registry applications because of its native unicode string representation and because of the rich application-programming interface provided for internationalization [xix]. one significant difference between the multilingual solutions we tried is the xml:lang support offered by the jena api. this attribute simplifies language processing and enables flexibility not easily offered by other solutions. relational database solutions, such as the one tried with prototype 1, do not appear to be an ideal fit for rdf, at least not from a performance standpoint. the overhead of using a high-level data access language (such as sql) with a data model as minimalist as rdf is noticeable, and performance suffers. berkeleydb, in conjunction with the jena api, offers significantly better performance, but is not 100% stable. this problem is expected to be resolved in the near future. in the interim, the best choice for a persistent data store for rdf metadata registries appears to be the slower-performing, but more reliable, relational model.

prototype 3 demonstrates a solution for automating the identification and management of administrative metadata that describes the registered schemas (i.e., title, publisher, description, etc.). a standard method for describing administrative metadata for schemas does not currently exist. prototype 3 resolves this issue by providing two alternative methods for capturing this information. the modular use of xslt stylesheets has proven to be a good solution for providing user interfaces that can be customized to suit multiple user types. this level of functionality is demonstrated with prototype 1, and could be easily adopted by the other prototypes due to their common use of xslt.

functionality and data model issues

there has been an ongoing tension between the requirements for managing the evolution of the dcmi vocabulary and the requirement to provide user-friendly navigation of the dcmi vocabulary. the internal vocabulary management requirements of the dcmi (i.e., audit trail, versioning, tracking status of proposed terms, etc.) could not easily be reconciled with the requirements for an "open" metadata registry and were not addressed by the three prototypes described in this article. the principal obstacle to satisfying vocabulary management requirements has been a lack of data: the dcmi rdf schemas do not include this level of information. although several solutions were proposed, including overloading the schemas to include this data and maintaining multiple data sources, none were considered acceptable. while this may be considered a complication particular to metadata 'standards-making bodies', such as the dcmi, it is also, to a lesser extent, an issue faced by all schema maintainers. the dcmi has chosen to split its vocabulary management requirements into a separate application. it will be interesting to see whether such a system finds more widespread use amongst schema maintainers. the difficulty of formulating a 'definitive' set of rdf schemas within the dcmi, one that can serve as the recommended, comprehensive and accurate expression of the dcmi vocabulary, has hindered the development of the dcmi registry.
to some extent, this has been due to the changing nature of the rdf schema specification and its w3c candidate recommendation status. however, it should be recognized that the lack of consensus within the dcmi community regarding the rdf schemas has proven equally impeding.

the identification and navigation of the 'taxonomy' (the 'grammar' of a vocabulary arising from the data model) belonging to different resource communities is still an open issue. each of the prototypes demonstrates a limited degree of functionality in this area, but none provides a method for automating it. until an automated method can be found, "canned queries" and other navigational aids to these structures will be limited to widely accepted taxonomies (i.e., properties, classes, etc.) and locally defined taxonomies (such as dcmi elements, qualifiers, etc.) that are manually maintained.

the automated sharing of metadata across applications is an important part of realizing the goal of the semantic web. users and applications need practical solutions for discovering and sharing semantics. schema registries provide a viable means of achieving this. much has been learned from the prototyping efforts to date, and the dcmi has renewed its commitment to developing an operational registry that facilitates the discovery, exchange and reuse of semantics.

acknowledgements

the authors would like to acknowledge the dcmi registry working group and other members of the dcmi for their contribution to the formulation of ideas expressed in this article and for their support in the development work.

notes and references

[1] the darpa agent markup language is accessible at .
[2] the ontology interface layer is accessible at .
[3] source code for the registry applications is available at .
[4] the registry working group archives are available at .
[5] the prototypes are accessible at .
[6] java is developed using sun's java community process. this process is dedicated to open standards and has many similarities to open-source, but is not technically considered open-source.
[7] one of the known requirements for the metadata registry is that it be multilingual, both from a user interface (ui) and a data perspective. all three of the prototypes were written with this requirement in mind, and several different approaches were used. the dublin core element set is currently translated into fourteen different languages, of which six were selected to serve as proof of concept. the selected languages include both single- and double-byte character sets (i.e., spanish and japanese). due to the cost of the translations, and the temporary nature of the prototypes, only portions of the data and user interface (enough to serve as proof of concept) were translated in each application.
[8] the stanford api is accessible at .
[9] the ims global consortium, inc. () is an organization committed to the development and promotion of metadata standards related to various aspects of distributed learning.
[10] postgresql is accessible at .
[11] the resourcebundle class documentation () is a good source of information regarding resource bundles and their use.
[12] this was done as a temporary measure. a more practical approach for a production version of this prototype would be to store the url of the registered schemas and load them via the network at system startup.
[13] a limited degree of searching is still possible due to the internal search function provided with most web browsers.
[14] jena is accessible at .
[15] the arp parser is accessible at .
[16] berkeleydb is accessible at .

resources

[i] w3c semantic web activity group. accessed february 11, 2002. .
[ii] resource description framework (rdf). accessed february 11, 2002. .
[iii] resource description framework (rdf) schema specification 1.0. w3c candidate recommendation 27 march 2000. work in progress. .
[iv] rdfcore working group. accessed february 11, 2002. .
[v] w3c web-ontology (webont) working group. accessed february 11, 2002. .
[vi] daml+oil web ontology language. accessed january 14, 2002. .
[vii] iso/iec 11179-1:1999 specification and standardization of data elements. part 1: framework for the specification and standardization of data elements (available in english only). international standards organisation.
[viii] national health information knowledgebase. hosted by the australian institute of health and welfare. accessed february 11, 2002. .
[ix] environmental data registry. accessed february 8, 2002. .
[x] the xml registry hosted by oasis. accessed february 11, 2002. .
[xi] metaform: database containing dublin core manifestations and other metadata formats. hosted at the state and university library in goettingen. accessed february 11, 2002. .
[xii] swag dictionary. accessed february 11, 2002. .
[xiii] lexml: open source development of an rdf dictionary. accessed february 11, 2002. .
[xiv] schemas forum for metadata implementors registry. accessed february 11, 2002. .
[xv] dublin core metadata initiative. dcmi registry working group. accessed january 14, 2002. .
[xvi] dublin core metadata initiative. dcmi registry functional requirements. rachel heery. accessed january 10, 2002. .
[xvii] oclc office of research. dublin core metadata initiative. extensible open rdf toolkit. accessed january 14, 2002. .
[xviii] world wide web consortium. rdf schema for the rdf data model. accessed january 15, 2002. .
[xix] sun microsystems. internationalization. accessed january 15, 2002. .

copyright © rachel heery and oclc
doi: 10.1045/may2002-wagner

jpeg 2000 for long-term preservation: jp2 as a preservation format

d-lib magazine may/june 2011, volume 17, number 5/6

johan van der knijff
kb / national library of the netherlands
johan.vanderknijff@kb.nl
doi:10.1045/may2011-vanderknijff

abstract

despite the increasing popularity of jpeg 2000 in the archival community, the suitability of the jp2 format for long-term preservation has been poorly addressed in the existing literature. this paper demonstrates how some parts of the jp2 file specification (related to icc profiles and grid resolution) contain ambiguous information, leading to a situation where different software vendors interpret the standard in slightly different ways. this results in a number of risks for preservation. these risks could be reduced by applying some minor changes to the format specification, combined with adherence to the updated standard by software vendors.

introduction

the last few years have seen a marked rise in the use of jpeg 2000 in the cultural heritage sector.
several institutions are now using jpeg 2000 part 1 (the jp2 format) as a preferred archival and access format for digital imagery. examples include (but are not limited to) the national library of the netherlands (gillesse et al., 2008), the british library (mcleod & wheatley, 2007), the wellcome library (henshaw, 2010a), the library of congress (buckley & sam, 2006), the national library of norway (national library of norway, 2007), and the national library of the czech republic (vychodil, 2010). a number of other institutions are currently investigating the feasibility of using jp2 as a replacement for uncompressed tiff, which is still the most widely used still image format for long-term archiving and preservation.

in spite of the wide interest in jpeg 2000 from the archival community, the existing literature is surprisingly sparse on the actual suitability of the standard for long-term preservation. if preservation is addressed at all, what is often lacking is a specification of what information inside an image is worth preserving in the first place. moreover, such discussions are often limited to largely theoretical considerations (e.g. features of the jp2 format), without going into the more practical aspects (e.g. to what extent existing software tools actually follow the features that are defined by the format specification). however, without taking such factors into account, can we say anything meaningful about how an image created using today's software will be rendered in, say, 30 years' time? also, at some point in the future it may be necessary to migrate today's images to a new format. how confident can we be about not losing any important information in this process? alternatively, if we opt for emulation as a preservation strategy, how will the images behave in an emulated environment?

the above questions are central to this paper. there are many aspects to assessing the suitability of a file format for a particular preservation aim (see e.g. loc, 2007 and brown, 2008). in this paper i limit myself to addressing two areas where the jp2 format specification can be interpreted in more than one way: support of icc profiles and the definition of grid resolution. i demonstrate how these ambiguities have led to divergent interpretations of the format by different software vendors, and how this introduces risks for long-term preservation. i also present some possible solutions. finally, i provide a number of practical recommendations that may help institutions to mitigate the risks for their existing collections. unless stated otherwise, the observations in this paper only apply to the jp2 file format, which is defined by jpeg 2000 part 1 (iso/iec, 2004a).

colour management in jp2: restricted icc profiles

section i.3 of the jp2 format specification (iso/iec, 2004a) describes the methods that can be used to define the colour space of an image. the most flexible method uses icc profiles, and is based on version icc.1:1998-09 of the icc specification (icc, 1998). jp2 supports the use of icc profiles for monochrome and three-component colour spaces (such as greyscale and rgb). however, jp2 does not support all features of the icc standard.
instead, it uses the concept of a "restricted icc profile", which is defined as follows: "this profile shall specify the transformation needed to convert the decompressed image data into the pcsxyz, and shall conform to either the monochrome input or three-component matrix-based input profile class, and contain all the required tags specified therein, as defined in icc.1:1998-09." (iso/iec 2004a, table i.9). to appreciate what this actually means, it is helpful to give some additional information on the icc standard. first of all, the icc specification distinguishes 7 separate icc profile classes. the most commonly used ones are the "input device" (or simply "input"), "display device" ("display") and "output device" ("output") classes. another one that is relevant in this context is the "colorspace conversion" class. second, it is important to know how colour transformations can be defined within the icc standard. for monochrome images, the colour transformation is always described using a gray tone reproduction curve (trc), which is simply a one-dimensional table. for rgb spaces, two methods are available. the first one is based on a three-component matrix multiplication. the second (n-component lut-based method) uses an algorithm that includes a set of tone reproduction curves, a multidimensional lookup table and a set of linearisation curves (icc, 1998). going back to the jp2 specification, the restrictions in the "restricted icc profile" class are: for rgb colour spaces, n-component lut-based profiles are not allowed (only three-component matrix-based profiles). only (device) input profiles are allowed (for both monochrome and rgb spaces) the first restriction makes sense, since n-component lut-based profiles are more complex than three-component matrix-based ones, and thus more difficult to implement. the logic behind the restriction of allowing only input profiles is more difficult to understand, since it prohibits the use of all other icc profile classes. according to the icc specification, the "input" class represents input devices such as cameras and scanners. however, widely used working colour spaces such as adobe rgb 1998 (adobe, 2005) and ecirgb v2 (eci, 2007) are defined using profiles that belong to the display profile class. as a result, they are not allowed in jp2, even though both the adobe rgb 1998 and ecirgb v2 profiles use the three-component matrix-based transformation method. since there is no obvious reason for prohibiting such profiles, it would appear that the restriction to "input" profiles may be nothing more than an unintended error in the file specification. this impression is reinforced by the fact that the file specification of the jpx format (which is defined by jpeg 2000 part 2) also consistently uses the phrase "input icc profiles" in the definition of its "any icc profile" method (which doesn't have any restrictions on the use of n-component lut-based profiles) (iso/iec, 2004b). a major consequence of the "input" restriction is that a literal interpretation of the format specification limits the use of icc profiles to such a degree that any serious colour management becomes impossible in jp2. for colour imagery, the only colour space that can be handled without using icc profiles is srgb. full-colour printed materials often contain colours that cannot be represented in the srgb colour space. 
if such materials need to be digitised with minimal loss of colour fidelity, a colour space with a wider gamut (such as adobe rgb or ecirgb) is needed, and this requires the use of icc profiles. since the format specification prohibits this, the jp2 format is, in its current form, unsuitable for applications that require colour support beyond srgb.

handling of icc profiles by different encoders

in order to test how the most widely used jpeg 2000 encoders handle icc profiles in practice, i took a number of tiff images containing embedded icc profiles, and tried to convert them to jp2 with each of these encoders. the icc profiles in all experiments were display device profiles for the adobe rgb 1998 and ecirgb v2 working colour spaces, which both use the three-component matrix-based transformation method. i subsequently analysed all generated images using exiftool 8.12 (harvey) and jhove 1.4 (jhove). table 1 summarises the results.

luratech command line tool version 2.1.20.10 (jp2clt.exe):
upon detection of a non-"input" profile, output is automatically written in jpx format; icc profile embedded using the "any icc" method.

luratech command line tool version 2.1.22.0 (jp2clt.exe):
jp2 file; icc profile embedded using the "restricted" method. the profile class of the original profile ("display device") is changed to "input device".

adobe photoshop cs4 using the adobe jpeg2000 (version: 2.0, 2007) plugin, "jp2 compatible" option not checked:
jpx (jpeg 2000 part 2) file; icc profile embedded using the "any icc" method.

adobe photoshop cs4 using the adobe jpeg2000 (version: 2.0, 2007) plugin, "jp2 compatible" option checked:
jpx file containing two versions of the icc profile: the original profile using the "any icc" method, and a modified version using the "restricted" method. the profile class of the original profile ("display device") is changed to "input device", and a "modified" prefix is added to the profile description field (e.g. "modified ecirgb v2").

kakadu 6.3 (kdu_compress.exe):
jp2 file; icc profile not embedded.

imagemagick 6.6.1-2 (convert.exe):
jp2 file; icc profile not embedded.

aware jpeg 2000 sdk 3.18 (j2kdriver.exe):
jp2 file; icc profile embedded using the "restricted" method.

table 1: preservation of icc profiles in tiff to jpeg 2000 migration using different encoders.

we can make a couple of interesting observations from these results. first, 3 out of the 7 experiments resulted in a jpx file. jpx is an extension of the jp2 format that is defined by jpeg 2000 part 2 (iso/iec, 2004b). most jpx files can be read by jp2 decoders, which will simply ignore any features that are not permitted within jp2. jpx also contains a separate "any icc" method that, unlike jp2, supports the use of n-component lut-based icc profiles. decoders that do not include jpx support will simply ignore icc profiles that are defined using this method. at present very few decoders include support for jpx, and the adoption of the format is negligible. because of this, the format is not well suited for preservation. with this in mind, the behaviour of version 2.1.20.10 of the luratech software (also reported by henshaw, 2010b) is somewhat odd. depending on the characteristics of the input image, the encoder may decide to use the jpx format without any explicit instruction from the user to do so. even worse, users may be completely unaware of this.
since the icc profiles in all test images use the three-component matrix-based transformation, the only reason for not allowing them in jp2 would be the fact that they are not "input" profiles. however, since the "any icc" method in the format specification of jpx contains the very same "input" restriction, switching to jpx doesn't solve this problem. this behaviour has been corrected in more recent versions of luratech's software: if version 2.1.22.0 of the encoder encounters a "display" profile in the input image, it writes a jp2 file, but it changes the "display" profile class value of the original profile to "input" in the resulting image [1].

adobe's jpeg 2000 plugin for photoshop only encodes to the jpx format [2]. however, it has an option to create jpx files that are "jp2 compatible". when this option is activated, in addition to the original profile, it adds a modified version to the image, where the "display" class is simply changed to "input". so, these images contain two different versions of the same profile.

icc profiles are lost altogether in the kakadu and imagemagick migrations. this is consistent with earlier results by kulovits et al. (2009). i should add here that kakadu does actually support the use of icc profiles, but in an indirect way that requires the user to specify a profile's parameters on the command line. only the aware encoder managed to create jp2 images that include embedded "display" icc profiles without altering them in any way during the migration. so, only aware and recent versions of the luratech encoder currently permit basic colour management in the jp2 format. aware achieves this by deviating from the jp2 format specification, whereas luratech simply changes the profile class fields.

resolution headers

most still image formats use straightforward, fixed header fields for describing the grid resolution of the image data. for jp2 (and the other jpeg 2000 formats) the situation is somewhat more complex, because the format distinguishes two separate resolution types. both are optional, and an image may contain either, both or neither. first, there is a "capture resolution", which is defined as "the grid resolution at which the source was digitized to create the image samples specified by the codestream". two examples are given: the resolution of the flatbed scanner that captured a page from a book, or the resolution of an aerial digital camera or satellite camera (iso/iec 2004a, section i.5.3.7.1). second, there is a "default display resolution", which is defined as "a desired display grid resolution". the specification states that "this may be used to determine the size of the image on a page when the image is placed in a page-layout program". it then continues by warning that "this value is only a default", and that "each application must determine an appropriate display size for that application" (iso/iec 2004a, section i.5.3.7.2).

the definition of these resolution types is problematic for a number of reasons. first of all, the use of the word "digitized" in the definition of "capture resolution" implies that it only covers analog-to-digital capture processes, such as the scanning of a printed photograph. however, in the case of born-digital materials there is no such analog-to-digital capture process, so the definition does not apply. a similar situation arises if we scan a photograph at, say, 300 ppi, and subsequently resample the resulting image to 150 ppi.
obviously the original image has a capture resolution of 300 ppi, but it is less clear where we should store the grid resolution of the resampled image. one possibility would be to use the "default display" fields. however, the definition of "default display resolution" is rather vague, and it is difficult to understand what it means at all (e.g. what is "desired", and if this value is "only a default", what is this "default" based on?). my interpretation is that it is basically intended to allow reader applications to establish some sensible (but arbitrary) default zoom level upon opening the image. if this is correct, its value may be quite different from the grid resolution of the (either resampled or born-digital) image.

semantic issues aside, the use of two separate sets of resolution fields also creates practical problems. first of all, it complicates the process of establishing the grid resolution of an image, since the location of this information ("capture" or "default display" fields) would become dependent on its creation history. second, in the case of format migrations that may be part of imaging workflows as well as (future) preservation actions, there is no obvious mapping between the resolution fields of jp2 and other formats. figure 1 illustrates this. just as an example, most digitisation workflows still use tiff for capture and intermediate processing, and the conversion to jp2 is only done as a final step. since a tiff image only has one set of resolution fields, to which jp2 fields should we map these values (taking into account that the tiff may or may not have been resampled after capture)? finally, there is the observation that, to the best of my knowledge, there is not a single example of a jpeg 2000 encoder that uses jp2's resolution fields in a manner that is consistent with the format specification. i will illustrate this in the next section.

figure 1: mapping of resolution fields in migrations to and from jpeg 2000. migration 1 is a typical tiff to jp2 migration in a digitisation workflow; migration 2 represents a preservation action that involves a migration from jp2 to some future image format. in both cases, the mapping of the resolution fields before and after the migration is not clearly defined.

handling of resolution headers by different encoders

in order to find out how current encoders are handling the resolution fields in practice, i analysed how grid resolution is stored in the output images of the aforementioned tiff to jpeg 2000 migration experiment. table 2 shows the results.

    software                                                               capture resolution   display resolution
    luratech command line tool version 2.1.20.10 (jp2clt.exe)              yes                  no
    luratech command line tool version 2.1.22.0 (jp2clt.exe)               yes                  no
    adobe photoshop cs4 using adobe jpeg2000 (version: 2.0, 2007) plugin   yes                  no
    kakadu 6.3 (kdu_compress.exe)                                          no                   yes
    imagemagick 6.6.1-2 (convert.exe)                                      no                   no
    aware jpeg 2000 sdk 3.18 (j2kdriver.exe)                               yes                  no

table 2: header fields used for storing grid resolution after tiff to jpeg 2000 migration using different encoders.

luratech, adobe and aware always map the tiff resolution fields to "capture resolution" in jpeg 2000. the imagemagick files do not contain any resolution information at all. only kakadu always uses the "default display" fields. on a side note, accusoft imagegear, which uses the kakadu libraries for writing jp2, also uses the "display" fields. this may apply to other kakadu-based products as well.
crucially, none of these encoders use "capture resolution" in the way it is described in the format specification. what these results show is that establishing the grid resolution of a jp2 image is not straightforward, because the location of this information is not well defined. they also show that most encoders ignore the literal meaning of "capture resolution" in the jp2 format specification, and simply use these fields in a manner that is analogous to the tiff resolution fields.

implications for preservation

icc profiles

in the previous sections i explained how the jp2 file specification appears to be unnecessarily restrictive with respect to embedded icc profiles, and i demonstrated that different software vendors are handling these restrictions in a variety of ways. from a preservation point of view, the central issue here (as already stated in the introduction to this paper) is the impact this may have on rendering existing images in the future, and on the preservation of information in any future migration to some new format.

there are several problems here. first of all, a strict adherence to the format specification would simply rule out the use of icc profiles in most cases. this would make the format unsuitable for any applications that require a colour gamut beyond srgb space. the aware encoder permits the use of jp2 for such applications by ignoring the "input" profile restriction. however, by doing so, such files no longer adhere to the format specification. recent versions of luratech's encoder do stick to the format specification, but enable the use of "display" icc profiles by changing the profile class fields. in both cases the impact on future migrations, or on the use of such files in an emulated environment, will most likely be minor. an "input" profile defines a transformation from a device-dependent colour space to a universal profile connection space (pcs), whereas a "display" profile simply describes the reverse pathway (from the pcs to a device-dependent space). technically, both are identical, and the colour transformation will be performed correctly even if the profile class label doesn't match the actual use. however, as for aware's solution, one cannot completely rule out that future decoders may ignore embedded "display" profiles, which is a potential risk for future migrations. luratech's current solution is also somewhat unsatisfactory, as it achieves adherence to the format specification by modifying (if only slightly) the original data.

earlier versions of the luratech encoder produce a jpx file if they encounter an icc profile that doesn't adhere to the "restricted icc" definition. as software support for jpx is so poor, there is a real risk that the icc profiles will get lost in a future migration (even though the image data will most likely be preserved). moreover, since the jpx file specification also limits the use of icc profiles to the "input" class, such files do not adhere to the jpx file specification either. the same applies to adobe's implementation, although the risks are even greater for these files because of the use of an erroneous file type header field, which makes the handling of these files by current and future decoders largely unpredictable.

resolution header fields

grid resolution does not directly affect the rendering of an image (unlike icc profiles). nevertheless, it is an important image property: for digitised imagery, resolution enables us to establish the dimensions of the digitised object.
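to make the 'where is the resolution stored?' problem concrete, the sketch below walks the box structure of a jp2 file and reports whichever of the "capture" ('resc') and "default display" ('resd') resolution boxes it finds, converting the stored rational-plus-exponent values (grid points per metre) to pixels per inch. the box layout follows annex i of the jp2 specification as i read it; extended-length boxes are not handled, and this is a diagnostic sketch rather than a full parser. the file path is supplied on the command line.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Jp2Resolution {
    // walk the boxes in buf[start, end) and print any capture/display resolution found
    static void walkBoxes(ByteBuffer buf, int start, int end) {
        int pos = start;
        while (pos + 8 <= end) {
            long length = Integer.toUnsignedLong(buf.getInt(pos));
            String type = new String(new byte[] {
                    buf.get(pos + 4), buf.get(pos + 5), buf.get(pos + 6), buf.get(pos + 7) },
                    StandardCharsets.US_ASCII);
            if (length == 1) return;                  // extended-length box: not handled here
            if (length == 0) length = end - pos;      // box runs to the end of its container
            if (length < 8) return;                   // malformed box
            int payload = pos + 8;
            int boxEnd = (int) Math.min(pos + length, end);
            if (type.equals("jp2h") || type.equals("res ")) {
                walkBoxes(buf, payload, boxEnd);      // superboxes contain boxes directly
            } else if (type.equals("resc") || type.equals("resd")) {
                int vNum = Short.toUnsignedInt(buf.getShort(payload));
                int vDen = Short.toUnsignedInt(buf.getShort(payload + 2));
                int hNum = Short.toUnsignedInt(buf.getShort(payload + 4));
                int hDen = Short.toUnsignedInt(buf.getShort(payload + 6));
                int vExp = buf.get(payload + 8);      // signed exponents
                int hExp = buf.get(payload + 9);
                // values are grid points per metre; 0.0254 m per inch converts them to ppi
                double vPpi = (double) vNum / vDen * Math.pow(10, vExp) * 0.0254;
                double hPpi = (double) hNum / hDen * Math.pow(10, hExp) * 0.0254;
                String label = type.equals("resc") ? "capture" : "default display";
                System.out.printf("%s resolution: %.1f x %.1f ppi%n", label, hPpi, vPpi);
            }
            pos = boxEnd;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));   // path to a jp2 file
        walkBoxes(ByteBuffer.wrap(data), 0, data.length);
    }
}
```

running such a check across a collection is one way to answer the 'which fields (if any) are used for storing the image's grid resolution' question raised in the interim recommendations below.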
from a preservation point of view, the main risk that results from the current situation with jp2's resolution header fields is that resolution information may be lost in future migrations (see also figure 1). for instance, a (future) decoder that expects grid resolution to be stored in the "capture" fields and ignores the "default display" fields will not be able to establish any meaningful resolution information from images that were created using current versions of kakadu. some tools will internally substitute the missing resolution fields with default values. for instance, if adobe photoshop cannot find the "capture resolution" fields, it assumes a default value of 72 ppi. if such files are subsequently re-saved, photoshop will actually write this (entirely fictional) value to the resolution fields of the created file. other tools may behave in a similar way, which introduces the risk that resolution information may change after a migration. also, none of the existing encoders appear to follow the (strict) definitions of these fields in the file specification. the file specification allows the use of both sets of fields in one file. although i am not aware of any existing applications that actually do this, the correct interpretation of the resolution information would get very confusing in that case.   way forward for icc profile and resolution issues although the issues i reported here are relatively minor, they can have major consequences within a preservation context. however, both the icc and the resolution issues could be largely fixed by making some small changes to the jp2 file specification. regarding the icc issue, the jpeg committee is already working on a proposal for extending the support of icc profiles in jp2, and bringing it in line with the latest icc specification. this would involve removing the "input" restriction in the "restricted icc" method, which would allow the use of "display device" profiles (robert buckley, personal communication). (the "output device" class would still be prohibited in that case, since it always uses n-component lut-based profiles.) as for the resolution issue, the solution may be as simple as slightly expanding the definition of "capture resolution". as explained before, the current definition only covers analog-to-digital capture processes. however, both the rasterisation of a vector drawing (born-digital material) and the resampling of an existing image can be seen as digital-to-digital capture processes. hence, a possible solution would be to include such cases in the definition of "capture resolution", which could then be generalised as "the grid resolution at which the source was captured to create the image samples specified by the codestream". this updated definition should then be illustrated using examples of both analog-to-digital and digital-to-digital capture processes. this would make these fields consistent with their de facto use by most existing encoders (as shown by table 2). it would also ensure backward compatibility for existing files as they are produced by most encoders (except kakadu, and some products that are based on the kakadu libraries). the definition of "default display resolution" could either be made more specific, or, alternatively, these fields could be deprecated altogether. in addition to these changes in the file specification, software vendors should be encouraged to produce encoders that are compliant with the (corrected) standard.
the cultural heritage community could play an important role here by insisting on using software that is standards-compliant.   interim recommendations for existing collections in the previous section i suggested a way forward, which requires actions from the standards body and the software industry. in the meantime, institutions that are currently using jp2 as a preservation format may take a number of steps to mitigate any future risks. for existing collections, it is essential that any features that may imply a risk are both known and documented. this documentation should at least answer the following questions: what is the file format (jp2, jpx)? do the images contain icc profiles? what are the underlying characteristics of the icc profiles (profile description, matrix coefficients, and so on)? are icc profiles embedded using the "restricted" or "any icc" method? do the images contain multiple versions of the icc profile? which fields (if any) are used for storing the image's grid resolution? which software was used to create the images? apart from the last one, all the above questions can be answered using freely available software tools. particularly useful in this respect are exiftool (harvey) and jhove (jhove). both tools are capable of giving the required information on file format and resolution fields. jhove does not give any direct information about embedded icc profiles; however, it will tell whether an image makes use of the "restricted" or "any icc" methods. on the other hand, exiftool provides detailed information about embedded icc profiles, but it doesn't report which method was used. so both tools complement each other here. the resulting documentation will be helpful for making a realistic assessment of long-term risks. it may also be a starting point for planning a medium-term preservation action, such as the normalisation to standards-compliant (i.e. compliant with an updated version of the standard) jp2 images. however, in the latter case one should be aware that such a normalisation procedure by itself introduces further risks of information loss. if done thoughtlessly, the long-term outcome may be worse than doing nothing at all. for new and ongoing digitisation projects, the most sensible interim recommendations would be to stick to the jp2 format whenever possible, avoid jpx, embed icc profiles using the "restricted" method, and avoid multiple icc profile versions. in addition, the aforementioned recommendations for existing collections all apply here as well.   conclusions in this paper i showed that the current jp2 format specification leaves room for multiple interpretations when it comes to the support of icc profiles, and the handling of grid resolution information. this has led to a situation where different software vendors are implementing these features in different ways. in the case of icc profiles, a strict interpretation of the standard even completely prohibits the use of icc profiles for defining working colour spaces, which would make the format unsuitable for any applications that require colour support beyond the srgb colour space. for preservation, this results in a number of risks, because images may not be rendered properly by future viewers, and colour space and resolution information may be lost in future migrations. these issues could be remedied by some small adjustments of jp2's format specification, which would create minimal backward compatibility problems, if any at all.
for the icc profile issue, a proposal for such an adjustment is already under way from the jpeg committee, and i have suggested a possible solution for the resolution issue here. in addition, it would be necessary that software vendors adhere to the modified standard. small as they may be, such changes could significantly improve the suitability and acceptance of jp2 as a preservation format.   acknowledgements i would like to thank hans van dormolen (kb) for sharing his observations on various problems related to the handling of icc profiles and grid resolution. this ultimately served as the impetus for much of the research presented here. thanks are also due to christy henshaw, laurie auchterlonie and ben gilbert (wellcome library) for providing the luratech 2.1.22 test images. wouter kool (kb) is thanked for providing the luratech 2.1.20 test images. jack holm (international color consortium) and axel rehse (luratech imaging gmbh) are thanked for their helpful comments and suggestions on the "input"-"display" profile issue. thomas richter (accusoft pegasus) and scott houchin (aerospace corporation) are thanked for sharing their thoughts on the capture resolution issue, which guided me towards the current proposed solution. robert buckley (rob buckley consulting), richard clark (elysium ltd) and barbara sierman (kb) are all thanked for their feedback on an earlier draft of this paper.   notes [n1] the luratech software also does this for jpx files, which means it is standards-compliant for both formats. [n2] although the adobe plugin produces files that contain features which are only allowed in jpx, it assigns an erroneous value to the "brand" header field that uniquely identifies a jpeg 2000 file as either jp2 or jpx. as a result, these files are neither valid jp2 nor jpx. moreover, any file identification tools that are based on byte signatures ("magic numbers") will identify these files as jp2, even though the real format is jpx.   references [1] adobe. adobe rgb (1998) color image encoding version 2005-05. san jose: adobe systems inc., 2005. 29 dec 2010 http://www.adobe.com/digitalimag/pdfs/adobergb1998.pdf. [2] brown, a. digital preservation guidance note 1: selecting file formats for long-term preservation. london: the national archives, 2008. 5 jan 2011 http://www.nationalarchives.gov.uk/documents/selecting-file-formats.pdf. [3] buckley, r. & sam, r. jpeg 2000 profile for the national digital newspaper program. washington: library of congress office of strategic initiatives, 2006. 27 dec 2010 http://www.loc.gov/ndnp/guidelines/docs/ndnp_jp2histnewsprofile.pdf. [4] eci. ecirgb_v2 the update of ecirgb 1.0 background information. european color initiative, 2007. 29 dec 2010 http://www.eci.org/doku.php?id=en:colourstandards:workingcolorspaces. [5] gillesse, r., rog, j. & verheusen, a. alternative file formats for storing master images of digitisation projects. den haag: koninklijke bibliotheek, 2008. 27 dec 2010 http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/alternative_file_formats_for_storing_masters_2_1.pdf. [6] harvey, p. exiftool. 30 dec 2010 http://www.sno.phy.queensu.ca/~phil/exiftool/. [7] henshaw, c. we need how much storage? london: wellcome library, 2010a. 27 dec 2010 http://jpeg2000wellcomelibrary.blogspot.com/2010/06/we-need-how-much-storage.html. [8] henshaw, c. finding a jpeg 2000 conversion tool. london: wellcome library, 2010b. 30 dec 2010 http://jpeg2000wellcomelibrary.blogspot.com/2010/07/finding-jpeg-2000-conversion-tool.html. [9] icc.
specification icc.1:1998-09 — file format for color profiles. international color consortium, 1998. 29 dec 2010 http://www.color.org/icc-1_1998-09.pdf. [10] iso/iec. "information technology — jpeg 2000 image coding system: core coding system". iso/iec 15444-1, second edition. geneva: iso/iec, 2004a. 28 dec 2010 http://www.jpeg.org/public/15444-1annexi.pdf ("annex i: jp2 file format syntax" only). [11] iso/iec. "information technology — jpeg 2000 image coding system: extensions". iso/iec 15444-2, first edition. geneva: iso/iec, 2004b. 28 dec 2010 http://www.jpeg.org/public/15444-2annexm.pdf ("annex m: jpx extended file format syntax" only). [12] jhove jstor/harvard object validation environment. 30 dec 2010 http://hul.harvard.edu/jhove. [13] kulovits, h., rauber, a., kugler, a., brantl, m., beinert, t. & schoger, a. "from tiff to jpeg 2000? preservation planning at the bavarian state library using a collection of digitized 16th century printings". d-lib magazine 15.11/12 (2009). 27 dec 2010 doi:10.1045/november2009-kulovits. [14] loc. "sustainability factors". sustainability of digital formats planning for library of congress collections. washington: library of congress, 2007. 5 jan 2011 http://www.digitalpreservation.gov/formats/sustain/sustain.shtml. [15] mcleod, r. & wheatley, p. preservation plan for microsoft — update digital preservation team. london: british library, 2007. 27 dec 2010 http://www.bl.uk/aboutus/stratpolprog/ccare/introduction/digital/digpresmicro.pdf. [16] national library of norway. digitization of books in the national library — methodology and lessons learned. oslo: national library of norway, 2007. 27 dec 2010 http://www.nb.no/content/download/2326/18198/version/1/file/digitizing-books_sep07.pdf. [17] vychodil, b. "jpeg2000 specifications for the national library of the czech republic". seminar jpeg 2000 for the practitioner. london: wellcome trust, 16 nov 2010. 27 dec 2010 http://www.dpconline.org/component/docman/doc_download/520-jp2knov2010bedrich.   about the author johan van der knijff is a digital preservation researcher at the koninklijke bibliotheek, national library of the netherlands. his work focuses around preservation-related aspects of digital file formats. he holds an msc in physical geography from utrecht university (nl), where he specialised in hydrology, geographical information systems and remote sensing. johan previously worked on the development of hydrological simulation models at the european commission's joint research center.   copyright © 2011 johan van der knijff

d-lib magazine december 2002 volume 8 number 12 issn 1082-9873 towards continuous web archiving first results and an agenda for the future   julien masanès bibliothèque nationale de france 1. introduction the web has become one of the major channels for dissemination of culture and knowledge. consequently, there is a growing awareness of the need to track and archive web content. pioneering work in this regard began in 1996 by the australian, canadian and swedish national libraries and also by the internet archive, a us-based non-profit foundation. these institutions have been using two different approaches for archiving the web. the australian [1] and the canadian [2] national libraries have been exploring a selective approach based on pre-selection, capture and cataloguing of individual web sites.
this approach uses a selection policy adopted to restrict the scope of candidate web sites for archiving. the swedish [3] national library and the internet archive [4] have been archiving the web based on automatic harvesting policies using tools called "crawlers". crawlers had already been developed for other types of information retrieval on the net. they allow a much broader scope for archiving web content, but they do not provide for the monitoring of individual sites or adapting the archiving frequency of those specific sites. these pioneering efforts have contributed to rising awareness about the possibility for, and necessity of, web archiving. an increasing number of national libraries are now engaged in building, or are planning to build, web archives, and many other institutions like universities are starting topic related projects [5]. the need to develop common tools and enable information exchange on the topic of web archiving has already been addressed by the nedlib european [6] project, which initiated the development of an open source crawler that meets specific functional requirements such as harvesting priorities (for example, trying to lower the total delay for archiving a single site to keep a consistent image of it). within the small but growing community of web archivists [7], possible collaborations and joint projects are openly discussed [8]. those involved in the discussions hope that effective collaboration will be possible in the future to help the community face difficult but fascinating challenges. in this article, i will outline the contribution of the national library of france (bnf) to this discussion. at bnf, we began a research project on web archiving in late 1999. our project experiments have been ongoing even as the legal deposit law has been in the process of being updated—a process that has not yet ended. our work on web archiving is divided into two parts. the first part is to improve crawlers for continuous and adapted archiving. this means being able to automatically focus the crawler for satisfactory archiving. apart from getting existing, hands-on tools, this part of our project, which is presented in this article, consists of defining and testing good parameters toward that aim. the second part of our work is testing every step of the process for depositing web content. in our view, deposit is a necessary part of archiving the web, because a large amount of very rich web content is out of the reach of crawlers. this hidden content is often called the "deep web", and it consists mainly of large document repositories accessible only through descriptive information stored in relational databases (see, for instance, the 30 million digitized pages from the gallica collection at ). during the workshop on web archiving [9] at the 2002 european conference on digital libraries (ecdl), i presented results from a pilot study on archiving deep web sites. before going into more detail about improving crawlers in this article, let me first explain how our project envisions articulation of online harvesting and deposit. we need to take the advantage afforded by the internet to collect online content quickly using crawlers. collection tools must be automatic, as far as possible, in order to deal with the huge amount of content accessible on the internet. accessing, as well as gathering, can only be achieved on such a large scale with automatic tools. 
some very high quality content on the web is unreachable by crawlers but is worthy of additional manual harvesting efforts because there is no automated way to get this deep web content at this time. even though crawlers can't access deep web content, they can at least provide technical features analysis to detect the deep web sites. for example, we have defined a notification functionality consisting mainly of relevant technical feature tracking (like form or password protection) combined with linguistic analysis to systematically detect sites that may contain "deep web" content. a list of these deep web sites can then be sent to experts who can evaluate their relevance for deposit. the combination of both automatic and manual collecting is the only way to ensure that, no matter what the technical wrapping, the most relevant web content is found and archived. figure 1 2. towards continuous surface web archiving the snapshot approach and its limitations up to now, on-going projects based on automatic tools have gathered what we call "snapshot" collections. ironically, such a "snapshot" collection is anything but instantaneous. the single capture period for a snapshot extends for months because the amount of data to be collected is huge. what we finally get are temporally based slices of a collection. consequently, each site is only archived once every two to six months, no matter what the site is. though harvesting content from some sites (steady or cumulative ones) at such intervals may be adequate, for most sites it is not adequate (think for instance of newspaper sites). as trying to increase the frequency of global snapshots costs too much, the only solution is to refine the collection process by adapting the capture frequency to each site and/or change the perimeter of the crawl. valuation of page dynamic and incremental crawler such adapted crawls can take into account only the site change rate or, at least, an estimation of the change rate. this estimation has long been implemented in search engine robots. when a first snapshot is completed, the robot checks for updates without actually loading documents, which saves a great deal of time. the http protocol makes this possible, even though it is based on information provided by servers, which is not terribly reliable. the second crawling round being much more rapid, the robots can make another harvest, for example, the week after, and again and again, until the robots get an estimation of the change rates for every page in the database at a day- or even an hour-level. using this information, archivists can then set their robots to more frequently refresh particular pages. for archiving purposes, one might be reluctant to utilize information provided by http servers (which are often only roughly configured). in that case, it is possible to load the document and perform one's own change valuation. this can be useful when trying to make a better change appraisal (for instance, excluding time stamps or visiting counters on html pages, mostly dynamically generated ones). research to evaluate site-level, adapted diff functions may also be helpful in this regard [10]. change frequency estimation can then be made on a more reliable basis, and the rest of the process can be the same as described above. even with change rate estimation, there is still a trade-off between the updating and breadth of the collection, but change rate estimation dramatically improves achievements in both dimensions [11].
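as an aside, the http-level update check described above is simple to sketch. the fragment below (python, using the requests library; the url is a placeholder) issues a conditional request based on the validators recorded at the previous visit and treats a "304 not modified" response as "unchanged". as noted in the text, this trusts server-provided information; a more careful appraisal would download the page and compare a hash of its content after stripping time stamps and counters.

```python
# minimal sketch of a conditional http update check: reuse the etag and
# last-modified values recorded at the previous visit and interpret
# "304 not modified" as "the page has not changed".
import requests

def has_changed(url, etag=None, last_modified=None, timeout=30):
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=timeout)
    changed = resp.status_code != 304
    # return the new validators so the next visit can repeat the check
    return changed, resp.headers.get("ETag"), resp.headers.get("Last-Modified")

changed, etag, last_mod = has_changed("http://example.org/")
print(changed, etag, last_mod)
```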
focusing the crawl another non-exclusive procedure is to adapt capture frequency, taking into account not only a particular site's intrinsic dynamic but also the relative importance of each site. evaluating site importance or relevance allows one to focus the crawl on specific parts of the web in order to ensure that these parts will be archived and monitored appropriately. however, this is not an all-or-nothing choice. one can assign various levels of priority to sites for crawling and also for archiving (which may be different). this makes it possible, for instance, to perform a yearly global snapshot and then to focus additional crawls on subsets of the web so that these subsets are archived as often as necessary. discovering or manually adding new sites and manually changing parameters in the database can be done "on the fly" in such a process. it may be important to have different priorities for crawling and archiving. for instance, it might be interesting for discovery purposes to frequently crawl certain urls but not archive them each time. here again, focusing the crawl and the archiving process would dramatically improve the web archive and help us face the continually growing amount of publicly accessible information on the internet. overall, focused crawls would help in building coherent collections for which the collection extent and freshness would be defined according to a documentation policy rather than determined by technical or financial limitations. (i don't feel comfortable with the perspective of having to justify myself in a few years saying "we don't have this site because we had to stop the robot after its two-month blind trip".) importance estimation the question remains: is it possible to define relevant and automatically computed parameters to focus a robot on only that part of the web we want to archive? of course, other national libraries may have different policies in this regard, just as they have different collection policies for traditional publications. it is not my intention to discuss here the variety of existing and possible choices in this domain. but we should try to set down a policy on parameters, and the key issue is: what parameter can we use? at the bnf we have tried to explore in two directions, but other directions could certainly appear in the future. the first direction we explored is framed by the traditional librarian approach based on published material (contrary to a traditional archivist approach). the problem is that the notion of what constitutes "publishing" (to make some content "publicly available") has been greatly broadened due to the low cost of making content available on the internet. as internet "publishing" is almost cost-free, the traditional editorial filter seems to be no longer needed, or at least can be by-passed easily. this means that material can be "published" on the internet without professional appraisal of the significance or relevance of that material. national libraries cannot replace the workload of all the traditional publishers and, on their own, filter this proliferating mass of content. this doesn't mean selection of sites can't be done on the internet. it just means that national libraries can't select all the content on the internet needed for the collections they are used to building and archiving through legal or copyright deposit. however, many persons actually do select content on a very small scale using links. 
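the link-based importance computation developed in the next paragraphs can be sketched, very generically, as a power iteration over the link graph. the fragment below is an illustration only — neither google's production pagerank nor the xyleme algorithm used in the test reported later — and the toy graph is made up.

```python
# generic sketch of in-link-based importance scoring by power iteration,
# in the spirit of pagerank. "links" maps each site to the sites it links to.
def link_importance(links, damping=0.85, iterations=50):
    sites = set(links) | {t for targets in links.values() for t in targets}
    rank = {s: 1.0 / len(sites) for s in sites}
    for _ in range(iterations):
        new = {s: (1.0 - damping) / len(sites) for s in sites}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling site: spread its weight evenly over all sites
                for t in sites:
                    new[t] += damping * rank[source] / len(sites)
        rank = new
    return rank

# toy graph: site "a" receives the most in-going links and ends up ranked highest
print(link_importance({"a": ["b"], "b": ["a"], "c": ["a", "b"], "d": ["a"]}))
```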
extracting the link structure at the scale of the global web tells us much about what is "visible" in the web space. in a very broad sense, sites that have more in-going links can be considered as the more relevant or significant sites [12]. at least, it is a hypothesis that we have been testing at the bnf. before presenting our test results in more detail, i offer a few comments on our approach. using a global computation of the link structures is a way for us to stick to something similar to the previously mentioned notion of publishing. if making something "publicly available" on the net becomes insignificant in and of itself, let's try to refine and focus on some part of it—the most "public" one, which in a hypertext space is the most linked one. this type of computation has proven its efficiency for ranking query results on particular keywords (see google's success, based on utilization of the now famous algorithm, pagerank [13]). of course, the context in which we propose to use this information is quite different. the main difference is that google is able to efficiently rank pages, given a specific set of query terms. google computes an absolute ranking from the global matrix of all links, but this ranking is only used each time in a subspace defined by the query terms. using this type of ranking to define a global selection policy is different. two particular risks arise in the selection policy context: the first risk is to under rank valuable sites on a very specific domain. for instance, highly specialized research topics may be represented by only a few sites with few links, even if these sites are to become very important in the future. the second risk is—on the contrary—to over rank some sites, mainly very well known sites that may have little value for archiving, like document-like weather forecast sites or sites that may be of low priority for long term archiving (like e-commerce sites). analyzing the link structure is not a very efficient means to avoid these risks, as we will see in the results of our study. however, robots can obtain other types of information, like linguistic information, that help define selection policies. that is the second direction in which we are working at bnf. following are a few words on this work, which is just beginning. the presence of rare words is a particular feature of focused topic-related sites, for instance. these rare words can be good indicators of valuable, though specialized, sites that aren't discerned from the global linking matrix. rare word appraisal could therefore be used to balance the pure ranking-based appraisal. we are currently testing this at the bnf on a sample of ranked sites for which a "rare words" parameter has been computed. over valuation of certain types of sites could also be balanced if they were detected efficiently. these sites are characterized by linguistic patterns like clue phrases, for example. some applications already achieve fairly good identification of e-commerce sites based on clue phrases. identification of such sites could be used to lower crawling or archiving priority for certain types of sites, even though they are very well ranked. as can be seen, utilization of a site's "popularity" estimation based on in-going links has to be done in a certain way to prove efficiency. but it then appears to be very relevant, as our tests tend to show. testing site importance estimation the test described below was conducted at the bnf during the winter of 2002. 
eight skilled librarians [14] participated in the evaluation of a sample of ranked sites. protocol the test sample was extracted from an 800-million-page crawl made by xyleme in fall 2001. the ranking computation is based on a xyleme algorithm, but it is quite close to google's ranking algorithm (the main difference is that xyleme is able to make this computation without storing the links matrix, which saves both storage capacity and time [15]; for more details on this algorithm, see abiteboul and cobena [10]). the sample was restricted to the .fr domain with the site rankings based on their index page rankings. site rankings were distributed on a 9-level scale, ranging from 10 to 100 with the higher levels usually considered as more significant. xyleme provided 100 unique site urls for each of the 9 levels. the tested levels were the following (table 1): 10, 30, 50, 70, 85, 90, 95, 99 and 100. manual removal of non-responding sites and pure e-commerce sites was conducted by one of the testers in order to save time for the other testers. 664 sites (mainly e-commerce) were removed. figure 2 shows a distribution graph of the 236 remaining sites (dark green). figure 2 the librarians' evaluations of sites were based on their professional experience of the utilization of the traditional legal deposit collection by the bnf's readers. no other specific rules were defined. site evaluation had to be made on a four-level scale by librarians. the scale used was: "should not be taken", "might be taken", "better be in the collection", "should definitely be in the collection". the test was a blind test: each tester made his or her appraisal without seeing the others' evaluations or the site rankings, and the responses were fed through the intranet into a database. results a median human evaluation was calculated for comparison with selection based on the linking parameter [16]. this median result was considered as the "ideal" evaluation. note that ranking wasn't taken into account for evaluating this "ideal result". the results matrix illustrated in figure 3 specifies, for a given level of the linking parameter and of human evaluation, how many sites are concerned. the two white parts indicate where the two appraisals (human and ranking) correlate, the two grey ones where they differ. figure 3 the table shown in figure 4 summarizes these results: figure 4 we can see that ranking enables an effective choice (the no-choice table would have contained 50 in each cell). when human evaluation (median) determines that a site should be included in the collection, the linking parameter is high in 75% of cases. this means that a choice based on the linking parameter would have been equivalent to human appraisal in 75% of cases. for excluding sites from the collection, results are lower. human and linking parameter based choices correlate only in 60% of cases. these results show that ranking and human evaluation correlate quite well. this is obvious when we take a look at each human tester's results compared to the median results. the same matrix has been calculated for each tester. to make a cross-comparison, an "accuracy factor" on a 10-level scale from 1 to 10 (the best correlation) has been calculated as follows: the sum of converging evaluations is divided by the total number of results.
a = (yy + nn) / (yy + nn + yn + ny), where yy and nn count the cases in which the human and ranking appraisals agree (both positive or both negative), and yn and ny the cases in which they differ. figure 5 provides the results for each tester: figure 5   figure 6 the results in figure 6 above show that all valuations tend to converge (>1), and most importantly, the valuation based on the linking parameter is not the one that most diverges from the median (though it was not included in its computation). this means the ranking is a good candidate to help focus the crawl. to go into more detail, let's look at the sites for which the human and ranking valuations diverge most (figure 7). first consider sites for which librarians' appraisals were highest (level 4). none has a linking parameter under 90. this is pretty comforting, for it means a selection based on this parameter would not miss very important sites. if we look at sites on level 3 of human valuation (see figure 8), 4 had a linking parameter under 70, all belonging to a specific type: they are all research lab sites (gazettelabo.fr is a research material vendor gazette). figure 8 librarians have selected these sites because they provide research papers or other documentation of interest. but they are much too specialized to get a high ranking. if we take a look at sites "selected" according to the linking parameter but not by the librarians, we also see types of sites emerging (figure 9). most of the sites (9 of 25) are service sites providing weather forecasts or other types of "everyday life" information. two are chat sites. one may be surprised to find 5 academic sites with a good linking parameter that were nevertheless left out by the librarians. for these sites, testers have estimated that no real material was provided except information on course schedules or a similar kind of general information. librarians have selected other academic sites when they provided real content, research publications, for instance. this kind of refinement seemingly surpasses what the linking parameter can capture. the same thing applies to town council sites, between which librarians discriminated based on content evaluation. the fact that most of the misleading evaluations occurred with regard to specific site types (service, chat, research lab) leads us to think that we will have the ability to correct this situation with site structure analysis and linguistic content evaluation. chat sites and service sites, like e-commerce sites, have specific features that can be analyzed (clue phrases, for instance). more tests need to be done in the future with a crawler able to extract such information. for the rest, these results show that the linking-based parameter is a good candidate to be used for focusing the crawl. 4. conclusion analyzing and evaluating web page or site relevance is a priority for building heritage web collections on a large scale. we can't manually select large collections of sites, but neither can we be satisfied with blind crawls for selecting content for heritage collections. commercial search engines have long been improving the functional capacity of their crawlers for information retrieval purposes. it is now time for national libraries to adapt these techniques for heritage web collection building. the "popularity parameter" is a good candidate to help us in this regard. it can be automatically calculated, and our results show that it correlates with librarian evaluations of site relevance. this correlation can be even better if sites are categorized by "genre" or type.
the techniques described in this article could effectively help national libraries build large-scale, focused web collections and, by allowing a better allocation of resources, achieve continuous archiving for sites that deserve it. we now need to develop tools using this technology and test them on a large corpus. 5. notes and references [1] cathro, w., webb, c. & whiting, j. (2001). "archiving the web: the pandora archive at the national library of australia." preserving the present for the future web archiving conference, copenhagen, 18-19 june 2001. available at: . [2] electronic collections coordinating group, national library of canada, (1998). "networked electronic publications policy and guidelines". available at: . [3] arvidson, a., persson, k. & mannerheim, j. (2000). "the kulturarw3 project—the royal swedish web archiw3e: an example of 'complete' collection of web pages." 66th ifla council and general conference, jerusalem, israel, 13-18 august 2000. available at: . [4] kahle, b. (2002). "editors' interview: the internet archive." rlg diginews, 6 (3), 15 june 2002. available at: . [5] see for example political sites archiving in the netherlands, available at: . or the digital archive for chinese studies (dachs), available at: . we are trying to make an inventory of on-going web archiving projects, so you are welcome to send information about the ones of which you are aware. [6] hakala, j. (2001). "collecting and preserving the web: developing and testing the nedlib harvester." rlg diginews, 5 (2), 15 april 2001. available at: . [7] the main discussion list on this topic is web-archive@cru.fr. information available at: . [8] cf. 2nd ecdl workshop on web archiving, rome, italy, 19 september 2002. available at: . [9] masanès, j. (2002). "archiving the deep web" 2nd ecdl workshop on web archiving, rome, italy, 19 september 2002. available at: . [10] abiteboul, s., cobéna, g., masanès, j. & sedrati, g. (2002). "a first experience in archiving the french web." in: research and advanced technology for digital libraries: 6th european conference, ecdl 2002, agosti, m. & thanos, c., eds., rome, italy, september 16-18, 2002. lecture notes in computer science, 2458. berlin: springer, 1-15. also available at: . [11] to give an illustration of possible gains, here are figures extracted from our 'elections 2002' collection. this collection encompasses 2,200 sites or parts of sites related to the presidential and parliamentary elections held in france in 2002. on a sample of these sites, the 43 most captured ones, we have for april 2,103,360 files from 6 captures, which represent 108 gb of data. among these files only 45.7% are unique files, which represent 56.3% of the total amount of data. this means that more than half of the crawling capacity and 43.7% of the storage capacity is 'wasted' in this case. it is really beneficial to have a crawler able to manage site changes in this kind of 'continuous' crawl. the small crawler we have used, httrack (see ), is able to do incremental crawls and, with a few scripts and a database, it can be used to handle automatic crawls of hundreds of sites. [12] masanès, j. (2001). "the bnf's project for web archiving." what's next for digital deposit libraries? ecdl workshop, darmstadt, germany, 8 september 2001. available at: . [13] brin, s. & page, l. (1998). "the anatomy of a large-scale hypertextual web search engine." computer networks and isdn systems, 30 (1-7), 107-117.
full version published in the proceedings of the 7th international world wide web conference, brisbane, australia, 14-18 april 1998. available at: . [14] véronique berton, virginie breton, dominique chrishmann, christine genin, loïc le bail, soraya salah, jean-yves sarazin and julian masanès. [15] abiteboul, s., preda, m. & cobéna, g. (2002). "computing web page importance without storing the graph of the web (extended abstract)." bulletin of the ieee computer society technical committee on data engineering, 25 (1), 27-33. available at: . [16] thanks to gregory cobéna from inria for his help on this part. copyright © julien masanes doi: 10.1045/december2002-masanes

d-lib magazine may/june 2013 volume 19, number 5/6 ndsa storage report: reflections on national digital stewardship alliance member approaches to preservation storage technologies micah altman mit libraries micah_altman@alumni.brown.edu jefferson bailey metropolitan new york library council jbailey@metro.org karen cariani wgbh media library and archives karen_cariani@wgbh.org michelle gallinger, jane mandelbaum, trevor owens library of congress {mgal, jman, trow}@loc.gov doi:10.1045/may2013-altman   abstract the structure and design of digital storage systems is a cornerstone of digital preservation. to better understand ongoing storage practices of organizations committed to digital preservation, the national digital stewardship alliance conducted a survey of member organizations. this article reports on the findings of the survey. the results of the survey provide a frame of reference for organizations to compare their storage system approaches with ndsa member organizations.   introduction the national digital stewardship alliance (ndsa) is a network of partners dedicated to ensuring enduring access to digital information. the alliance's mission is to establish, maintain, and advance the capacity to preserve our nation's digital resources for the benefit of present and future generations. ndsa membership consists of universities, consortia, professional societies, commercial businesses, professional associations, and government agencies at the federal, state and local level. the alliance sponsors working groups that enrich digital preservation practice for all. the ndsa infrastructure working group identifies and shares emerging practices around the development and maintenance of tools and systems for the curation, preservation, storage, hosting, migration, and similar activities supporting the long term preservation of digital content. in one effort to achieve that, the ndsa infrastructure working group conducted a member survey, examining trends in preservation storage. the goal of this survey was to develop a snapshot of storage practices within the organizations of the ndsa. this survey is part of the group's larger effort to explore ndsa members' current approaches to large-scale storage systems for digital stewardship as well as the potential for cloud computing and storage in digital preservation.
the ndsa storage survey was conducted between august 2011 and november 2011. responses were received from 58 of the 74 members who stated that they were actively involved in preserving digital content at the time. this represents a 78% response rate. ndsa had a total of 98 members during that period. the respondents represent a diverse cross section of organizations working with preservation storage systems. a copy of the survey can be found in appendix 1, and a glossary of terms used in this report in appendix 2.   diverse partnership, common goals the partners who responded to the questionnaire illustrate the diversity of the national digital stewardship alliance. they include federal and state agencies, public and commercial media organizations, research libraries, and non-profit organizations tasked with the stewardship of digital information. each partner has different specific stewardship goals (for example, re-use, public access, internal access, legal mandate, etc.). with that noted, all have a goal to not only preserve but make accessible their digital content in order to help their organization fulfill its mission. partners are storing a wide range of digital materials. nearly all reported a significant amount of text and still images. many are also storing or beginning to store moving images and audio files. for the most part, members described their 2011 collections as between 50 and 400 tb of digital materials, although one respondent is storing 5 pb. partners preserving text documents reported much larger object counts than those with other media. nearly all respondents are using some sort of commercial spinning disk/server storage for online storage in combination with a digital data tape storage system for offline/nearline storage. within the ndsa, many members also participate in a range of distributed replication/infrastructure networks and cooperatives (i.e. lockss, metaarchive, data-pass, etc.) [1].   key findings the key findings from the survey were: 90% of respondents are distributing copies of at least part of their content geographically. 88% of respondents are responsible for their content for an indefinite period of time. 80% of respondents use some form of fixity checking for their content. 75% of respondents report a strong preference to host and control their own technical infrastructure for preservation storage. 69% of respondents are considering, or currently participating in, a distributed storage cooperative or system (ex. lockss alliance, metaarchive, data-pass). 64% of respondents are planning to make significant technological changes in their preservation storage architecture in the next three years. 51% of respondents are considering or already using a cloud storage provider to keep one or more copies of their content. 48% of respondents are considering, or currently contracting out, storage services to be managed by another organization or company. note that these percentages varied by organizational role, although in general the subpopulations are too small to support reliable inferences about the differences. in appendix 3, we provide details on the distribution of key findings by organizational role, and of selected other responses. note that not all respondents responded to all questions, although question-level non-response was generally quite low. throughout this article, proportions reported are calculated as a percentage of those responding to the specified question.
and to support further analysis of response rates, replication and reanalysis, we have deposited a de-identified open access version of the response data in a public archive [2]. trends in the survey responses are grouped in four areas: diversity of access requirements; distributed and remote preservation storage; file fixity and digital preservation storage; and infrastructure plans.   diversity of access requirements diversity is the primary feature of ndsa members' approaches to and requirements for access. member organizations are providing very different degrees of access to their holdings, managing everything from currently inaccessible dark archives to various modes of offline and online access, as well as support for high performance computing usage. access requirements range from very low to very high availability, and can be described using five categories: dark archives, offline availability, nearline availability, online availability, and high-performance availability. the responses indicated that: 59% of the responding members have collections with requirements for instant access to a moderate number of simultaneous users necessitating online availability. 40% of the responding members have collections that are kept for eventual availability only. these collections are dark archives or are being kept strictly for disaster recovery. 28% of the responding members have collections needing nearline availability, meaning the ability to retrieve content within three hours of a request. 24% of the responding members have collections requiring retrieval within two business days of a request; this allows for offline availability. 21% of the responding members have collections that require high-performance availability, which includes access to large numbers of simultaneous users or for high-performance computing. there is also substantial diversity in the access requirements within each organization. member organizations frequently provide different levels of access for different collections they hold. for example, an organization may need to provide high availability for some collections and low availability for other collections. among the five categories of access requirements listed above: 53% reported having a single access level (e.g. online availability) for all the collections they are preserving. that is, just over half of the organizations are providing a single degree of access to all of their materials. 31% reported supporting two degrees of access among their collections (e.g. dark archives and online availability depending on the collection). 16% reported supporting three or more degrees of access among their collections. additionally, many members have different storage systems for preservation and access. a majority of the organizations are providing separate systems for preservation and access. indeed, 65% of the organizations reported using separate systems, while only 35% reported using the same system for both preservation and access.   distributed and remote preservation storage general conversation about "the cloud" in information technology tends to focus on third-party cloud storage providers. adoption of these cloud storage services remains relatively small. however, when we consider cloud storage alongside several related ways of distributing and using storage as a service, some interesting trends emerge. the answers illuminate both the widespread acceptance of some digital preservation storage practices and the continuing uncertainty regarding others.
for example, there is broad acceptance of the importance of geographic redundancy in maintaining preservation copies of content. a majority of members are currently keeping all or some of their preservation copies in multiple geographic locations. this geographic redundancy of digital content signals a success in establishing baseline best practices for preservation storage. similarly, participation in distributed or collective preservation systems is gaining in popularity, with half the respondents participating in or planning on joining such a system. lastly, usage of third-party and cloud-based storage systems is still a disquieting idea. many members are exploring this option, but functionality challenges, issues of trustworthiness, and uncertainty over sustainability are limiting widespread adoption. 76% report keeping data in more than one location for all their content. 14% reported keeping a complete copy in multiple geographic locations for some of their content. 10% reported that they do not keep their data in multiple geographic locations.   cooperatives, contracting out, and cloud storage members were asked if they were currently using, planning to use, exploring the possibility of using, or not considering using a distributed storage cooperative, a contracted provider of storage, or other third party cloud storage providers. the chart below reports the members' responses. figure 1: members' dispositions to cooperatives, contracting, and cloud storage among the membership there is an implied trend toward participating in distributed storage cooperatives. there is also a substantive interest in cloud storage illustrated by the 20 members currently exploring or planning on incorporating cloud storage systems. distributed storage cooperatives or systems 43% are using distributed and/or cooperative systems. 26% are planning on or actively exploring using these systems. 31% of members are not currently considering this storage option. of those using or exploring distributed and cooperative systems, 67% are using or exploring some type of lockss system (lockss was used by 81% of respondents reporting they used a distributed/cooperative system). the trustworthiness of distributed digital preservation cooperatives appears to be gaining acceptance. contracting out storage services 27% are currently contracting out some of their preservation storage to third parties. 4% are planning to contract out some of their preservation storage. 18% are currently exploring this option. 51% of members are not considering contracting out storage services to be managed by a third party. third-party cloud storage service providers 16% of members are using third-party cloud storage service providers for keeping at least one copy of their content. 7% are planning on using third-party cloud storage service providers. 28% are currently exploring this option. 49% are not considering using cloud services for keeping any copies of their content.   control and the cloud the survey revealed the tension between using third-party systems and a preference to host, maintain, and control preservation storage by the organizations themselves. nearly 50% of respondents are using, planning on using, or considering contractor services or third-party cloud storage. at the same time, 74% of the members agreed or strongly agreed that they had a strong preference for maintaining and controlling their preservation storage systems. 
the most-cited reasons for this preference were costs, trustworthiness, legal mandate, and security and risk management. one survey question offers insight into this seeming contradiction. the question asked members to rank the significance of specific preservation storage system features (with 1 being least significant and 7 being most significant). table 1 below shows the results (in the original chart, cells with 10 or more responses were highlighted). the sum for each function was calculated by multiplying the number of responses by the priority and then adding the totals.

table 1: priorities for functionality in new storage systems, sorted by sum of priority scores (columns 1-7 give the number of responses at each priority score, 1 = least significant, 7 = most significant)
functionality | 1 | 2 | 3 | 4 | 5 | 6 | 7 | sum of priority scores
more built-in functions (like fixity checking) | 0 | 3 | 0 | 9 | 14 | 16 | 13 | 299
more automated inventory, retrieval and management services | 0 | 1 | 5 | 7 | 17 | 10 | 15 | 295
more storage | 0 | 4 | 4 | 5 | 12 | 12 | 17 | 291
higher performance processing capacity (ex. indexing on content) | 0 | 3 | 5 | 6 | 21 | 13 | 6 | 270
file format migration | 2 | 3 | 4 | 10 | 17 | 9 | 9 | 261
more security for the content | 1 | 5 | 6 | 15 | 13 | 8 | 8 | 258
block level access to storage | 11 | 16 | 5 | 11 | 7 | 4 | 0 | 161

(for example, the score of 299 for built-in functions is 2×3 + 4×9 + 5×14 + 6×16 + 7×13.) there is strong demand for features that contractor and third-party cloud services are not yet widely satisfying: primarily built-in fixity checking, automated tasks, and migration services; and secondarily, block-level access. there is not as strong an interest by the preservation community in the block-level access feature as in the larger cloud community. instead, the ndsa member organizations represent a group looking for a degree of granularity of control over their data that is not widely shared by organizations that do not have a preservation focus. this is also reflected by the higher participation in distributed cooperative systems. vendor and cloud-based systems are playing a significant role in preservation, but a dearth of functionality and the uncertainties inherent in relinquishing control are likely limiting their widespread use. taking this information about desired functionality into account, these results suggest that views of control are being expressed in different ways. while to some, control may mean block-level access to content, this level of control was far and away the least requested feature. in contrast, to the digital preservation community of the ndsa, built-in functions like fixity checking and automated inventory, retrieval and management services express a different sense of control. built-in functionality provides an organization with preservation information that gives assurance over the integrity of their content. however, it does so by actually reducing the direct control individuals in the organization can exert on digital objects. in this sense, control may be conditionally defined according to specific preservation activity or whether that activity is occurring locally or in "the cloud." here the survey results open up more questions than they answer about exactly what kinds of control member organizations want to be able to exercise. the combined desire for control over storage coupled with a desire for additional automated functionality suggests that desires for control are not manifesting themselves in strong desires for block-level control.   file fixity and digital preservation storage digital objects pose difficulties for ensuring their ongoing authenticity and stability.
files can become corrupted by use, bits can rot even when unused, and during transfer the parts essential to an object's operability can be lost. at the most basic level, digital preservation requires us to be confident that the objects we are working with are the same as they were prior to our interaction with them. to deal with this problem, those in the digital preservation field often talk about the fixity of digital objects. fixity, in this sense, is the property of being constant, steady, and stable. content stewards can check their digital objects to make sure that they maintain these qualities. fixity checking is the process of verifying that a digital object has not been altered or corrupted. in practice, this is most often accomplished by computing and comparing cryptographic hashes (these are sometimes loosely referred to as "checksums").   ndsa members' approaches to fixity checking one key theme that emerged from the survey was the prevalence of fixity checking as a performance requirement and the challenges imposed on storage systems by this activity. eighty-eight percent of the responding members are doing some form of fixity checking on content they are preserving. this widespread use of fixity checking illustrates the recognition that validation of the integrity and consistency of the objects we are preserving is a critical component in digital preservation workflows. with that said, ndsa members are taking distinctly different approaches to checking the fixity of their content. the differences are most likely due to a variety of complicated issues including the scalability of fixity-checking software, network limitations and data transfer costs, transaction volume and access requirements, and other contextual factors around the availability and management of specific sets of content. amongst survey respondents, fixity checking occurs as follows, with some members maintaining multiple practices: 82% of the organizations report that they are doing some form of fixity checking on content they are preserving. 57% of the organizations are doing checks before and after transactions such as ingest. 34% of the organizations are doing checks on some recurring fixed schedule. 32% of the organizations are randomly sampling their content to check fixity. 18% of the organizations use tamper-resistant fixity check mechanisms [3]. 17% of the organizations store fixity information in an independent system. most respondents reported using multiple practices. while fixity checking itself is widespread, ndsa members also take various approaches to scheduling these checks. some are randomly sampled and others use a fixed schedule for checking. twenty-four of the responding organizations use a fixed schedule for at least part of their content. 46% check fixity of content on at least a monthly basis. 21% check fixity of content on at least a quarterly basis. 29% check fixity of content on an annual basis. 4% check fixity of content on a tri-annual basis.   the future of fixity ndsa infrastructure working group members have frequently noted that the state of the art in fixity checking involves distributed fixity checking and frequent, robust repair of intentional or unintentional corruption. this is done by replacing corrupted data with the distributed, replicated, and verified data held at "mirroring" partner repositories in multi-institutional, collaborative distributed networks. the consortia groups metaarchive and data-pass use lockss for this kind of distributed fixity checking and repair.
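the distributed approaches just mentioned all build on a very simple primitive: computing a cryptographic hash per file and comparing it against a stored value. a minimal sketch of such a baseline check against a manifest is given below (paths and file layout are illustrative only; real systems add scheduling, random sampling, tamper-resistant storage of the manifest, and repair from replicated copies).

```python
# minimal sketch of baseline fixity checking: compute a sha-256 digest per file
# and compare it against a previously stored manifest of expected digests.
import hashlib
import json
from pathlib import Path

def sha256(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_fixity(root, manifest_path):
    # manifest: json object mapping relative file paths to expected sha-256 values
    manifest = json.loads(Path(manifest_path).read_text())
    problems = []
    for rel_path, expected in manifest.items():
        target = Path(root) / rel_path
        if not target.exists():
            problems.append((rel_path, "missing"))
        elif sha256(target) != expected:
            problems.append((rel_path, "checksum mismatch"))
    return problems

# e.g. check_fixity("/archive/masters", "/archive/manifest.json")  # illustrative paths
```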
ndsa members' approaches to fixity checking one key theme that emerged from the survey was the prevalence of fixity checking as a performance requirement and the challenges imposed on storage systems by this activity. eighty-eight percent of the responding members are doing some form of fixity checking on content they are preserving. this widespread use of fixity checking illustrates the recognition that validation of the integrity and consistency of the objects we are preserving is a critical component in digital preservation workflows. with that said, ndsa members are taking distinctly different approaches to checking the fixity of their content. the differences are most likely due to a variety of complicated issues including the scalability of fixity-checking software, network limitations and data transfer costs, transaction volume and access requirements, and other contextual factors around the availability and management of specific sets of content. amongst survey respondents, fixity checking occurs as follows, with some members maintaining multiple practices:
82% of the organizations report that they are doing some form of fixity checking on content they are preserving.
57% of the organizations are doing checks before and after transactions such as ingest.
34% of the organizations are doing checks on some recurring fixed schedule.
18% of the organizations use tamper-resistant fixity check mechanisms.3
32% of the organizations are randomly sampling their content to check fixity.
17% of the organizations store fixity information in an independent system.
most respondents reported using multiple practices. while fixity checking itself is widespread, ndsa members also take various approaches to scheduling these checks: some sample their content randomly, while others check on a fixed schedule. twenty-four of the responding organizations use a fixed schedule for at least part of their content. of these:
46% check fixity of content on at least a monthly basis.
21% check fixity of content on at least a quarterly basis.
29% check fixity of content on an annual basis.
4% check fixity of content on a tri-annual basis.
the future of fixity ndsa infrastructure working group members have frequently noted that the state of the art in fixity checking involves distributed fixity checking and frequent, robust repair of intentional or unintentional corruption. this is done by replacing corrupted data with the distributed, replicated, and verified data held at "mirroring" partner repositories in multi-institutional, collaborative distributed networks. the consortia metaarchive and data-pass use lockss for this kind of distributed fixity checking and repair. these consortia and a number of others are also using or testing the safearchive tool, developed by ndsa members, which provides automated collection-fixity and replication-policy auditing on top of distributed storage networks such as lockss.4 as well, some individual institutions use a self-maintained distributed repository system that allows them to replace damaged content with a verified, uncorrupted copy, or are investigating services such as duracloud5 that provide at least some fixity checking services. as previously mentioned, one of the key interests of this ndsa working group was the potential role for cloud storage systems in digital preservation storage architectures. for those using cloud storage systems, complying with fixity requirements can prove problematic. as david rosenthal suggested in 2011,6 cloud services at the time were not able to prove that they are not simply replaying fixity information created and stored at the time of deposit. rosenthal highlighted the need for cloud services to provide a tool or service to verify that the systems hold the content rather than simply caching the fixity metadata. without that kind of assurance, it can be prohibitively expensive to run any kind of frequent fixity checks on content in various cloud storage platforms. built-in functionality like automated fixity checking and repair was highlighted as the most desired feature in future preservation storage systems. this desire, along with the challenges of system-type dependencies and the lack of uniformity in current fixity-checking practices, shows the complex interplay between access, performance, preservation requirements, storage infrastructure, and institutional resources. as practices such as fixity checking become ubiquitous and new options like distributed storage gain further acceptance, the hardware underpinning these requirements will be called upon to meet new demands. our hope is that preservation stewards navigating these decisions will benefit from the knowledge and experience of other ndsa members as they encounter similar complexities and devise new solutions.
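returning to the cloud-verification gap described above: one commonly discussed way for a storage service to demonstrate that it actually holds the bytes, rather than replaying a stored checksum, is a challenge-response check in which the requester supplies a fresh random nonce and the service must return a digest of the nonce plus the object. the sketch below illustrates the idea only; it is not a feature of any particular cloud provider's api, and the "service" objects are stand-ins.

```python
# nonce-based possession check (illustrative only). a cached checksum cannot
# answer a fresh salted challenge, so a matching response implies the service
# really computed the digest over the stored bytes.
import hashlib
import os

def expected_response(local_copy: bytes, nonce: bytes) -> str:
    """digest the steward can compute from its own verified copy of the object."""
    return hashlib.sha256(nonce + local_copy).hexdigest()

def verify_possession(remote_hash_fn, local_copy: bytes) -> bool:
    """challenge the remote side with a fresh nonce and compare the answers."""
    nonce = os.urandom(32)
    return remote_hash_fn(nonce) == expected_response(local_copy, nonce)

if __name__ == "__main__":
    data = b"example archival object"
    honest_service = lambda nonce: hashlib.sha256(nonce + data).hexdigest()
    caching_service = lambda nonce: hashlib.sha256(data).hexdigest()  # replays stored fixity info
    print(verify_possession(honest_service, data))   # -> True
    print(verify_possession(caching_service, data))  # -> False
```
note that this simple version assumes the steward keeps (or can regenerate) a verified local copy against which to compare.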
infrastructure plans there are a number of survey questions which did not fit thematically in earlier sections but will be of interest to both content users and service providers — specifically the storage media currently being used by survey respondents, how many preservation copies of digital assets institutions are keeping, and the number of members that have documented requirements for storage systems.   number of copies, storage media, and documented requirements the chart below shows the number of preservation copies institutions are keeping, with 45% keeping three or more copies of their digital assets. (see appendix 2 for a distribution by organization type.) figure 2: number of preservation copies of digital assets survey participants are keeping figure 3 shows the media being used by members for preservation storage. some members use multiple kinds of media. figure 3: percentages of types of storage media ndsa members use for preservation storage the question "does your organization have specific documented requirements for your storage systems" elicited a wide range of responses to the different types of requirements. forty-nine of the 58 organizations that responded to the survey reported currently having some form of requirements, or planning to develop requirements in the next year. within this subset of respondents, the specific requirements varied:
43% have documented functional requirements.
37% have documented security requirements.
35% have documented general performance requirements.
29% plan to develop requirements within one year.
18% have other documented requirements.
16% have documented performance requirements for ingest.
12% have documented performance requirements for migration to new technology or other one-time intensive operations.
for the 18% claiming "other documented requirements," the additional documented requirements were most often client-specific or content-specific.   storage usage and expectations one fundamental consideration when planning digital preservation infrastructure needs is the amount of storage space required. the survey queried participants both on the amount of storage space they were currently using for all copies of their digital content and the amount they expect to need three years from now.
table 2: storage use and expectations
storage space amount | current storage for all copies | requirement anticipated in 3 years for all copies
under 10 tb | 18 | 13
10-99 tb | 19 | 13
100 to 999 tb | 14 | 16
1000+ tb (1+ pb) | 5 | 9
charting out these numbers shows the expected growth of storage needs in the next three years, especially in the upper ranges of storage amount. the chart shows many of the member organizations moving out of the less than ten terabytes category and into the bigger brackets. notably, the 1000+ tb (1+ pb) category is likely to see the largest increase, almost doubling from 5 members to 9. when averaged out between the two questions, the disparity between the amount of storage used in 2011 and the amount expected to be needed in 2014 becomes even more apparent. the 2011 usage averaged out to 492 tb per institution, whereas anticipated need in three years more than doubled, averaging out to 1107 tb per institution.   predicting future storage needs a number of the survey questions asked members to estimate other aspects of digital preservation storage needs three years in the future. while cost modeling for digital preservation has been getting increased research scrutiny lately,7 the storage survey polled members on issues of strategic planning and administration of infrastructure including expectations on technology changes, available resources, organizational plans, and audit and certification as a trustworthy repository. the speed of technological change and its impact on digital preservation is nowhere more evident than in the fact that 64% of respondents agree or strongly agree that their organization plans to make significant changes in technologies in their preservation storage architecture within the next three years. at the same time, survey participants remained confident of their ability to meet these challenges, with 83% agreeing or strongly agreeing that their institution will have adequate resources to meet projected preservation storage requirements over the next three years.
table 3: future storage needs
in the next three years my organization... | agree | neutral | disagree
plans to make significant changes in preservation storage technologies | 37 (64%) | 10 (17%) | 11 (19%)
will have adequate resources to meet storage requirements | 48 (83%) | 7 (12%) | 3 (5%)
has a plan to meet our preservation storage requirements | 45 (79%) | 8 (14%) | 4 (7%)
plans to meet trustworthy digital repository requirements | 32 (57%) | 19 (34%) | 5 (9%)
as is evident in the table, the statistics on expectations of adequate resources and on organizational planning are very similar. the positivity reflected in these numbers is a good sign for the future of digital preservation. another positive result revolved around expectations for meeting the requirements of the recently approved iso standard 16363, or the related trustworthy repositories audit & certification.8 the fact that 60% of the survey respondents plan on complying with the rigorous trac standards within three years signals an increased acknowledgement of the importance of these requirements in certifying digital preservation repositories.   conclusions survey respondents did not shrink from the challenges of meeting the requirements, current and future, of digital preservation storage. the survey revealed an inherent optimism in addressing future digital preservation storage infrastructure issues even as anticipated storage needs rise dramatically and technology changes often. the results also revealed the complexity of digital preservation storage planning, especially given the large number of preservation copies being maintained, the diversity of media used, and the access requirements documented. the survey results communicated that organizations committed to the long-term preservation of digital materials share concerns and needs across industries. these needs are similar but not identical to those of organizations whose mission does not include providing long-term access to data.   next steps the ndsa plans to reissue the survey on a periodic basis, to track the trends and requirements of the membership and provide useful information to others in the community as well as to service providers. feedback from the members will be incorporated to enhance the survey questions. for example, one question not included in the survey, but that will be considered for future storage surveys, is whether institutions plan to maintain the same number of file copies into the future or whether redundancy policies are flexible in response to infrastructure limitations or forecasting. as the size of each digital item increases, and as size-intensive formats like audio and video become a larger percentage of preserved collections, keeping multiple copies will have an increased impact on storage capacity needs. other potential areas of investigation for a future survey could revolve around the roles that formats, compression, and means of access play in determining storage infrastructure. as institutions plan for their future storage needs, the knowledge sharing and collaboration activities of the ndsa will offer guidance as they make digital preservation infrastructure decisions.   acknowledgement the authors would like to thank the members of the ndsa infrastructure working group.   references 1 for descriptions of these networks see: reich, vicky & rosenthal, david s.h., "lockss: a permanent web publishing and access system", d-lib magazine, vol. 7, no. 6, june 2001, http://dx.doi.org/10.1045/june2001-reich. t. walters & k. skinner (eds.), new roles for new times: digital curation for preservation (pp.
51-53), association of research libraries. altman, m. (2009), digital preservation through archival collaboration: the data preservation alliance for the social sciences, american archivist, 72(1), 170-184. 2 altman, micah; bailey, jefferson; cariani, karen; gallinger, michelle; owens, trevor (2012), "data for ndsa storage report 2011", http://hdl.handle.net/1902.1/19768 v1. 3 this refers to local direct use of tamper-resistant mechanisms. indirect use is higher — as reported above, 25 organizations (43%) participate in a distributed storage cooperative, and over 80% of these cooperatives (34% of the total number of respondents) use tamper-resistant fixity-check mechanisms. 4 altman, m., & crabtree, j. (2011). using the safearchive system: trac-based auditing of lockss. archiving 2011 (pp. 165-170). society for imaging science and technology. 5 see duracloud health checkup. 6 rosenthal, david s. h., lockss in the cloud, presented at "make it work: improvisations on the stewardship of digital information", joint ndsa ndiipp partners meeting, july 19-21, 2011, washington, dc. 7 three recent resources for cost modeling digital preservation of interest are: financial models and calculation mechanisms, hans westerhof, jeff ubois, marius snyders, presto centre, 2011. cdl cost modeling, stephen abrams, 2012. talk at pda2012, david rosenthal, dshr's blog, 2012. 8 trustworthy repositories audit & certification (trac) criteria and checklist, center for research libraries & online computer library center, 2007.   appendix 1 the ndsa storage survey was conducted between august 2011 and november 2011. responses were received from 58 of the 74 members who stated that they were actively involved in preserving digital content at the time. there were a total of 98 members of ndsa during that period.   system survey the infrastructure working group of the national digital stewardship alliance is working to better understand how member organizations are approaching storage for their preservation systems. as part of this effort, we are asking each ndsa member institution to respond to this 22-question survey. for institutions where preservation and access are coupled at the storage level, the following questions should be answered for the entire system. for institutions that have separate archival storage, the questions should be answered for the archival storage only.   1. my organization's storage system uses the following storage media for preservation storage. (check all that apply.) ___ spinning disk — locally or network attached storage (nas) ___ spinning disk — storage area network (san) ___ magnetic tape ___ other (specify)   2. in general, how many preservation copies of the digital assets are you keeping? 1   2   3   4   5   6 or more   3. approximately how many terabytes of storage space do you require for all copies of your content that you manage?   4. approximately how many terabytes of storage space do you anticipate needing for all copies of your content that you manage in three years?   5. is your organization keeping copies of digital assets in geographically distinct places to protect from regional geographic disasters? (check all that apply.)
___ yes, we manage our own copies in one or more geographically distinct offsite locations ___ yes, we keep additional copies of our materials in a distributed collaborative partnership ___ yes, we keep one or more additional copies of our materials managed by another institution or commercial provider ___ in some cases, decided on collection basis ___ no, we would like to but we do not have the resources ___ no, we do not and this is not something we are pursuing   6. when does your organization check the fixity of the content you are preserving? ___ we do fixity checks before and after transactions like ingest ___ we do fixity checks on all content we are preserving at fixed intervals ___ we randomly sample content and check for fixity ___ we store fixity information in an independent system ___ we use a tamper-resistant fixity check mechanism (e.g. lockss, ace) ___ we do not do fixity checks on our content if your organization performs fixity checks on content you are preserving at fixed intervals, how frequently (in months) do you perform those checks? (i.e., if you perform them monthly, enter 1; if every nine months, enter 9; if annually, enter 12). _____   7. does your organization have specific performance requirements for your storage system or systems? (check any and all that apply.) ___ we have documented general performance requirements ___ we have documented performance requirements for ingest ___ we have documented performance requirements for migration to new technology or other one-time intensive operations. ___ we have documented functional requirements ___ we have documented security requirements ___ we plan to develop requirements within one year ___ we have other documented requirements (specify)   8. what are your requirements for access to the content you store? (if you have different requirements for different collections please check each option that applies to one of your collections.) ___ eventual availability only (dark archive/disaster recovery) ___ off-line availability ( e.g. able to retrieve on request w/in 2 business days) ___ near-line availability ( e.g. able to retrieve on request w/in 3 hours) ___ on-line availability (e.g. instant online access for "moderate" number of simultaneous users) ___ high-performance availability ( access to large number of simultaneous users/or for hpc)   9. does your organization use separate storage systems for access-only and preservation-only services? ___ yes ___ no   10. which services does your organization currently provide for files in your preservation storage? (check all that apply.) ___ secure storage with backup and recovery procedures in place ___ periodic fixity checking ___ version control ___ format normalization, format migration, or platform emulation   11. do you provide different services for different "collections" under preservation storage? ___ yes ___ no   12. if you do provide different services for different collections please describe them below.   13. how significant are each of the following general features of preservation systems for meeting your organizations objectives? (1 being insignificant, 5 being most significant.) ___ more storage ___ block level access to storage (not just file level) ___ higher performance processing capacity (to do processing like indexing on content) ___ more built-in functions (like fixity checking) ___ more automated inventory, retrieval and management services ___ more security for the content ___ file format migration   14. 
my organization has a plan to meet our preservation storage requirements over the next three years. ___ strongly disagree ___ disagree ___ neutral ___ agree ___ strongly agree ___ not applicable   15. in general, how long (in years) is your organization responsible for preserving content? (enter 999 if your organization has explicit or implicit indefinite responsibility.)   16. i expect my organization will have adequate resources to meet projected preservation storage requirements over the next three years. ___ strongly disagree ___ disagree ___ neutral ___ agree ___ strongly agree ___ not applicable   17. my organization plans to make significant changes in technologies in its preservation storage architecture within the next three years. ___ strongly disagree ___ disagree ___ neutral ___ agree ___ strongly agree ___ not applicable   18. my organization intends to meet requirements for a trustworthy digital repository according to trac or the planned iso standard 16363 within the next three years. ___ strongly disagree ___ disagree ___ neutral ___ agree ___ strongly agree ___ not applicable   19. is your organization participating in a distributed storage cooperative or system (e.g. lockssalliance, metaarchive, data-pass)? ___ yes, my organization currently participates in distributed storage cooperative or system. ___ no, but my organization is planning to participate in a distributed storage cooperative or system. ___ no, but my organization is currently exploring participating in a distributed storage cooperative or system. ___ no, my organization is not considering participating in a distributed storage cooperative or system. ___ no, and my organization is uninterested in participating in a distributed storage cooperative or system. if you are using, considering or exploring participating in a distributed storage cooperative please list specific cooperative you are participating in, considering or exploring.   20. is your organization contracting out storage services to be managed by another organization or company? ___ yes, my organization currently contracts out storage services which are managed by another organization. ___ no, but my organization is planning to contract out storage services which are managed by another organization. ___ no, but my organization is currently exploring contracting out storage services which are managed by another organization. ___ no, my organization is not considering contracting out storage services which are managed by another organization. ___ no, and my organization is uninterested in considering contracting out storage services which are managed by another organization. if your organization is considering, exploring or currently contracting out storage services to be managed by another organization or company please list specific services you are using, considering or exploring.   21. is your organization using third-party cloud storage service providers (e.g. amazon, rackspace, azure, duracloud, etc.) for keeping one or more copies of its content? ___ yes, my organization currently using third-party cloud storage service providers for keeping one or more copies of its content. ___ no, but my organization is planning to use third-party cloud storage service providers for keeping one or more copies of its content. ___ no, but my organization is currently exploring using third-party cloud storage service providers for keeping one or more copies of its content. 
___ no, my organization is not considering using third-party cloud storage service providers for keeping one or more copies of its content. ___ no, and my organization is uninterested in using third-party cloud storage service providers for keeping one or more copies of its content. if you are using, considering or exploring third-party cloud storage service providers (e.g. amazon, rackspace, azure, duracloud, etc) for keeping one or more copies of its content please list specific services you are using, considering or exploring.   22. my organization has a strong preference to host, maintain, and control its own technical infrastructure for preservation storage. ___ strongly disagree ___ neutral ___ agree ___ strongly agree ___ not applicable if your organization does have a strong preference to host and control its own technical infrastructure for preservation storage why does it have this preference?   appendix 2 glossary of terms this glossary lists how terms are used in this document. access storage: storage designed to contain and serve content to users through common protocols such as the web. often, this is assumed to be available on a public website (or one accessible to a large group of users such as all students and faculty of a university). block-level access: reading and writing to disks at the physical level. only system engineers use block-level access to specify or identify exactly where data are stored, generally for performance reasons. dark archive: storage designed to be inaccessible (except for authorized storage system managers). fixity: the property of being constant, steady, and stable. fixity checking: the process of verifying that a digital object has not been altered or corrupted high-performance availability: it includes access to large numbers of simultaneous users or for high performance computing. nearline storage: storage generally designed to provide retrieval performance between online and offline storage. typically, nearline storage is designed in a way that file retrieval is not instantaneous but is available to the user in the same session. offline storage: storage recorded on detachable media, not under the control of a processing unit (such as a computer). online storage: storage attached under the control of a processing unit (such as a computer) designed to make data accessible close to instantaneously. preservation storage: storage designed to contain and manage digital content for long-term use.   
appendix 3 distribution of key responses by organizational role (percentage and count of respondents in each role: archive / library / museum / other / service provider)
geographic replication
no: 21.43% (3) / 8.00% (2) / 25.00% (1) / 0.00% (0) / 0.00% (0)
yes: 78.57% (11) / 92.00% (23) / 75.00% (3) / 100.00% (6) / 100.00% (8)
keep indefinitely
no: 0.00% (0) / 8.70% (2) / 0.00% (0) / 33.33% (2) / 37.50% (3)
yes: 100.00% (14) / 91.30% (21) / 100.00% (4) / 66.67% (4) / 62.50% (4)
strong control
no: 14.29% (2) / 24.00% (6) / 50.00% (2) / 33.33% (2) / 28.57% (2)
yes: 85.71% (12) / 76.00% (19) / 50.00% (2) / 66.67% (4) / 71.43% (5)
change soon
no: 21.43% (3) / 28.00% (7) / 50.00% (2) / 66.67% (4) / 50.00% (4)
yes: 78.57% (11) / 72.00% (18) / 50.00% (2) / 33.33% (2) / 50.00% (4)
collaborative storage
no: 28.57% (4) / 12.00% (3) / 25.00% (1) / 66.67% (4) / 62.50% (5)
yes: 28.57% (4) / 24.00% (6) / 50.00% (2) / 16.67% (1) / 25.00% (2)
considering: 42.86% (6) / 64.00% (16) / 25.00% (1) / 16.67% (1) / 12.50% (1)
cloud storage
no: 53.85% (7) / 52.00% (13) / 50.00% (2) / 50.00% (3) / 25.00% (2)
yes: 30.77% (4) / 44.00% (11) / 25.00% (1) / 33.33% (2) / 25.00% (2)
considering: 15.38% (2) / 4.00% (1) / 25.00% (1) / 16.67% (1) / 50.00% (4)
third party storage
no: 50.00% (6) / 52.00% (13) / 50.00% (2) / 83.33% (5) / 37.50% (3)
yes: 25.00% (3) / 24.00% (6) / 25.00% (1) / 16.67% (1) / 12.50% (1)
considering: 25.00% (3) / 24.00% (6) / 25.00% (1) / 0.00% (0) / 50.00% (4)
distribution of number of copies by organizational role the height of each colored bar represents the number of respondents from each organizational role that indicated they kept that specific number of copies. the width of the bars represents the proportion of each group of total respondents.   about the authors micah altman is director of research and head/scientist, program on information science for the mit libraries, at the massachusetts institute of technology. dr. altman is also a non-resident senior fellow at the brookings institution. prior to arriving at mit, he served at harvard university for fifteen years as the associate director of the harvard-mit data center, archival director of the henry a. murray archive, and senior research scientist in the institute for quantitative social sciences. dr. altman conducts research in social science, information science and research methods — focusing on the intersections of information, technology, privacy, and politics; and on the dissemination, preservation, reliability and governance of scientific knowledge.   jefferson bailey is strategic initiatives manager at metropolitan new york library council. he previously worked in the office of strategic initiatives at the library of congress in the national digital information infrastructure and preservation program (ndiipp) and digital preservation outreach and education (dpoe) program. he has managed digital projects at brooklyn public library and the frick art reference library and has done archival work at nara and nasa.   karen cariani is the director of the wgbh media library and archives. karen has worked at wgbh since 1984 in television production and archival-related roles. she has 20-plus years of production and project management experience, having worked on numerous award-winning historical documentaries including macarthur, rock and roll, the kennedys, nixon, and war and peace in the nuclear age. she also worked with the wnet, pbs, nyu and wgbh preserving public television partnership as part of the library of congress national digital information infrastructure preservation project. she served two terms (2001-2005) on the board of directors of the association of moving image archivists (amia).
she was co-chair of the amia local television task force, and project director of the guidebook "local television: a guide to saving our heritage," funded by the national historical publications and records commission. she is currently co-chair of the loc national digital stewardship alliance infrastructure working group.   michelle gallinger is digital programs coordinator for the national digital information infrastructure and preservation program at the library of congress. gallinger works to develop the digital preservation community, including the planning and execution of various international aligning national approaches to digital preservation activities. gallinger develops policies and guidelines for digital preservation practices, life cycle management of digital materials, and stakeholder engagement at the library of congress. she also provides strategic planning for the national digital information infrastructure and preservation program, a collaborative project that supports a network of partners exploring the capture, preservation and provision of access to a rich variety of digital information. gallinger developed the initial strategy for, and supported the creation, definition, and launch, of the national digital stewardship alliance in 2010 and is currently the ndsa facilitator. before joining the library of congress, gallinger developed the colonial williamsburg rockefeller library digitization and digital stewardship practices and worked at the university of virginia e-text center.   jane mandelbaum is currently manager of special projects in the office of the director for information technology services at the library of congress (lc). she is currently leading and guiding enterprise-wide projects and architecture initiatives for large-scale high-performance digital storage and archiving. she previously served as it implementation and operations manager for a number of large it systems at lc, and led a team to establish and operate the library's end-user computing environment.   trevor owens is a digital archivist with the national digital information infrastructure and preservation program (ndiipp) in the office of strategic initiatives at the library of congress. at the library of congress, he works on the open source viewshare cultural heritage collection visualization tool, as a member of the communications team, and as the co-chair for the national digital stewardship alliance's infrastructure working group. before joining the library of congress he was the community lead for the zotero project at the center for history and new media and before that managed outreach for the games, learning, and society conference.   copyright © 2013 micah altman, jefferson bailey, karen cariani, michelle gallinger, jane mandelbaum, and trevor owens semantometrics in coauthorship networks: fulltext-based approach for analysing patterns of research collaboration search d-lib:   home | about d-lib | current issue | archive | indexes | calendar | author guidelines | subscribe | contact d-lib   d-lib magazine november/december 2015 volume 21, number 11/12 table of contents   semantometrics in coauthorship networks: fulltext-based approach for analysing patterns of research collaboration drahomira herrmannova kmi, the open university, uk drahomira.herrmannova@open.ac.uk petr knoth mendeley ltd. 
petr.knoth@mendeley.com doi: 10.1045/november2015-herrmannova   printer-friendly version   abstract to date, many studies of scientific citation, collaboration and coauthorship networks have focused on the concept of cross-community ties. in this article we explore how semantometrics can help to characterise the types of research collaboration in scholarly publication networks and the nature of the cross-community ties, and how this information can be utilised in aiding research evaluation. in contrast to the existing research evaluation metrics such as bibliometrics, altmetrics or webometrics, which are based on measuring the number of interactions in the scholarly network, semantometrics build on the premise that fulltext is needed to understand the value of publications. using the core dataset as a case study, this paper looks at the relation between the semantic distance of authors and their research endogamy value. we identify four potential types of collaboration in a coauthorship network. the results suggest similar measures can be used to provide meaningful information about the nature of collaboration in scholarly publication networks. keywords: research evaluation, semantic similarity, research endogamy, research publication datasets   1 introduction to date, many studies of scientific citation, collaboration and coauthorship networks have focused on the concept of cross-community ties [guimerà et al., 2005; lambiotte and panzarasa, 2009; shi et al., 2010; silva et al., 2014]. it has been observed that in citation networks bridging, or cross-community citation patterns, are characteristic of high impact papers [shi et al., 2010]. this is likely due to the fact that such patterns have the potential of linking knowledge and people from different disciplines. the same holds true for cross-community scientific collaboration [newman, 2004; lambiotte and panzarasa, 2009]. likewise, in collaboration and coauthorship networks, it has been shown that newcomers in a group of collaborators can increase the impact of the group [guimerà et al., 2005]. however, the studies to date have predominantly focused on analysing citation and collaboration networks without considering the content of the analysed publications. our work focuses on analysing scholarly networks using the semantic distance of the publications in order to gain insight into the characteristics of collaboration and communication within communities. our hypothesis states that the information about the semantic distance of the communities will allow us to better understand the importance and the types of the cross-community ties (bridges). more specifically, in order to gain insight into the types of collaboration between authors, we are currently investigating the possibility of utilising semantic distance in a coauthorship network, together with the concept of research endogamy [montolio et al., 2013]. in social sciences, endogamy is the practice or tendency of marrying within a social group. this concept can be transferred to research as collaboration with the same authors or collaboration among a group of authors. the concept of research endogamy has been previously used to evaluate conferences [montolio et al., 2013; silva et al., 2014] as well as journals and patents [silva et al., 2014]. we refer to the set of metrics utilising fulltext for research evaluation as semantometrics. 
in contrast to the traditional metrics based on measuring the number of interactions in the scholarly communication network, the premise behind semantometrics is that fulltext is needed to understand the value of publications. in our previous work ([knoth and herrmannova, 2014]) we utilised this idea to develop a metric for assessing a publication's contribution. our results suggested that the semantic distance between fulltexts provides meaningful information about a publication's contribution that is not captured by the traditional metrics. this approach is just one example of the use of semantometrics; the semantometric principles are more widely applicable.   2 research question we investigate the relation that exists between the tendency to collaborate within a group of authors and the semantic distance of their respective research fields. in particular, we are interested in the distribution of the semantic distance of authors collaborating on a publication, the relation between the author distance and their endogamy value, and whether, based on these two measures, there exists a typology of scientific collaboration across, and inside of, knowledge domains. the rationale behind this approach is based on how research collaboration happens. in cases where the scientific collaboration spans fields or disciplines, that research is likely to link the two disciplines and thus provide opportunities for knowledge transfer and for novel visions and ideas [lambiotte and panzarasa, 2009; silva et al., 2014]. on the other hand, collaboration within one discipline can potentially increase the authors' performance [lambiotte and panzarasa, 2009]. we assume that, based on the combination of semantic distance and research endogamy, the types of research collaboration can be divided into four groups (table 1). we believe this classification is a useful tool for characterising types of research collaboration, one that goes beyond the traditional understanding of the concept of bridges as used in scholarly communication networks. while semantic distance allows distinguishing between inter- and intra-disciplinary collaboration, research endogamy allows differentiating between emerging and established research collaborations.
table 1: types of research collaboration based on semantic distance of authors and their research endogamy
  | high endogamy | low endogamy
high distance | established interdisciplinary collaboration | emerging interdisciplinary collaboration
low distance | expert group | emerging expert collaboration
the relation between author similarity and research endogamy is studied using the core dataset. core is an aggregator of both metadata and fulltexts of open access publications from repositories and journals worldwide and covers all scientific disciplines. the availability of publication fulltexts from across disciplines enables us to perform an analysis of author distance. our results provide an overview of the publications available in the core dataset and an exploration of the relation between the endogamy value of the publications and author distance.   2.1 basic concepts this section introduces basic concepts used in this article. in particular, it presents the definition of research endogamy and author distance as used in our experiment. 2.1.1 author distance we propose to measure the semantic distance of all coauthors of publication p as the mean value of the semantic distance over all pairs of the coauthors (equation 1). here a(p) is the set of authors of publication p.
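the formula referenced here as equation 1 appears to have been an image in the original article and does not survive in this text-only copy; the following is a plausible reconstruction from the surrounding prose (the mean of the pairwise semantic distances over the unordered coauthor pairs of a(p)), not a verbatim copy of the published equation.

```latex
% plausible reconstruction of equation 1 (author distance), inferred from the prose
\[
\mathrm{dist}(p) \;=\; \frac{1}{\binom{|A(p)|}{2}} \sum_{\{a_i,\, a_j\} \subseteq A(p),\; a_i \neq a_j} \mathrm{dist}(a_i, a_j)
\]
```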
because semantic distance is a symmetric relation, this calculation can be optimised by disregarding repeating pairs in the calculation, that is, by selecting the author pairs using combinations rather than permutations. the number of pairs is then equal to |a(p)| (|a(p)| - 1) / 2. we calculate the distance of a pair of coauthors by considering each author's publication record as a single text. the distance of two authors is then obtained by calculating the distance between the two texts. while this is a very simplistic approach, it is also beneficial in terms of the complexity of the calculation. another approach would be to calculate the distance between every pair of publications of the two authors, perhaps omitting their coauthored publications. however, because the number of pair combinations of items of two sets has a polynomial growth rate, this number would grow significantly in the case of very productive authors. for this reason we chose to simplify the problem by considering the publications of one author as a single text. 2.1.2 research endogamy in order to distinguish between emerging, short-term and established research collaboration, we propose to combine the semantic distance with the research endogamy value of the publication. the research endogamy of a publication is calculated based on the research endogamy of a set of authors a, which is defined by [montolio et al., 2013] and [silva et al., 2014] using the jaccard similarity coefficient (equation 2). here d(a) represents the papers coauthored by the authors in a. a higher endogamy value then expresses more frequent collaboration. the endogamy of a publication p is then defined by [montolio et al., 2013] and [silva et al., 2014] as the average of the endogamy values of the power set of its authors (equation 3). here l(p) is the set of all subsets of a(p) with at least two authors; l(p) is the union of the lk(p) for k ≥ 2, where lk(p) = c(a(p), k) is the set of all subsets of a(p) of length k. endogamy has one significant limitation due to the way the endogamy of a publication is currently defined. because the calculation of publication endogamy is based on finding the power set of the set of publication authors, the number of times that author endogamy has to be calculated grows exponentially (this number will be exactly 2^|a(p)| - (|a(p)| + 1)). that means that for a publication with 20 authors, author endogamy will have to be calculated on more than 1 million sets. however, it is not uncommon to have publications with more than a thousand authors, especially in some scientific disciplines. a potential simplification could be achieved by splitting the set a(p) into groups of authors who have collaborated together on any publication other than the reference publication p, and using these subsets for the endo(p) calculation instead of using the whole set a(p). because the reference publication p would not be considered in the calculation, this would (potentially) slightly lessen the resulting endogamy values. as the aim of this article is not to redefine research endogamy, we used the existing equation; however, we limited our dataset to publications with 25 or fewer authors.
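the endogamy formulas referenced above as equations 2 and 3 are likewise missing from this text-only copy; the block below is a plausible reconstruction from the prose and from the jaccard-based definition attributed to [montolio et al., 2013] and [silva et al., 2014], with d(a) taken as the set of publications authored by a and l(p) as the set of all subsets of a(p) with at least two authors. it should be read as an interpretation, not as the published equations.

```latex
% plausible reconstructions of equations 2 and 3 (research endogamy), inferred
% from the surrounding definitions rather than copied from the original article
\[
\mathrm{endo}(A) \;=\; \frac{\bigl|\bigcap_{a \in A} D(a)\bigr|}{\bigl|\bigcup_{a \in A} D(a)\bigr|}
\qquad
\mathrm{endo}(p) \;=\; \frac{1}{|L(p)|} \sum_{A \in L(p)} \mathrm{endo}(A),
\qquad
L(p) \;=\; \bigcup_{k=2}^{|A(p)|} L_k(p)
\]
```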
3 experiment this section presents a basic overview of the dataset used in our experiment and the method used to obtain the results. furthermore, it provides a graphical representation of the distribution of research endogamy and author distance in the dataset and discusses the results.   3.1 dataset core (connecting repositories) is an aggregator of open access research papers. core currently aggregates both metadata and full-texts of openly accessible publications from more than 600 repositories and over 10,000 journals worldwide and from across all disciplines. for this study we have used a subset of core composed of:
all full-text documents which core harvested from the open research online (oro) repository (the open university's repository of research publications), and
all other full-text publications found in core that were authored by any of the authors of the publications harvested from oro, added for calculating author distance and publication endogamy.
table 2 presents overview statistics of the dataset. in the table, the average number of collaborators is the mean number of different individuals an author collaborated with. the total number of publications is the number of publications in the dataset after adding all other publications found in core that were authored by any of the authors from oro. more than 4,000 publications were analysed and the whole dataset included over 30,000 publications.
table 2: dataset statistics
fulltext articles for oro | 4,207
number of authors | 8,473
average number of publications per author | 7.61
max number of publications per author | 310
average number of authors per publication | 4.31
max number of authors per publication | 25
average number of received citations | 0.30
average number of collaborators | 80.23
total number of publications | 30,484
we selected the oro repository because we needed a dataset containing the majority of the publications of (at least a subset of) the academics. for this reason an institutional repository was a good candidate. we would like to note that we have not used any methods for disambiguating author names, as this problem is outside the scope of this experiment.   3.2 dataset processing the following information was obtained from the core dataset:
a list of authors of each of the selected documents and the publication record for each of these authors,
the number of times the publication was cited in core, and
the fulltexts of the selected documents.
we calculated the author distance using the cosine similarity measure on tf-idf term-document vectors [manning et al., 2009] created from the document full-texts. the full-texts were pre-processed by removing stop words, tokenising and stemming. the distance used in the author distance measure was then calculated as dist(d1,d2) = 1 - sim(d1,d2), where sim(d1,d2) is the cosine similarity of texts d1 and d2 (the 1 - sim(d1,d2) value is often referred to as a distance although it is not a proper distance metric, as it does not satisfy the triangle inequality property). we have produced two numbers for each of the publications from the oro repository: the distance of the authors of the publication calculated according to equation 1, and the endogamy value of the publication calculated according to equation 3. the python source codes used to produce this experiment are available via bitbucket.
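as a concrete illustration of the distance computation just described, the sketch below builds tf-idf vectors for two authors (each author's publications concatenated into a single text) and returns 1 - cosine similarity. the authors' actual code is the bitbucket release mentioned above; this version uses scikit-learn, omits stemming, and the two toy "publication records" are invented for the example.

```python
# illustrative tf-idf / cosine-distance sketch for the author distance measure;
# not the authors' released code. stemming is omitted for brevity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def author_distance(author_a_texts, author_b_texts):
    """treat each author's publication record as one text and return 1 - cosine similarity."""
    doc_a = " ".join(author_a_texts)
    doc_b = " ".join(author_b_texts)
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([doc_a, doc_b])
    sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    return 1.0 - sim

if __name__ == "__main__":
    # invented stand-ins for two authors' full-text publication records
    a = ["semantic similarity of research publications", "citation analysis and research evaluation"]
    b = ["plankton taxonomy and marine biology", "long-term preservation of observation data"]
    print(author_distance(a, b))
```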
3.3 results figure 1 presents the distribution of both calculated values: research endogamy and author distance. while the author distance is closer to a normal distribution, with mean 0.34 and standard deviation 0.19, the distribution of research endogamy is skewed, with 50% of the publications having a value of less than 0.15. this is an interesting result, as it suggests it is not that common for authors to keep collaborating with the same academics. figure 1: distribution of author distance and endogamy value. figure 2 shows the comparison of author distance and endogamy value with the number of authors. the lines in the plot represent a linear fit of the data. there is no correlation between author distance and the number of authors (pearson r = -0.09). there is a very slight negative correlation between endogamy value and the number of authors (pearson r = -0.22). this is expected behaviour because the likelihood that the endogamy value of a publication will be lower generally increases with the number of authors; however, the two are not directly proportional. figure 2: author distance and endogamy value compared to the number of authors. figure 3 shows the relation between author distance and endogamy value. the lines in the plot represent the mean values of both data series. there seems to be one visible pattern in the data, which is that very few publications fall in the category of high endogamy and high author distance when using the mean values as the division lines. the proportion of publications which fall into this category is 0.07, while the proportion of publications in the other categories varies between 0.27 and 0.38. this would suggest that collaboration across disciplines happens more often on a short-term basis. on the other hand, it seems that intradisciplinary research does not tend to be done in one specific way; for example, researchers do not tend to collaborate more often with the same colleagues. figure 3: author distance and endogamy value. we were interested in whether certain types of publications attract more citations in general. unfortunately the citation data was available only for a very small subset of publications. figure 4 shows the documents for which we had citation data (490 publications). the plot shows the relation between author distance and endogamy value, while the colour of the points indicates the number of received citations. the groups of publications with similar citation counts were selected based on percentiles, with the least cited group representing 50% of the publications and the most highly cited group representing the top 10%. however, the differences between these groups are not large enough to be statistically significant. in our future work we would like to examine the relation between author distance, research endogamy and citation counts on a larger dataset. figure 4: author distance, endogamy value and number of citations.
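to make the four-way reading of figure 3 and table 1 concrete, the short sketch below assigns a publication to one of the four collaboration types using mean author distance and mean endogamy as the division lines, as described above. the mean distance of 0.34 is reported in the paper; the endogamy threshold used in the example call is an invented placeholder, not a value from the study.

```python
# sketch of the table 1 / figure 3 classification: mean values act as division lines.
def collaboration_type(distance, endogamy, mean_distance, mean_endogamy):
    """map a publication's (author distance, endogamy) pair onto the four table 1 categories."""
    high_distance = distance >= mean_distance
    high_endogamy = endogamy >= mean_endogamy
    if high_distance and high_endogamy:
        return "established interdisciplinary collaboration"
    if high_distance:
        return "emerging interdisciplinary collaboration"
    if high_endogamy:
        return "expert group"
    return "emerging expert collaboration"

if __name__ == "__main__":
    # mean_distance comes from the paper; mean_endogamy below is a placeholder
    print(collaboration_type(distance=0.60, endogamy=0.05,
                             mean_distance=0.34, mean_endogamy=0.15))
    # -> emerging interdisciplinary collaboration
```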
4 conclusions as described in this paper, we applied the semantometric idea of using full-texts to recognise types of scholarly collaboration in research coauthorship networks. we applied semantic distance combined with research endogamy to classify research collaboration into four broad classes, and tested this classification using the core dataset. this classification can be useful in research evaluation studies and analytics, e.g. to identify emerging research collaborations or established expert groups. while bridges have been the concern of many research studies, their identification has been limited to the structure of the interaction networks. in contrast to those approaches, our approach takes into account both the interaction network (coauthorship, citations) and the semantic distance between research papers or communities. this provides additional qualitative information about the collaboration, which has not previously been considered.   references [1] roger guimerà, brian uzzi, jarrett spiro, and luís a. nunes amaral. 2005. team assembly mechanisms determine collaboration network structure and team performance. science, 308(april):697-702. http://doi.org/10.1126/science.1106340 [2] petr knoth and drahomira herrmannova. 2014. towards semantometrics: a new semantic similarity based measure for assessing a research publication's contribution. d-lib magazine, 20(11). http://doi.org/10.1045/november2014-knoth [3] r. lambiotte and p. panzarasa. 2009. communities, knowledge creation, and information diffusion. journal of informetrics, 3(3):180-190. http://doi.org/10.1016/j.joi.2009.03.007 [4] christopher d. manning, prabhakar raghavan, and hinrich schutze. 2009. an introduction to information retrieval. cambridge university press, online edition. [5] sergio lopez montolio, david dominguez-sal, and josep lluis larriba-pey. 2013. research endogamy as an indicator of conference quality. sigmod record, 42(2):11-16. http://doi.org/10.1145/2503792.2503795 [6] m. e. j. newman. 2004. coauthorship networks and patterns of scientific collaboration. proceedings of the national academy of sciences of the united states of america, 101(1):5200-5205. http://doi.org/10.1073/pnas.0307545100 [7] xiaolin shi, jure leskovec, and daniel a mcfarland. 2010. citing for high impact. in proceedings of the 10th annual joint conference on digital libraries — jcdl '10, page 49, new york, new york, usa. acm press. http://doi.org/10.1145/1816123.1816131 [8] thiago h. p. silva, mirella m. moro, ana paula c. silva, wagner meira jr., and alberto h. f. laender. 2014. community-based endogamy as an influence indicator. in digital libraries 2014 proceedings, page 10, london, united kingdom. http://doi.org/10.1109/jcdl.2014.6970152   about the authors drahomira herrmannova is a research student at the knowledge media institute, open university, working under the supervision of professor zdenek zdrahal and dr petr knoth. her research interests include bibliometrics, citation analysis, research evaluation and natural language processing. she completed her bs and ms degrees in computer science at brno university of technology, czech republic. aside from her phd she has participated in research projects at the knowledge media institute (core, ou analyse).   petr knoth is a senior data scientist at mendeley, where he develops text-mining tools to support researchers' workflows. dr knoth is also the founder of the core system, which aggregates millions of open access publications from repositories and journals and makes them freely available for text-mining. previously, as a researcher at the open university, he acted as the principal investigator on a number of national and international research projects in the areas of text mining and open science.   copyright © 2015 drahomira herrmannova and petr knoth criteria for the trustworthiness of data centres search d-lib:   home | about d-lib | current issue | archive | indexes | calendar | author guidelines | subscribe | contact d-lib   d-lib magazine january/february 2011 volume 17, number 1/2 table of contents   criteria for the trustworthiness of data centres jens klump helmholtz centre potsdam german research centre for geosciences jens.klump@gfz-potsdam.de doi:10.1045/january2011-klump   printer-friendly version   abstract the use of persistent identifiers to identify data sets as part of the record of science implies that the data objects are persistent themselves. scientific findings, historical documents and cultural achievements are to a rapidly increasing extent being presented in electronic form — in many cases exclusively so.
however, besides the invaluable advantages offered by this form, it also carries serious disadvantages. the rapid obsolescence of the technology required to read the information combined with the frequently imperceptible physical decay of the media themselves represents a serious threat to preservation of the information content. since research projects only run for a relatively short period of time, it is advisable to shift the burden of responsibility for long-term data curation from the individual researcher to a trusted data repository or archive. but what makes a data repository trustworthy? the trustworthiness of a digital repository can be tested and assessed on the basis of a criteria catalogue. these catalogues can also be used as a basis to develop a procedure for auditing and certification of the trustworthiness of digital repository.   introduction the rapid decay of urls pointing to research resources was an important part of the motivation to use persistent identifiers instead of ephemeral urls (see e.g. wren, 2008; lawrence et al., 2001). surely, if we use persistent identifiers to identify digital objects as parts of the record of science these objects themselves need to be persistent and kept in long-term digital repositories and archives. how can the trustworthiness of a particular repository in a network of data repositories (e.g. datacite data publication agents, world data system, esa ground segment, and others) be assessed? in recent years, scientific findings, historical documents and cultural achievements are to a rapidly increasing extent being presented in electronic form — in many cases exclusively so. besides the invaluable advantages offered by this form, it also carries serious disadvantages. in paper documents content and representation come together as one unit, where as in digital formats the content is separate from its representation and requires additional information and technology for the user to access the information. however, the underlying technology is still undergoing further development at an exceptionally fast pace. the rapid obsolescence of the technology required to read the information combined with the frequently imperceptible physical decay of the media themselves represents a serious threat to preservation of the information content. this makes our digital assets particularly vulnerable. given the tasks outlined above, only data centres prepared for long-term preservation can be considered to be trustworthy custodians of our digital heritage. but what makes a data repository trustworthy? this paper will discuss the fundamentals of criteria catalogues for assessing the trustworthiness of an archive for digital research data and how these criteria can be transferred into audit and certification of research data repositories and archives.   reference model and criteria catalogues in the project "publication and citation of primary scientific data" (std-doi), which laid the conceptual and technical foundations for datacite, the question arose how to assess the trustworthiness of digital repositories. at the same time other groups started to investigate the issue of trustworthiness of digital archives. to help assess repositories, tools and metrics have been developed by various preservation organizations. 
to achieve a confluence of approaches in the definition of criteria for trustworthiness of digital archives, members of the digital archiving community developed "ten principles for minimum requirements for trustworthy digital preservation repositories" (center for research libraries (crl) et al., 2007). as early as 1994 it became apparent that criteria for the assessment of trustworthiness of digital archives were needed (dobratz et al., 2008; task force on archiving of digital information, 1996). in 1995 the international standards organisation (iso) approached the consultative committee for space data systems (ccsds) to develop a formal standard for the long-term preservation of data from space missions. in preparing a draft standard it became clear that a reference model was needed as a base for further standard building activities and that a reference model would solve cross-domain problems regarding the long-term preservation of digital materials (rank et al., 2010). the outcome of this process was the open archival information system reference model (oais-rm), or known to most as the "oais model". this document went through several consultation and review phases and was published as an international standard (iso 14721:2003). this standard is currently under review and a draft recommended practice was published in october 2009 (ccsds, 2009). although designed for the curation of space data, the oais model aims to be as context-neutral as possible and deliberately avoids jargon from both the it and archival professions. in this way, oais became a lingua franca for archival information systems that has since become widely adopted because it enables effective communication among projects on a national and international scale. with its general approach and universal applicability the oais model also served as a reference model for criteria catalogues for the assessment of the trustworthiness of digital archives. among these, the most widely known catalogues are: trustworthy repositories audit & certification: criteria and checklist (trac) (ambacher et al., 2007) catalogue of criteria for trusted digital repositories (nestor catalogue) (dobratz et al., 2006, 2009) dcc and dpe digital repository audit method based on risk assessment (drambora) (digital curation centre (dcc) and digital preservation europe (dpe), 2007) dini-certificate document and publication services (dini ag elektronisches publizieren, 2006) data seal of approval (sesink et al., 2008) the underlying principles of all of the above mentioned criteria catalogues are derived from the fundamental concepts of quality management, as formulated in the iso 9000 family of standards. these standards are designed to help organizations ensure they meet the needs of customers and other stakeholders (iso, 2000). key concepts in iso 9000, that also apply to assessing the trustworthiness of digital archives, are the documentation and transparency of activities surrounding the digital archive, the adequacy of the activities to the stated goals and the requirements of the designated user community, and the measurability of the degree of compliance of the archive activities with the criteria for trustworthiness (dobratz et al., 2008). the initiatives described above do not operate in isolation from each other. while the oais model has already been transferred into an iso standard, activities to derive an international standard criteria catalogue for trustworthy digital repositories are still under way. 
currently, trac is being developed into an international standard by iso technical committee tc20/sc13 and ccsds (iso/dis 16363). the nestor criteria catalogue has been published as a draft standard (din 31644) by the german institute for standardization (din). this activity is not in competition with iso/dis 16363 but is intended to complement the work on the iso draft through the international standardisation structures of iso and its national members. in summer 2010, representatives of the respective working groups in ccsds, dans and din signed a "memorandum of understanding" to strengthen cooperation between these initiatives (giaretta et al., 2010).   translating criteria into practice — auditing and certification of digital archives each digital repository has its own targets and specifications. the criteria catalogues for trusted digital repositories, on the other hand, have to take a general approach and thus remain at a high level of abstraction. for application to a specific domain and archive instance, the evaluation criteria have to be translated into the specified context and aligned to the needs of the designated user community. at this point, where abstract criteria are translated into specific use cases, the principle of applicability becomes important. an example of the translation of abstract criteria for the trustworthiness of digital repositories into a specific application is the set of "european ltdp common guidelines" of the european space agency ground segment coordination body (esa gscb) (albani et al., 2010) for their ground segment data centres. in a network of data repositories it is quite likely that not all repositories operate on the same technical level. yet it may be important to define criteria for auditing the performance of the networked repositories. as the example of the ccsds has already shown, the need to preserve data from space missions is particularly pressing; at the same time, space science has a long record of curating data. data from space missions are not held in a central archive but are, at least initially, distributed among mission-specific data systems. in this setting the need arose to find common guidelines for the long-term preservation of these valuable scientific assets. the european space agency (esa) centre for earth observation is the largest european provider of earth observation (eo) data. it also operates as the reference european centre for eo payload data exploitation. long-term preservation of these data and of the ability to discover, access and process them is a fundamental issue and a major challenge at programmatic, technological and operational levels. to harmonise its approach to long-term data preservation among participating data centres, the esa ground segment coordination body (esa gscb), in cooperation with nestor, formulated a set of "european ltdp common guidelines". the esa "common guidelines" document directly addresses esa ground segment data centres. its criteria are referenced against the nestor criteria catalogue and other relevant standards (e.g. metadata encoding, security). its structure follows the data life cycle. early in the design process for the common guidelines, esa gscb recognised that not all data centres operate on the same technical level. at the same time, the requirements towards long-term preservation may differ from case to case. to accommodate these differences among data centres, the esa common guidelines introduce three different levels of compliance. 
each criterion is graded as essential, important, or optional. the criteria are then combined into profiles, or levels of compliance, with an entry level followed by two more advanced levels. to allow for future developments in long-term digital preservation, the grading scheme and levels of compliance can be extended to accommodate even more advanced levels. an approach similar to the european ltdp common guidelines is proposed in the european framework for audit and certification of digital repositories, which was outlined in a memorandum of understanding between ccsds, dans and din (giaretta et al., 2010). this framework defines three levels of trustworthiness:
basic certification through the data seal of approval (dsa);
extended certification through dsa plus an additional publicly available self-audit with an external review based on iso 16363 (trac) or din 31644 (nestor);
formal certification after a full external audit and certification based on iso 16363 (trac) or din 31644 (nestor).
with a wider adoption of standards-based long-term data curation we will see more applications of criteria catalogues to specific data repositories.   conclusion the need for criteria to assess the trustworthiness of digital repositories was recognised by memory institutions and by data centres many years ago. this resulted in a number of initiatives aimed at developing criteria catalogues for trusted digital archives. data centres, in particular those organised in networks of several data repositories and archives, have also shown interest in the auditing and certification of their trustworthiness as long-term digital repositories. the need for certification has led to the initiation of standardisation processes through iso and national standardisation bodies. the standardisation process and the regular exchange between the main initiatives have aided a confluence of these activities, which will lead to a harmonisation of the criteria catalogues. in addition, the growing adoption of criteria catalogues for the auditing of archives and networks of archives has provided useful feedback for the further development of criteria catalogues and auditing procedures for the certification of trusted digital archives.   acknowledgements the author would like to thank his colleagues in the nestor working group "trusted archives", in the project "publication and citation of scientific primary data", and at esa gscb for the interesting and fruitful discussions. the author gratefully acknowledges support by the german research foundation (dfg) through the project "publication and citation of scientific primary data" (std-doi), by the german federal ministry for education and research through nestor, and by esa.   references [1] albani, m., v. beruti, m. duplaa, c. giguere, c. velarde, e. mikusch, m. serra, j. klump, and m. schroeder (2010), long term preservation of earth observation space data european ltdp common guidelines (version 1.1), european space agency, ground segment coordination body, frascati, italy. available from: http://earth.esa.int/gscb/ltdp/europeanltdpcommonguidelines_issue1.1.pdf [2] ambacher, b. et al. (2007), trustworthy repositories audit & certification: criteria and checklist (trac), crl center for research libraries, chicago, il. available from: http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf [3] ccsds (2009), audit and certification of trustworthy digital repositories, draft recommended practice, red book, consultative committee for space data systems, greenbelt, md. 
available from: http://public.ccsds.org/sites/cwe/rids/lists/ccsds%206520r1/attachments/652x0r1.pdf [4] center for research libraries (crl), digital curation centre (dcc), digital preservation europe (dpe), and competence network for digital preservation (nestor) (2007), ten principles, available from: http://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying/core-re [5] digital curation centre (dcc), and digital preservation europe (dpe) (2007), dcc and dpe digital repository audit method based on risk assessment (drambora), digital curation centre, edinburgh, uk. available from: http://www.repositoryaudit.eu/download [6] dini ag elektronisches publizieren (2006), dini-certificate document and publication services 2007 (version 2.0), deutsche initiative für netzwerkinformation (dini), göttingen, germany. available from: http://nbn-resolving.de/urn:nbn:de:kobv:11-10075687 [7] dobratz, s. et al. (2006), catalogue of criteria for trusted digital repositories, die deutsche bibliothek, frankfurt (main), germany. available from: http://edoc.hu-berlin.de/series/nestor-materialien/8/pdf/8.pdf [8] dobratz, s. et al. (2009), catalogue of criteria for trusted digital repositories, nestor materials, deutsche nationalbibliothek, frankfurt (main), germany. [online] available from: http://nbn-resolving.de/urn:nbn:de:0008-2010030806 [9] dobratz, s., p. rödig, u. m. borghoff, a. schoger, and b. rätzke (2008), the use of quality management standards in trustworthy digital archives, in: proceedings of the fifth international conference on preservation of digital objects joining up and working: tools and methods for digital preservation, a. farquhar (ed.), 8 pp., british library, london, uk. available from: http://nbn-resolving.de/urn:nbn:de:kobv:11-10092248 [10] giaretta, d., h. harmsen, and c. keitel (2010), memorandum of understanding to create a european framework for audit and certification of digital repositories. [online] [11] iso (2000), iso 9000:2000: quality management systems — fundamentals and vocabulary, standard, international organization for standardization (iso), geneva, switzerland. available from: http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_detail_ics.htm?csnumber=29280 [12] lawrence, s., f. coetzee, e. glover, d. pennock, g. flake, f. nielsen, r. krovetz, a. kruger, and l. giles (2001), persistence of web references in scientific research, ieee computer, 34(2), 26-31. doi:10.1109/2.901164 [13] rank, r. h., c. cremidis, and k. r. mcdonald (2010), archive standards: how their adoption benefit archive systems, in: standard-based data and information systems for earth observation, l. di and h. k. ramapriyan (eds.), pp. 127-142, springer berlin heidelberg, heidelberg, germany. doi:10.1007/978-3-540-88264-0_8 [14] sesink, l., r. van horik, and h. harmsen (2008), data seal of approval, data archiving and networked services (dans), den haag, the netherlands. available from: http://www.datasealofapproval.org/ [15] task force on archiving of digital information (1996), preserving digital information, commission on preservation and access and the research libraries group, mountain view, ca. available from: http://www.rlg.org/legacy/ftpd/pub/archtf/final-report.pdf [16] wren, j. d. 
(2008), url decay in medline — a 4-year follow-up study, bioinformatics, 24(11), 1381-1385. doi:10.1093/bioinformatics/btn127   about the author jens klump is e-science project manager at the helmholtz centre potsdam — german research centre for geosciences in potsdam, germany. as an "embedded scientist" with degrees in geology and oceanography, dr. klump joins geological research projects to determine their information needs and to help design new e-science tools. he participated in the project "publication and citation of primary research data", which laid the foundations for datacite, and was a member of the "trusted archives" working groups of both the competence network for digital preservation (nestor) and the german institute for standardization (din).   copyright © 2011 jens klump   report on the 1st international critical success factors for institutional change workshop (csfic)   d-lib magazine october 2006 volume 12 number 10 issn 1082-9873   22 september, alicante, spain   karen fill university of southampton, uk   this first csfic one-day workshop was held at the university of alicante, spain, in conjunction with the european conference on research and advanced technology for digital libraries (ecdl '06). it was chaired by dr. hugh davis, university director of education with responsibility for elearning at the university of southampton, uk, and susan eales of the uk's open university. the purpose of csfic '06 was to provide a forum for participants in mature elearning projects to exchange information about the lessons learned from their projects in trying to bring about technological and cultural change within their institutions. the workshop included opening and closing keynotes, presentation of refereed and position papers, and ended with a round table discussion of the critical success factors for embedding elearning in further and higher education. 'embedding elearning – a mixed economy' in the opening keynote, professor mark stiles, university of staffordshire, posed questions about the driving forces behind elearning strategy. do institutions have a strategy to satisfy funding bodies or to underpin real change? after outlining some of the strategic approaches evident today, he described how, at staffordshire, academic staff are no longer offered standalone elearning training courses. rather, they take an integrative approach that makes "development part of production." teachers have to indicate, at the course planning stage, that they want to do elearning, and then they are allocated a support team. the support includes attention to quality enhancement and allows the university to build the evidence base for the 'e' contribution to improved learning and teaching. caroline breslin, university of strathclyde, presented findings from the distributed, innovative design, education and teamwork (didet) project, and she reflected on key issues for scaling up and embedding approaches to integrating learning environments, digital repositories and supporting tools. helen durham, university of leeds, dr. hugh davis and professor david martin, university of southampton, presented three different papers based on the experiences of the digital libraries in support of innovative approaches to learning and teaching in geography (dialogplus) project. 
helen described the collaborative learning activity design (clad) methodology developed during the project, highlighting how it can overcome cultural differences. david spoke about the impact of a collaborative online masters programme on the culture and curriculum in the school of geography. both he and hugh offered thoughtful reflections on embedding blended learning in the institutional culture. presenting a paper intriguingly entitled geronimo's cadillac: lessons for learning object repositories, john casey, of the uhi millennium institute, scotland, made telling points about the need to avoid promoting technology in an area that is not yet ready for it. however, concentration on technical issues that are really minor can be a form of "displacement activity". the main obstacles to embedding elearning are philosophical, pedagogical, political and organisational. john cook from the centre of excellence in teaching and learning in reusable learning objects, london metropolitan university, presented an institutional change model to help identify critical interactions, processes and interdependencies between technology, practice and strategy. iain wallace, glasgow caledonian university, presented the outcomes of the spoken word project and their implications for institutional change. a key point here was the need to build for constant change (pedagogical pluralism). su white, university of southampton, presented findings from longitudinal research on the take-up of ict in uk higher education. this prompted an interesting discussion on whether we are seeing a repetitive cycle of early adoption of technologies followed by institutional reluctance to cross moore's chasm [1]. sarah mcnicol, evidence base, university of central england, presented findings from the support and evaluation study for the digital libraries in the classroom (dlic) programme. gill ferrell, jisc infonet, showed us a simulation tool, adapted from an insead approach to change management, which allows you to look at organisations (people in them) and the networks (formal and informal) and the extent to which key individuals are involved in the change processes (aware, interested, trying and adopter). gilly salmon, professor of elearning and learning technologies at the university of leicester, gave an invited keynote in which she offered reflections and advice based on the first year of implementing the university of leicester's elearning strategy and urged us to visit their media zoo, which encourages the take up of appropriate elearning approaches in the four possible matrix intersections delineated by combinations of new/existing students and new/existing technologies. the workshop finished with a round table discussion and summing up facilitated by susan eales, previously at jisc, now at the uk's open university. key points included: the term elearning causes many difficulties – perhaps it needs to be differentiated from esupport (e.g., putting notes on blackboard). the technology works; the problems are philosophical, pedagogical, political and organisational. academic career paths and curriculum development processes can militate against embedding elearning. there can be multiple levels of staff engagement; it does not have to be all or nothing. funding models that have worked well buy out staff teaching time so that they can concentrate on developing and embedding elearning, include incentives for student involvement in development, and keep the money coming (continuity). 
agile technical support processes are needed (a 'just-in-time' approach). embedding needs to happen at the technological, pedagogical and cultural levels. active involvement and commitment of senior management is critical. harness organisational objectives to make the changes and use organisational structures to nurture them. the csfic workshop proceedings and papers are available online. note [1] moore, g.: crossing the chasm: marketing and selling technology products to mainstream customers. harper business, new york (1991). copyright © 2006 karen fill doi:10.1045/october2006-fill   visual search for supporting content exploration in large document collections   d-lib magazine july/august 2012 volume 18, number 7/8   drahomira herrmannova and petr knoth kmi, the open university {drahomira.herrmannova, p.knoth}@open.ac.uk doi:10.1045/july2012-herrmannova   abstract in recent years a number of new approaches for visualising and browsing document collections have been developed. these approaches try to address the problems associated with the growing amounts of content available and the changing patterns in the way people interact with information. users now demand better support for exploring document collections to discover connections, and to compare and contrast information. although visual search interfaces have the potential to improve the user experience in exploring document collections compared to textual search interfaces, they have not yet become as popular among users. the reasons for this range from the design of such visual interfaces to the way these interfaces are implemented and used. in this paper we study these reasons and determine the factors that contribute to an improved visual browsing experience. consequently, by taking these factors into account, we propose a novel visual search interface that improves exploratory search and the discovery of document relations. we explain our universal approach and how it could be applied to any document collection, such as news articles, cultural heritage artifacts or research papers. keywords: information visualisation, content exploration, exploratory search   1. introduction search has long been an integral part of many applications and systems. nowadays, it has become a daily activity for almost everyone and it is a common way of accessing data and information. search can often be a complex and time-consuming task [marchionini, 2006]. among the main reasons are information overload and the so-called "lost in hyperspace" problem. information overload comes with the incredible (and growing) speed with which content is generated. the term addresses the fact that with the growing amount of content it becomes harder (or even impossible) to comprehend it. "lost in hyperspace" refers to the problem of navigating in large quantities of virtual (typically hypertext) content. while following links and relationships, people might easily lose track of how they got to their current "position." 
over the last 20 years, search has become an essential activity of our lives, and the way people search and what they require from search interfaces have changed. gary marchionini [marchionini, 2006] divides search tasks into two basic types — lookup search tasks and exploratory search tasks. the names of these two concepts already suggest how search has evolved from single-step "fact retrieval" or "question answering" to a complex activity that incorporates analysing, comparing and evaluating content. while exploratory searches constitute a significant proportion of all searches [rose and levinson, 2004], current search interfaces do not sufficiently support them. this issue has been addressed by a number of researchers by exploring the use of information visualisation. visual search interfaces make use of our visual skills in order to help us navigate through content. an important aspect of visualisations is that they make it easier to communicate structure, organisation and relations in content. they can also improve the search experience by depicting more information in the same space than a typical text search interface, simplify the process of finding relevant information, and provide graphical aid in diversifying results. in this paper we study the use of visual interfaces to aid exploratory search in document collections. document collection visualisations typically project content along one or more selected dimensions — this might be time or other properties of documents in the collection. in contrast, we address this problem by exploring generally applicable principles without considering a specific document collection. with these principles in mind, we design a novel visual interface that can work with any type of dimension and any number of dimensions, and demonstrate its usability on the domain of research publications. the remainder of this paper is organised as follows. section 2 reviews the current work in the field of visualising document collections. section 3 analyses the design principles that constitute good practice in the design of visual interfaces. the design of our visual interface is then presented in section 4. finally, the contribution of the approach is discussed and the future work is outlined.   2. related work current approaches to visualising document collections can be divided, according to the granularity of information they provide about the collection, into the following groups:
collection level — visualise attributes of the collection. these visualisations typically aim at providing a general overview of the collection content.
document level — visualise attributes of the collection items, their mutual links and relations.
intra-document level — visualise the internal structure of a document, such as the distribution of topics within the document.
in this paper, we are concerned with document level visualisations; however, certain concepts from the collection level are also applicable. a well-known example of the first (collection level) type of visualisations is tag clouds [hassan-montero and herrero-solana, 2006], which visually (using attributes like font size and colour) communicate statistical information (such as word, tag or topic frequency) about the collection. a considerable number of these visualisations focus on depicting topics or themes contained in the collection. 
while the themeriver [havre et al., 2002] and the tiara [wei et al., 2010] (figure 1) visualisations both show changes of themes in the collection over time, [chaney and blei, 2012] use visualisation to reveal the theme structure of a collection (for example, an overview of wikipedia topics from their visualisation can be seen in figure 3). the facetatlas project [cao et al., 2010] (figure 2) focuses on multi-faceted documents and keywords, and combines search with a visualisation depicting the keyword senses and different relations between documents based on these senses. collection level visualisations can also be used for visualising document clusters in a collection. galaxies [wise et al., 1999] or infosky [granitzer et al., 2004] (figure 4) are good examples of document cluster visualisations. visualisations focused on collection level information are well suited for analytical and statistical tasks. they can help in the exploration of the collection by providing an overview of the collection content, as in facetatlas [cao et al., 2010] (figure 2) or tiara [wei et al., 2010] (figure 1). in this case, the exploration happens at the collection level, which provides the user with a general overview of the collection's characteristics.       figure 1: tiara visualisation of the field "cause of injury"       figure 2: facetatlas visualisation of the query word "diabetes"       figure 3: overview of wikipedia topics from [chaney and blei, 2012]       figure 4: infosky visualisation     the second group of visualisations focuses on visualising document level properties. in this paper, we are interested mainly in their use to aid information discovery and content exploration. in this area, a growing number of researchers have been interested in the various collections and networks that have emerged on the web in recent years, one such collection being wikipedia. data from wikipedia are hierarchically organised and highly interlinked, which provides a good foundation for visualisations. [hirsch et al., 2009] created two visualisations, one of them built upon freebase (a collection similar to wikipedia) and the other upon wikipedia (figure 6). both visualisations present the user with articles related to the currently browsed article and with the types of connections between these articles. this way of visualising related articles helps users to quickly explore relevant topics (information about places, people, etc.). [milne and witten, 2011] (figure 5) chose a slightly different approach. they used suggestions of related articles and their clustering to increase the legibility of the visualisation. this is an important quality which can influence whether the user will use or abandon the visualisation. the wivi visualisation (figure 7) created by [lehmann et al., 2010] uses a different approach for suggesting relevant articles. it builds a graph of already visited articles and suggests relevant unvisited articles based on their relevance to all articles in the browsing history. the relevance of unvisited articles is indicated by varying the distance of articles in the visualisation. suggesting relevant items based on multiple interesting documents (instead of one) is a useful feature which might help to narrow the selection of relevant items. in our visual search interface, we utilise a similar approach: we let the user choose and add any documents, and any number of documents, to the visualisation. 
visual search interfaces can also be divided according to the way the exploration is carried out. j. zhang [zhang, 2008] divides search tasks into the following two groups, which are also applicable to visual search interfaces:
browsing-focused — the user starts exploration at a specific point in the collection (typically a root document or a topic; usually the same point is used every time) from which the user navigates through the collection.
query-focused — the user starts with a query, which determines the entry point from which the exploration starts.
as in textual search interfaces, one way to visually explore document collections is to start from an initial point and browse through the collection by navigating from it. the starting point might be, for example, an overview of the whole collection, as in [chaney and blei, 2012] (figure 3) and [granitzer et al., 2004] (figure 4), or it might be a root element of a hierarchy, as in the category view of the wikivis visualisation described in [biuk-aghai, 2006]. in contrast to this way of exploring the collection, query-based search interfaces start with the user specifying a query and building a visualisation based on one ([hirsch et al., 2009], figure 6 and [milne and witten, 2011], figure 5) or multiple ([lehmann et al., 2010], figure 7) documents from the result list. with regard to these two divisions, we would describe our visualisation as document level and query focused. it aims to visualise articles related to a user query and, by showing relations between these articles, to help users decide how and where to focus their further search and exploration.       figure 5: a hopara visualisation showing topics and articles related to the article about "tacoma narrows bridge"       figure 6: a thinkpedia visualisation showing articles related to "semantic web"       figure 7: a wivi visualisation       3. design principles of visual search interfaces in this section, we study general design principles for creating document visualisations, select those that are relevant for visual search interfaces and provide examples of their use. an empirical study of six visual interfaces was carried out by [chen and yu, 2000]. they concluded that users typically perform better (in terms of efficiency and accuracy) with simpler visual interfaces, regardless of their cognitive abilities. a similar study was conducted by sebrechts et al. [sebrechts et al., 1999], who performed a comparative evaluation of textual, 2d and 3d versions of the nirve search interface. the study pointed out that visual interfaces, in contrast to classical textual interfaces, should simplify the process of accessing information. according to the authors, the usability of visual interfaces is dependent on three factors: the visual interface, the task being performed using this interface and the user performing the task. this means that visual interfaces might be better suited for some information-seeking tasks than others (for example, visual search interfaces are probably better suited for exploratory tasks than for lookup tasks). sebrechts et al. also observed several factors that affected the usability of the visual interfaces: the use of colours, the number of documents in the visualisation, the fixed spatial location of the visualisation and the difference between 2d and 3d interfaces. 
we have analysed these factors, discovered interesting examples of their use in the design of visual search interfaces and organised them into the following list of design principles.
added value (note 1) with respect to a textual solution. every visual interface should provide an advantage over a textual interface. the visual interface can assist in the discovery of information that might otherwise be difficult to see, it might increase the speed of communicating the information, it might help to organise the information more clearly, etc. according to [sebrechts et al., 1999] and [baeza-yates and ribeiro-neto, 1999], the visual interface should reduce the mental workload of the user. when document collection exploration is considered, relations between documents might be easier to comprehend using a visual representation rather than a textual one. facetatlas [cao et al., 2010] (figure 2) is a good example of a visualisation which manages to graphically communicate relations (in this case relations and connections between items based on different facets) that would be difficult to present textually.
simplicity. one of the main reasons why textual interfaces are often preferred over visual interfaces is that they can be used with little previous knowledge. this is due to their simplicity and the fact that their design mostly follows standard patterns. visual interfaces that are simple and do not require any learning curve have been found generally better than more complex ones [chen and yu, 2000]. we are not aware of any popular visual search interface that would be preferred over a textual one for its simplicity.
visual legibility strongly influences user experience with the search interface. barely readable text labels, overlapping items or too many items in the view may be a reason for the user to prefer a textual interface even if the visual interface conveys more information. for example, the hopara search interface [milne and witten, 2011] (figure 5) achieves legibility through document clustering and the suggestion of relevant topics.
use of colours is a simple but very powerful tool. colours can help to immediately identify a shared feature, the type of a relation, a membership in a group, etc. the study [sebrechts et al., 1999] pointed out that colours helped to immediately identify groups of articles (regardless of the type or dimension of the interface). [lehmann et al., 2010] (figure 7) and [cao et al., 2010] (figure 2) show how colours can be used in visual search interfaces.
dimension of the visualisation projection. 3d interfaces might be useful and legible in some cases but inconvenient in others. a disadvantage of 3d interfaces is that not all parts of the visualisation might be visible in a single view (as in the wikivis visualisation presented in [biuk-aghai, 2006], figure 8) — this reduces legibility and makes navigation more difficult.
fixed spatial location of the visualisation. sebrechts et al. [sebrechts et al., 1999] point out that once users started to rotate the 3d visual interface, they lost track of relations that were no longer visible. this might also apply to 2d interfaces that require zooming. as a result, it is important to consider the use of features such as rotation and zooming, and what effect they have on navigation.       figure 8: a wikivis visualisation 
4. the approach the main motivation of our research is to develop a visual interface that can provide better support for exploratory search than current textual interfaces and that is conceptually applicable to any document collection regardless of its type. the interface should also provide an added value by assisting in the discovery of interesting connections that would otherwise remain hidden.   4.1 the considered types of document collections every document in a collection is defined according to a set of dimensions (sometimes called facets). the dimensions are typically of different types. each document can be described by a set of properties, each of which expresses the value of a corresponding dimension. although the specific dimensions are dependent on the document collection domain, they are always present in a real-world document collection. for example, an article in a news collection can be described by the properties corresponding to dimensions such as time, themes, locations, and relations to other articles. documents describing cultural heritage artifacts can be characterised by artifact type, historical period, style, material, etc. similarly, research articles can be represented by citations, authors, concepts, similarities with other research articles, etc. these properties are either explicit or implicit (e.g. similarity with other items). explicit properties relate to user-defined properties, typically citations, authors and location. implicit properties refer to properties such as document similarity, which usually need to be discovered. many visual interfaces are tailored to specific domains and their dimensions. in this paper, we consider all types of document collections. since our interest lies in collections of scientific publications, we demonstrate our approach on this domain.   4.2 the objectives of our approach our visual search interface is based on the combination of the following principles, which differentiate our approach from previous work:
support for comparing and contrasting content. the search interface should offer the means for comparing and contrasting properties of multiple documents.
support for exploration across dimensions. the search interface should assist in the discovery of interesting relationships across dimensions by taking into account multiple aspects simultaneously.
universal approach to the visualised dimensions. the visual search interface can be adapted to any document collection.
while the first two principles are difficult to realise in textual search interfaces, satisfying the third one is a challenge for visual interfaces. our contribution is in addressing these principles at once.   4.3 the functionality the proposed visual search interface consists of a visualisation area which is supported by a left and a right sidebar. the left sidebar features a search box, which is the starting point of visual search, and an area for the search results. in the first step, the user will enter an initial query into the search box and a list of relevant documents will be displayed. the user can select one of the documents and see its details in the right sidebar. any of these documents can be dragged into the visualisation area, which initialises the visualisation. the visualisation enables the user to perform the following activities: exploring document relations, discovering interesting connections across dimensions, and comparing and contrasting documents. 
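before turning to the individual activities, the document and dimension model of section 4.1 can be sketched in a few lines of python. all names below (Document, EXPLICIT_DIMENSIONS, and the example values) are illustrative assumptions and do not come from the paper or its implementation:

# a minimal sketch of the document/dimension model described in section 4.1;
# names and example values are illustrative assumptions, not the authors' code.
from dataclasses import dataclass, field
from typing import Dict, List

# example dimensions for the research-paper domain used in the paper
EXPLICIT_DIMENSIONS = ["authors", "concepts", "citations"]
IMPLICIT_DIMENSIONS = ["similar_documents"]  # discovered, e.g. via a similarity measure

@dataclass
class Document:
    doc_id: str
    title: str
    # each dimension name maps to this document's property values along that dimension
    properties: Dict[str, List[str]] = field(default_factory=dict)

    def values(self, dimension: str) -> List[str]:
        """return the property values of this document along one dimension."""
        return self.properties.get(dimension, [])

# example: one research article described along the four example dimensions
article = Document(
    doc_id="doc-1",
    title="an example article",
    properties={
        "authors": ["a. author", "b. author"],
        "concepts": ["information visualisation", "exploratory search"],
        "citations": ["doc-7", "doc-9"],
        "similar_documents": ["doc-3"],  # implicit: computed rather than user-defined
    },
)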
4.3.1 exploring document relations the visualisation itself shows the selected document in the centre of the screen in an area we call the document stack. any number of documents can be added to the document stack. this area is surrounded by a set of predefined dimensions that are suitable for the visualised document collection. figure 9 shows the dimensions relevant to the domain of scientific articles. for this example, we chose document authors, concepts (the document topics or themes), similar documents and citations. in a typical collection, there will be many documents related to the content of the document stack and the user can scroll through them. each of the dimensions offers a different view on the related documents. for example, the area showing document authors might reveal other documents from the same authors. similarly, the concepts area enables the user to explore documents discussing the same topics. some of the areas can be customised to further specify the relatedness criteria. this can be achieved by modifying the dimension settings that appear in the bottom right sidebar. for instance, the visualisation allows deselecting any of the concepts in the concepts view and consequently fine-tuning the list of the relevant documents. the document stack can also be minimised (using a small icon in its top left corner) to provide a clear view of all document connections.   figure 9: preview of the visual search interface, showing one document in the document stack and its relations   4.3.2 discovering interesting connections across dimensions just like the documents in the stack, the related documents are also described by the same set of properties — authors, concepts, similar documents and citations. these documents relate not only to the document stack, but also to one another, across the dimensions. for example, one of the cited documents can share an author with a document in the stack. the cited document will appear in two views — in the authors view and in the citations view. our visualisation displays these connections using thin curved lines. if the same document appears in multiple views, it will be connected by a thicker connecting line. any of the related documents can be selected in order to reveal connections across dimensions (using a small reveal connections icon which appears after selecting the related document). this is used to highlight only the documents that relate to both the document stack and the selected document. as shown in figure 10, the interface adjusts the content displayed and hides documents that are not related to the selected document.   figure 10: discovering interesting connections across dimensions by selecting a relevant document   4.3.3 comparing and contrasting documents in the document stack at any time, the user can drag more documents displayed in the visualisation area or in the left sidebar to the document stack. this allows the comparing and contrasting of their properties and relations. the documents in the stack can share any properties. the user can switch between the union and intersection mode as shown in figure 11 to see all the properties and relations of the documents in the stack or only the shared properties and relations. visualised documents can be removed from the document stack.   
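the cross-dimension connections of section 4.3.2 and the union/intersection modes of the document stack described above can be summarised in a short sketch, again under assumed names and data layouts rather than the authors' actual implementation:

# a sketch of two behaviours described above: finding related documents that
# appear in more than one dimension view (candidates for the thicker connecting
# lines) and combining the properties of stacked documents in union or
# intersection mode. names and structures are illustrative assumptions only.
from typing import Dict, List, Set

def cross_dimension_hits(views: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """map each related document id to the dimension views it appears in,
    keeping only documents that occur in two or more views."""
    hits: Dict[str, Set[str]] = {}
    for dimension, doc_ids in views.items():
        for doc_id in doc_ids:
            hits.setdefault(doc_id, set()).add(dimension)
    return {d: dims for d, dims in hits.items() if len(dims) > 1}

def stack_properties(stack: List[Dict[str, Set[str]]], mode: str = "union") -> Dict[str, Set[str]]:
    """combine the per-dimension properties of the documents in the stack,
    either as a union (all properties) or an intersection (shared properties)."""
    combined: Dict[str, Set[str]] = {}
    for dim in {d for doc in stack for d in doc}:
        sets = [doc.get(dim, set()) for doc in stack]
        combined[dim] = set.union(*sets) if mode == "union" else set.intersection(*sets)
    return combined

# example: a cited document that shares an author with the stack shows up in two views
views = {"authors": ["doc-4", "doc-8"], "citations": ["doc-4"], "concepts": ["doc-2"]}
print(cross_dimension_hits(views))  # {'doc-4': {'authors', 'citations'}}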
figure 11: comparing and contrasting documents     4.4 applications in other domains in the previous section, we have described the functionality of the visual search interface and demonstrated how it can be applied in the domain of research papers. the functionality and the interface design are universal and can be used with any document collection. for example, we could apply the interface to a collection of news articles. the dimensions in this case might be time, location, topic, author, links to other news articles, etc. even though we demonstrated the use of the interface on a domain with four dimensions, the principles and the functionality are the same. the only difference is in the number of dimensions. the maximum number of dimensions is in theory not restricted, the only restrictions being the size and resolution of the screen and the limitations of human perception. if more views than can fit on the screen need to be visualised, the interface should allow the user to select the desired combination, but should not allow visualising more than the maximum number, to keep the interface simple and legible. another example of using the visual search interface would be its application in the domain of cultural heritage artifacts. in this case, we can recognise the following dimensions — artifact type, location, historical period, style and significant people related to the artifact (creator, owner, etc.). the visual interface could be used, for example, to discover interesting connections between the location of an artifact and a significant person.   5. future work our next step is to perform a user study to evaluate the usability of the presented visual search interface, its adherence to the identified design principles and its advantages and disadvantages. the results of the user study should help us to fine-tune the approach. we also plan to integrate this visualisation with the core system (note 2), which aggregates research papers from many open access repositories. finally, we would like to provide technical support for reusing the visual search interface in other domains by means of an api. this api would make it possible to use this search interface in different types of document collections. the user would provide a definition of dimensions (a description of the types of information shown in each dimension) and a function (search engine) for each dimension that would return an ordered list of relevant results for a selected document or documents. the api would then accept metadata of documents in the collection and would build a visualisation on top of this metadata.   6. discussion in section 2, we divided document collection visualisations according to the granularity of information they provide about the collection. in the field of visualising research papers, a number of tools for creating collection overviews have been developed. these include the ase [gove et al., 2011] and nvss [shneiderman and aris, 2006] tools, which use citation networks, and the gridl [shneiderman et al., 2000] tool, which creates a visual overview of digital library search results. while these tools can be classified as collection level visualisations, we provide a document level visualisation. document level visualisations, such as [van ham and perer, 2009] and [chau et al., 2011], usually provide a visualisation of the local subgraph surrounding a specified document. 
in contrast, our tool provides a view on multiple dimensions of a specified document (or a set of documents) and the relations between these dimensions. another difference is that the designed visual search interface allows search results in these dimensions to be ranked, ordered according to their relevance and paginated, preserving a key feature of traditional search interfaces. this feature is difficult to provide in visualisations of the local subgraph surrounding a specified document, and we are not aware of any such interface that supports it. in the design of the presented visual search interface, we aimed at addressing some of the main issues of current search interfaces. the presented interface addresses the two problems mentioned in the introduction: information overload and "lost in hyperspace." the interface mitigates information overload in two ways: it (a) helps users to identify different types of connections between documents and (b) helps to explain their meaning. we believe this makes it easier for users to find important information and comprehend it. it also prevents users from "getting lost" in the document space by allowing them to add new documents into the stack without needing to leave the current position. the connections in our visual interface correspond to correlations between dimensions. current search engines typically evaluate the relevance of a user query with respect to all these dimensions at once, which might make it more difficult for the user to discover these correlations. while this behaviour of search engines is often desirable (as it hides complexity), we believe it is not always desirable when exploratory search is needed. for example, in the domain of research publications, if there is a large number of connections between similar documents and citations (relevant to the documents in the stack), indicating a strong correlation between these two dimensions, it probably means that the citations used in the research papers cover the visualised domain well. if this correlation appears between authors and citations, but does not appear between other dimensions, it might indicate that authors do not refer to similar work but rather cite their own papers. these correlations are difficult to spot when using traditional search engines.   7. conclusion this paper studied how visual search interfaces can be used to support exploratory search in large document collections. we analysed common design principles of document visualisations and, based on these principles, designed a novel document level, query-focused visual search interface. we demonstrated by example how this interface can be applied to explore collections of research papers and discussed the applicability of this approach in different domains. the contribution our approach makes is in the combination of the following aspects: support for comparing and contrasting content, support for the discovery and exploration of content across dimensions, and adaptability of the visual interface to different domains.   notes 1. added value stays at a different level of abstraction than the remaining design principles. it refers more to the overall concept of the visual interface than to how the visual interface is presented. 2. http://core.kmi.open.ac.uk   references [1] ricardo a. baeza-yates and berthier ribeiro-neto. 1999. modern information retrieval. addison-wesley longman publishing co., inc., boston, ma, usa. [2] robert p. biuk-aghai. 2006. 
visualizing co-authorship networks in online wikipedia. in proceedings of the international symposium on communications and information technologies (iscit '06), pages 737—742, september 2006. [3] nan cao, jimeng sun, yu-ru lin, david gotz, shixia liu, and huamin qu. 2010. facetatlas: multifaceted visualization for rich text corpora. ieee transactions on visualization and computer graphics, 16(6):1172—1181, november. http://dx.doi.org/10.1109/tvcg.2010.154 [4] allison j. b. chaney and david m. blei. 2012. visualizing topic models. department of computer science, princeton university, princeton, nj, usa. [5] duen horng chau, aniket kittur, jason i. hong, and christos faloutsos. 2011. apolo: making sense of large network data by combining rich user interaction and machine learning. in proceedings of the 2011 annual conference on human factors in computing systems, chi '11, pages 167—176, new york, ny, usa. acm. http://dx.doi.org/10.1145/1978942.1978967 [6] chaomei chen and yue yu. 2000. empirical studies of information visualization: a meta-analysis. international journal of human-computer studies, 53(5):851—866, november. http://dx.doi.org/10.1006/ijhc.2000.0422 [7] r. gove, c. dunne, b. shneiderman, j. klavans, and b. dorr. 2011. evaluating visual and statistical exploration of scientific literature networks. in 2011 ieee symposium on visual languages and human-centric computing (vl/hcc), pages 217—224, september. http://dx.doi.org/10.1109/vlhcc.2011.6070403 [8] michael granitzer, wolfgang kienreich, vedran sabol, keith andrews, and werner klieber. 2004. evaluating a system for interactive exploration of large, hierarchically structured document repositories. in proceedings of the ieee symposium on information visualization, infovis '04, pages 127—134, washington, dc, usa. ieee computer society. http://dx.doi.org/10.1109/infovis.2004.19 [9] yusef hassan-montero and victor herrero-solana. 2006. improving tag-clouds as visual information retrieval interfaces. in proceedings of the inscit2006 conference, merida, october 2006. [10] susan havre, elizabeth hetzler, paul whitney, and lucy nowell. 2002. themeriver: visualizing thematic changes in large document collections. ieee transactions on visualization and computer graphics, 8(1):9—20, january. http://dx.doi.org/10.1109/2945.981848 [11] christian hirsch, john hosking, and john grundy. 2009. interactive visualization tools for exploring the semantic graph of large knowledge spaces. [12] simon lehmann, ulrich schwanecke, and ralf dörner. 2010. interactive visualization for opportunistic exploration of large document collections. information systems, 35(2):260—269, april. http://dx.doi.org/10.1016/j.is.2009.10.004 [13] gary marchionini. 2006. exploratory search: from finding to understanding. communications of the acm, 49(4):41—46, april. http://dx.doi.org/10.1145/1121949.1121979 [14] david n. milne and ian h. witten. 2011. a link-based visual search engine for wikipedia. in proceedings of the 11th annual international acm/ieee joint conference on digital libraries, jcdl '11, pages 223—226, new york, ny, usa. acm. http://dx.doi.org/10.1145/1998076.1998118 [15] daniel e. rose and danny levinson. 2004. understanding user goals in web search. in proceedings of the 13th international conference on world wide web, www '04, pages 13—19, new york, ny, usa. acm. http://dx.doi.org/10.1145/988672.988675 [16] marc m. sebrechts, john v. cugini, sharon j. laskowski, joanna vasilakis, and michael s. miller. 1999. 
visualization of search results: a comparative evaluation of text, 2d, and 3d interfaces. in proceedings of the 22nd annual international acm sigir conference on research and development in information retrieval, sigir '99, pages 3—10, new york, ny, usa. acm. http://doi.acm.org/10.1145/312624.312634 [17] b. shneiderman and a. aris. 2006. network visualization by semantic substrates. ieee transactions on visualization and computer graphics, 12(5):733—740, september—october. [18] ben shneiderman, david feldman, anne rose, and xavier ferré grau. 2000. visualizing digital library search results with categorical and hierarchical axes. in proceedings of the fifth acm conference on digital libraries, dl '00, pages 57—66, new york, ny, usa. acm. http://dx.doi.org/10.1145/336597.336637 [19] f. van ham and a. perer. 2009. "search, show context, expand on demand": supporting large graph exploration with degree-of-interest. ieee transactions on visualization and computer graphics, 15(6):953—960, november—december. http://dx.doi.org/10.1109/tvcg.2009.108 [20] furu wei, shixia liu, yangqiu song, shimei pan, michelle x. zhou, weihong qian, lei shi, li tan, and qiang zhang. 2010. tiara: a visual exploratory text analytic system. in proceedings of the 16th acm sigkdd international conference on knowledge discovery and data mining, kdd '10, pages 153—162, new york, ny, usa. acm. http://dx.doi.org/10.1145/1835804.1835827 [21] james a. wise, james j. thomas, kelly pennock, david lantrip, marc pottier, anne schur, and vern crow. 1999. readings in information visualization, chapter visualizing the non-visual: spatial analysis and interaction with information for text documents, pages 442—450. morgan kaufmann publishers inc., san francisco, ca, usa. http://dl.acm.org/citation.cfm?id=300679.300791 [22] jin zhang. 2008. visualization for information retrieval. springer. http://books.google.co.uk/books?id=x5i-tk8j0goc   about the authors drahomira herrmannova received her bachelor degree from brno university of technology, czech republic, in 2007 and is currently working towards her master's degree. she is doing her internship at the knowledge media institute, the open university, uk, where she is involved in research in the field of digital libraries, semantic technologies and visual interfaces.   petr knoth is a research associate in the knowledge media institute, the open university, focusing on various topics in natural language processing and information retrieval. he has been involved in four european commission funded projects (kiwi, eurogene, tech-it-easy and decipher) and four jisc funded projects (core, servicecore, diggicore and retain) and has a number of publications at international conferences based on this work. petr received his master's degree from the brno university of technology.   copyright © 2012 drahomira herrmannova and petr knoth   the sixth annual 2015 vivo conference   d-lib magazine september/october 2015 volume 21, number 9/10   carol minton morris duraspace cmmorris@duraspace.org doi: 10.1045/september2015-morris   abstract the 2015 vivo conference was held in boston, massachusetts, august 11-14, 2015. 
it was an opportunity for stakeholders and interested community members to exchange information in many areas related to the theory and practice of what it means to "connect, share, and discover" with vivo, an open source semantic web application.   view of the charles river from the vivo poster session and reception.   1 introduction the 2015 vivo conference, held in boston, august 11-14, 2015, was an opportunity for 185 stakeholders and interested community members to have a 3-day exchange of information centered around the theory and practice of what it means to "connect, share, and discover" with the vivo open source semantic web application. conference organizers offered workshops and sessions on related vivo topics including the scholarly ecosystem and team science, linked data and ontologies, research networking system optimization, analytics and organizational strategy, and implementation highlights, all aimed at providing an integrated view of the scholarly work of an organization. the vivo conference is somewhat different from related types of technology conferences. many conference delegates were researchers, field practitioners and/or faculty members with active institutional vivo "profiles". attendees included committed users of vivo technology, participants in scholarly research networks, and key stakeholders in vivo technology and ontology advances. knowledge is built not only on published research findings, but also on understanding the scientific and scholarly provenance around discovery. what was the process around how a discovery came to be known? who brought it forward, and with which institution? who were the collaborators? how was it funded? what was the chain of scholarly validation and review around asserting that the discovery was true? in the past a scholar might have kept an extended chain of connections and related information in his or her head and in a home-grown series of documents, bibliographies, data, presentations, correspondence and contact lists stored online and offline, along with a record of publications. fellow scholars, institutions and funding agencies that are stakeholders in ongoing research did not have a common infrastructure for sharing an understanding of informal or one-dimensional networks and corresponding related information. the ability to gather the threads of discovery within an institution or field to see a complete picture of research dependencies and connections from multiple viewpoints was a missing piece of scholarly infrastructure. enter vivo — a way for scientists and scholars to represent their work, their interests and their network of connections and collaborators. vivo institutional profiles, connected to multiple data streams that can include human resources, publications, course catalogs, funding agencies, teaching resources, external institutional data, patent information and more, are now openly discoverable on the web. delegates to the vivo conference are passionate about advancing vivo because it is both an open source tool and an active community of practitioners that together make integrated views of the scholarly work of individuals and organizations visible, accessible and discoverable in the interest of moving scientific and scholarly research forward. vivo project director mike conlon welcomed delegates to the vivo conference. 
conlon's progress report highlighted vivo 1.8's modular and performance improvements, adding that we are "heading for interoperability across triplestores", integrated search capabilities, and improved processes for turning contributions into deliverable components. he praised the vivo open source development community for ongoing and remarkable volunteer efforts to advance the software.   2 keynotes and invited speakers in the opening keynote address, david weinberger, senior researcher at harvard's berkman center for internet & society, posed the question, "how can vivo make the internet better?" he asked the audience to consider three ideas in determining the internet's effectiveness as a force for positive societal change: is the internet really exceptional technology (cyberexceptionalism); does the technology behind the internet affect people deeply and leave them feeling like they have no control over it (technodeterminism); and is the internet a benevolent force for good or are our brains being rotted by constant interactions with online media (cyberutopianism)? audience opinions were polled informally and ran the gamut from idealists to skeptics. weinberger went on to illustrate that the "argument of (web) architecture", where values are assumed to be embedded for the greater good, is "classic technodeterminism". users tend to feel that whatever happens online is inevitable, hidden, and beyond their control. calling facebook "a billion person prototype", he said that he and colleagues at harvard's berkman center for internet & society are depressed because prevailing online technology seems to supersede other, more individualistic, ways to affect societal change. he concluded by encouraging audience members to remember our human origins and shared history to ensure that the internet delivers on the promise of making the world a better place. dr. james onken, senior advisor to the nih deputy director for extramural research, was also a keynote speaker. he shared information about the research portfolio data infrastructure at the national institutes of health, aimed at improving the ability to demonstrate long-term benefits of research investments. invited speakers included dr. justin starren, m.d., ph.d., facmi, chief, division of health and biomedical informatics, department of preventive medicine, national institutes of health (nih). starren's research focuses on the translation of computer and informatics research into real-world solutions. he offered a glimpse of research networking systems (rns) as a key component of biomedical research in an era of big data and distributed research teams. ina blümel, acting professor at hanover university of applied sciences and arts, presented on the need to grow the future of interconnected research platforms by supporting university curricula teaching linked open data principles. she uses vivo as a teaching tool to help educate students about what open science means and how they can become part of an interconnected research world by learning to think in terms of objects and relations, not solely attributes. the associate director for programs at the national information standards organization (niso), nettie lagace, was also an invited speaker who offered views on the related nature of standards and interoperable data exchange. "standards do the heavy lifting and help you go where you're going faster," she said.
she reviewed examples of niso areas of work and projects that include content and collections development (description), business information (solving business problems and making decisions), and discovery and delivery (focused on user needs and actions). recommended standards and practices around altmetrics (alternative assessments) are based on data beyond citations. lagace reiterated that establishing trust and verifying community support for standards-based assessment is critical to the overall process of scholarly activity.   3 sessions as with most conference reports, it is not possible to report on the full range of 36 excellent panels and presentations and 23 posters. all 2015 vivo conference presentations are available online. here is a sample of some key sessions:   the linked data for libraries (ld4l) project: a progress report dean krafft and jon corson-rikert a demonstration search of library bibliographic resources with scholarly profile information and other types of annotated, linked and open data across cornell, harvard and stanford libraries is the goal of this two-year mellon-funded project. the partners are assembling a flexible ontology standard that will enable the exchange of library information in order to "break out of library silos" and enrich our scholarly resources. the ld4l ontology refines the bibframe ontology proposed by the library of congress and reuses components of the vivo-isf ontology along with other standard frameworks. krafft reviewed use cases that illustrated how to take advantage of many ld4l data combination capabilities. a management tool for virtual collections across collaborating libraries is one of several potential applications. a full cross-institutional catalog search demonstration will be available in january 2016. jon corson-rikert noted many open questions and issues: how will ld4l data scale up to the real-world magnitude of library catalog data; how will geographic data be added for record enrichment and assessment; more stable and persistent uris will be required to take advantage of queries; and how can the "deep scholarly graph" be leveraged to surface more relevant works?   the isf ontology: lessons in mapping graham triggs symplectic upgraded their harvester to update mappings to vivo-isf earlier this year. in upgrading the data translations to the vivo-isf ontology, triggs encountered a lack of standard types, especially in activities, when mapping to a structured ontology. he pointed out that definitions alone aren't a full mapping solution; structure matters. for example, some fields map to objects and some objects map to fields. a professional activity is an object with controlled fields, but when mapped to vivo you have to do more object-level joining. this means you need to assign ids when converting a field to an object, and then map the several references to the same id. triggs suggested a gap analysis process between source and target systems: start with the export of the type and field definitions in elements, even if they have been customized; list incoming object types and map them to vivo-isf classes; then list the fields per object/class and assign them to data properties or object properties. he believes that the community has learned a lot about what we are mapping to, which makes it easier to adapt to different institutions. having well-populated vivo profiles is important. capturing more information and using more standards will bring the community closer to the needs of a common ontology.
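to make the gap-analysis step concrete, here is a minimal python sketch, assuming a hypothetical export of elements type and field definitions; the mapping tables are illustrative only, and the class and property names are indicative rather than the official vivo-isf uris.

```python
# illustrative gap analysis between a source-system export and vivo-isf.
# the type, class and property names are assumptions for this sketch,
# not an authoritative elements export or the official vivo-isf vocabulary.

ELEMENTS_TYPE_TO_ISF_CLASS = {
    "publication": "bibo:Document",
    "grant": "vivo:Grant",
}

# (object type, field) -> ("data" | "object", property name)
FIELD_TO_PROPERTY = {
    ("publication", "title"): ("data", "rdfs:label"),
    ("publication", "journal"): ("object", "vivo:hasPublicationVenue"),
    ("grant", "funder"): ("object", "vivo:assignedBy"),
}

def gap_analysis(export):
    """report incoming types and fields that have no mapping yet.

    `export` is a dict of {object_type: [field, ...]} taken from the
    (hypothetical) type/field definition export of the source system."""
    unmapped_types, unmapped_fields = [], []
    for obj_type, fields in export.items():
        if obj_type not in ELEMENTS_TYPE_TO_ISF_CLASS:
            unmapped_types.append(obj_type)
        for field in fields:
            if (obj_type, field) not in FIELD_TO_PROPERTY:
                unmapped_fields.append((obj_type, field))
    return unmapped_types, unmapped_fields

types, fields = gap_analysis({
    "publication": ["title", "journal", "doi"],
    "professional-activity": ["role", "organisation"],
})
print(types)   # ['professional-activity'] -> still needs a vivo-isf class
print(fields)  # [('publication', 'doi'), ...] -> need data/object properties
```

anything the report lists as unmapped is exactly the material triggs suggests working through, object type by object type and field by field.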
vivo in a networked research ecosystem jing wang, don elsborg, huda khan, david eichmann, and eric meeks this session presented outcomes from an investigation into integration and interoperability to promote a more open and networked research ecosystem. presenters offered observations and solutions around how to strengthen research network interconnections and how to manage greater network complexity. huda khan developed a modular solution that adds functionality to how the vivo application already works. she can retrieve data formatted for display, but can also get the underlying rdf. the lookup is accomplished with the remote vivo solr index. it would be useful for the external lookup to go to ctsasearch, an integrated index of researcher information harvested in vivo data format from 65 institutions (or a sparql endpoint on that data). eric meeks pointed out that it is often easier to read html than the underlying structured data, and that consuming linked and open data as a client is challenging. he does not want "the semantic to be the enemy of the good". a personal name resolution service would enable huda khan's networked connections, suggested david eichmann. he reviewed the services focused on identification of the individual that are integral to a networked research system. moving forward while adhering to system of systems (sos) principles and engineering methodology can lead to improving an institution's ability to contribute to and participate in a vivo-powered networked research ecosystem. jing wang concluded the session with suggestions for how to reach that goal. next steps could include integrating existing work into a planned software development sprint; starting to work with small interconnected networks to build trust; managing community expectations; measuring performance; and adding improvements in subsequent releases.   tackling silos and cultures: vivo outreach and engagement at duke university lamont cannon and julia trimmer what's in your outreach toolbox? staff members tasked with encouraging faculty members to interact with campus systems will tell you that it is challenging to convince busy faculty members that the benefits of participation outweigh their investment of time. cannon and trimmer reviewed best outreach and engagement practices at duke university, where they have convinced a large portion of 5,600 duke faculty members across schools and departments — medicine, nursing, school of education, law, trinity college of arts and sciences, environment and engineering, public policy and the divinity school — to participate in developing faculty profiles for scholars@duke. there is no single way to reach all faculty members at a university across many school and department cultures. they developed a multi-faceted approach that involved planning different approaches to interest communicators or communication officers at each of the schools, it team implementers and the duke office of news and communications. by focusing on those groups, they found that there would be no single way to engage each set of faculty, so they leveraged suggestions and came up with some beginning strategies: integrate campus news stories about faculty accomplishments into vivo profile pages, which was seen as adding value. start regular vivo user group meetings. school communicators (a formal staff designation) were a good source for outreach ideas. regular office hours made training more convenient for faculty — no registration required.
host an information table at various venues on campus with materials that explain scholars@duke, promoting the value proposition. schedule one-on-one meetings with higher-ups and "super users" to understand challenges, manage their concerns and outline plans for moving forward. host focus group lunches for 4-5 faculty members to find out what they liked the most and the least. leverage collections of explanatory digital slides and posters for re-use by faculty and department heads. duke's multi-pronged approach to engaging with faculty around populating scholars@duke has made it simpler to manage interactions between staff and the vivo system. cannon and trimmer agreed that addressing complaints and coming up with solutions were opportunities to foster success. trimmer concluded, "we were surprised by how many people wanted to go to lunch with us."   4 summing up tweets from conference attendees may be found at #vivo15 and #vivo2015. these brief observations are useful in understanding broad concerns and community opinions on the conference proceedings. next year the 2016 vivo conference will be held in denver, colorado, august 17-19, 2016. julia trimmer, duke university, will serve as conference chair, and mike winkler, university of pennsylvania, is the program chair.   about the author carol minton morris is director of marketing and communications for duraspace, and is past communications director for the national science digital library (2000-2009) and fedora commons (2007-2009). she leads editorial content and materials development and dissemination for duraspace publications, web sites, initiatives and online events, and helps connect open access, open source and open technologies people, projects and institutions to relevant news and information. she was the founding editor of nsdl whiteboard report (2000-2009), featuring information from nsdl projects and programs nationwide. follow her at http://twitter.com/duraspace.   copyright © 2015 carol minton morris illinois digital cultural heritage community collaborative interactions among libraries, museums and elementary schools   d-lib magazine january 2002 volume 8 number 1 issn 1082-9873 illinois digital cultural heritage community collaborative interactions among libraries, museums and elementary schools   nuala a. bennett visiting special projects librarian digital imaging and media technology initiative university of illinois library at urbana-champaign <[email protected]> beth sandore associate university librarian for information technology policy and planning university of illinois library at urbana-champaign <[email protected]> evangeline s. pianfetti associate director, office of education technology college of education university of illinois at urbana-champaign <[email protected]> abstract this paper describes how face-to-face and online collaboration enabled diverse institutions to achieve the most salient goals of the digital cultural heritage community project. the primary purpose of the project was to create an online database that housed digitized materials from museums and libraries for use by elementary school teachers in their classrooms. section 1 is an introduction that provides background information about the project. the dublin core (dc) metadata schema was chosen for the database but required some qualifications by our partners, particularly museums.
section 2 outlines the rationale behind the dc field name choices made for the online database and describes the database framework. section 3 lists the motivations and expectations of participating institutions and an evaluation of how well these expectations were met. the project was evaluated by analysis of web site use statistics and focus group interviews. the results of this final evaluation are described in section 4, together with recommendations and conclusions for future continuation of the project. introduction in collaboration with diverse institutions, the digital cultural heritage community project (dchc) [1], funded by the institute of museum and library services, sought to develop, document, and disseminate both the processes and products of a model program of cooperation between museums, libraries, archives and schools, thereby demonstrating how access to innovative technological resources could enhance educational programs. through the use of digital technologies and the internet, the primary goals of this project were to: build a model and a test electronic database of historical information collected from participating museums, libraries, and archives; establish a framework for the creation of the database; test its viability to meet the curriculum needs of teachers in upper elementary school classrooms in east central illinois; and test the efficacy of the electronic database as an educational resource. this paper describes how collaboration enabled these varied institutions to achieve the final outcomes of the project. the overall goal of the project was to build and evaluate a model collaborative digital environment that would provide elementary classrooms with access to multimedia information on topics that addressed illinois state board of education learning standards for social science [2]. 1.1 funding source the institute of museum and library services (imls) provided grant funding under its library-museum collaboration program. the over-riding goal of this imls program is to support innovative projects that model how libraries and museums can work together to expand their services to the public. the program emphasizes serving the community, using technology and enhancing education [3]. successful projects funded under this program should provide models that can be replicated nation-wide. the dchc project was funded for two years under the 1998 library-museum collaboration program. 1.2 project participants the participants in the dchc project were located in east central illinois, and the project was administered by the digital imaging and media technology initiative at the university of illinois library at urbana-champaign. at the outset of the project, museum partners included the early american museum [4] and the mclean county museum of history [5]. the illinois heritage association [6], which is a museum service organization, also partnered in the project. library partners included the lincoln trail libraries system [7] and the rare book and special collections library of the university of illinois [8]. as the project progressed, other museums and libraries took part in the project at different levels, either by attending workshops or adding data to the online database. three elementary schools [9] also participated, with concentration on third, fourth and fifth grade classrooms. 
1.3 goals of the project the dchc was built on the concept of a digital community: institutions would contribute images, text, other multimedia objects and descriptive information addressing common themes to a database. teachers would then use the database to engage their students through more robust lesson plans. the database framework would provide museums, libraries and archives with a basis for identifying common ground among their collections, experimenting with formats, developing best practices, and determining new ways in which they would provide digital access to their materials. the project aimed to make it easy for elementary school teachers to utilize these resources, enabling incorporation of online materials into their classroom activities in meaningful ways for their students. 2. database framework design a number of factors determined the choices of themes for the images and data to be included in the database. primary among these factors were the social sciences curricula that the teachers were using in their classrooms. from the outset, the participating third, fourth and fifth elementary grade teachers shared their social sciences curriculum units with the museum curators and librarians. the curriculum units covered topics ranging from "how we learn about communities", "celebrations and festivals", "french in illinois" and "government in illinois" to "westward expansion" and the "revolutionary war". the curriculum units also incorporated a broad range of the illinois learning standards for social sciences. the illinois state board of education recommends these standards for teachers as guidelines for measuring student progress and ensuring that students are meeting statewide and national expectations. selection of content for the database was based on the curriculum units, the corresponding learning standards and subsequent discussions with the museum curators, archivists and librarians. through online discussions, the teachers became more familiar with the collections available. additionally, the curators and librarians became much more aware of what types of artifacts and documents the teachers would find most useful in their classroom presentations. 2.1 choice of dublin core metadata schema the online database was set up using the dublin core (dc) metadata schema. because of the diversity among partners, we chose to adapt some of the dc field names and to make the user interface simpler to understand in the long term for our schoolteacher users. from the outset, museum curators, in particular, were concerned that the dc fields did not fully allow them to describe their materials. we subsequently broke down several dc fields into sub-fields that all participants agreed would be more appropriate for the database. for example, the description field was broken down into four sub-fields: description, interpretation, curriculum units and state learning standards. museum curators and librarians entering data about an object into the database used these four fields to describe the artifact in the image, to interpret the artifact for the teachers, and to specifically describe which of the teachers' curriculum units and learning standards matched the artifact. in order to ensure uniformity of metadata, we designed the database using a final set of fields that we felt would cover as much of the information as possible coming from the partner institutions [10, 11]. an example of a complete record is shown in figure 1. figure 1.
example of a complete record from the dchc database. 3. participating institutions' expectations and project evaluation the imls national leadership grant program funded the dchc project as a model program of cooperation. the evaluation component of the project was designed to gather feedback on whether and how the framework of this project could serve as a model for museums, libraries and archives to use and build upon in providing widespread, authoritative, and useful access to digitized primary source materials for k-12 institutions. in this section, we synthesize the evaluative information received from our participants that addresses the success of the project as a model program of cooperation, from the human, organizational, and technology perspectives. a critical and technical prerequisite to building a model digital community was the development and testing of an experimental electronic database to which each museum and library could easily contribute information, and from which teachers and students could easily retrieve information. the structure and the content of information from museums, libraries, and archives provided the basis from which we established a framework for the database creation. a final outcome measure was our testing of the viability of the database to meet the curriculum needs of teachers in elementary school classrooms in east central illinois. the database was designed with the intention that it would be both robust and extensible: able to function fully for at least two years beyond the life of the project with minimal maintenance, and flexible enough to accommodate additional functions as well as changes to metadata formats. variables used to evaluate the project were web use statistics and focus group interviews with teachers and museum and library partners. these analyses suggest that the database structure and the search engine functions have succeeded in addressing the core needs of the participants. this success can be directly attributed to the constant involvement of our user group from the initial design through the final use phase of the project. 3.1 database framework in using the dublin core metadata schema for the framework of the database, several of the fields were qualified. in particular, the subject field was qualified with sub-fields, including keywords. this scheme helped to accommodate the different vocabularies and descriptive structures used at the various partner institutions as well as to include reference to the teachers' curriculum units and the illinois learning standards. in turn, these sub-fields can be easily mapped into the "subject" field to accommodate cross-repository searching, which is now implemented on a preliminary basis using open archives initiative (oai) protocols. through interactions with the museums, we found that they were much more interested than libraries in the interpretation of material. they were also concerned with the provenance and usage of their materials. although the dchc database was designed to make digitized material accessible from the partner institutions' various collections, it was not originally intended as an aid to interpreting the meaning of these artifacts for teachers. although the original dchc concept contained no provisions for an interpretation field, after persuasive argument from the museum participants, the database fields were modified to include a separate interpretive sub-field in the subject field.
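to make the qualified structure concrete, here is a minimal python sketch of a dchc-style record and of flattening its sub-fields back into unqualified dublin core for oai harvesting; the field values and the exact placement of the sub-fields are invented for illustration, and only the sub-field names come from the description above.

```python
# illustrative qualified dublin core record; values are invented and the
# grouping of sub-fields under description/subject follows the text above
# only loosely -- it is not the actual dchc schema.
record = {
    "title": "bed key",
    "creator": "unknown",
    "date": "19th century",
    "description": {
        "description": "two interlocking pieces of carved wood.",
        "interpretation": "tool used to tighten the ropes of a rope bed.",
        "curriculum_units": ["how we learn about communities"],
        "state_learning_standards": ["illinois social science standard (example)"],
    },
    "subject": {
        "keywords": ["furniture", "domestic life"],
    },
    "publisher": "early american museum",
}

def to_simple_dc(rec):
    """flatten the qualified sub-fields into plain dc elements so the record
    could be exposed through an oai-pmh feed (oai_dc)."""
    simple = {k: v for k, v in rec.items() if not isinstance(v, dict)}
    desc = rec.get("description", {})
    subj = rec.get("subject", {})
    # join the descriptive sub-fields into one dc:description value
    simple["description"] = " ".join(
        filter(None, [desc.get("description"), desc.get("interpretation")])
    )
    # map the qualifying sub-fields into the unqualified "subject" element,
    # as described above for cross-repository searching
    simple["subject"] = (
        subj.get("keywords", [])
        + desc.get("curriculum_units", [])
        + desc.get("state_learning_standards", [])
    )
    return simple

print(to_simple_dc(record)["subject"])
# ['furniture', 'domestic life', 'how we learn about communities',
#  'illinois social science standard (example)']
```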
the interpretive sub-field provided a method for adding interpretive information that, in the museum community, is crucial to the understanding of a specific object or document. moreover, the interpretation of the database images helps to make history come alive for the elementary school children and decreases the time teachers must spend researching the historic significance of a database artifact. as a result of the dchc project, the library community and the other partners understand that an interpretive field in the database is significant both for ensuring proper historic documentation and for appropriate integration into the curricular unit. finally, time was spent considering the use of controlled vocabularies at the different partner institutions. each of the partners was encouraged to use standard vocabularies such as the art and architecture thesaurus, library of congress subject headings or the thesaurus for graphic materials when entering data into the database. however, our experience is that the usage of controlled vocabularies varies greatly across participating institutions. we will need to spend a considerable amount of time analyzing the data in the subject field to see how much it varies and to develop a method of mapping the various terms used. 3.2 motivations to participate in the final quarter of the project, focus group interview sessions were held for all of the project participants in order to obtain feedback on what they felt were the most salient outcomes of the project. these sessions were audiotaped and later reviewed by the project administrators. representatives from each of the participant institutions were interviewed together in groups. someone familiar with the project but not directly participating on a daily basis posed the questions. the participants answered the questions informally, and the questions led to much dynamic group discussion. the teachers, museum curators and librarians were first asked to identify why they initially agreed to participate in the dchc project. teachers indicated that they were attracted to participate in the project because it provided an opportunity to match mandated state learning standards with classroom activities. curators and librarians indicated that they were motivated to join the project because it provided them with the impetus to do a number of things that they considered institutional priorities but that often had been unfunded mandates, including: focusing on a community outreach project; forming new partnerships with previously unserved or underserved groups; and identifying and assessing collections for digitization. 3.3 curators' and librarians' expectations curators and librarians agreed that involvement with other institutions was an important factor in their decision to collaborate with other partners on this project. they indicated that the project and database provided a concrete avenue for community outreach, offered a good fit for a model set at the national level with state and regional partners collaborating with schools, and made the endeavor concrete, manageable and useful for standards. their initial expectations were met because of the fit between the partners. all partners listed the practicality of the project as important in their decision to participate.
3.4 teachers' expectations the teachers participating in the project wanted to match illinois state learning standards with a practical project, as well as to discover innovative ways of using information technology in their classrooms. teachers felt their expectations were met, but they expressed interest in continuing to use the database and to find means by which other educators and students could be made aware of its existence. as a result of the teachers' expectations, several handouts were produced that the participating teachers could share with their colleagues who had not been directly involved with the project. access to the dchc database was also extended to teachers throughout the state of illinois and elsewhere. 3.5 usefulness of the dchc database in the classroom teachers felt that the project had the potential to be very useful, "i thought interpretation and links to other resources were extremely important aspects of the database" (museum educator) "so simple, even a teacher can use it!" (3rd grade teacher) teachers were not able to use it as much as they would have liked because networking and display capabilities were limited in the classrooms. "i can't display database objects to the whole class." (4th grade teacher) "i have to ask kids to come around the computer." (4th grade teacher) "there are not many possibilities for either whole lab instruction or for kids to do research independently." (5th grade teacher) teachers did have some specific and constructive suggestions for improving the usefulness of the database. for example, they indicated that it was not easy to tell what was in the collection from the interface search screen, and suggested that more detailed summary information would allow a user to receive a better sense of the database content. teachers also suggested that they would like to have a component where users could submit commentary on objects. curators and librarians welcomed this suggestion and indicated they, too, would find it useful to receive feedback from users about objects, documents, interpretations, and descriptions. 3.6 age-appropriateness of information in database the focus group participants were asked if they thought that the interpretations of database images were understandable by the students or, put simply, at a "kid level". teachers agreed that interpretations were not only age appropriate but, indeed, could be adapted to higher grade levels. there was still further discussion of the usefulness of the descriptive and interpretive information in the database, with teachers reinforcing the point that it was possible to utilize this information for different purposes and for a variety of age groups, not just the 3rd, 4th, and 5th grades targeted for this project. "…in one grade you might be studying the concept of expansionism and might need information about lewis & clark and then in another grade level you are studying explorers and would need information on lewis & clark so different units, different grades would all have reason to want access to the database." (5th grade teacher) 3.7 importance of interpretive information during the focus group discussions, one museum curator posed a follow-up question to the teachers about whether having no interpretation might affect teachers' use of particular images. the curator was particularly concerned that artifacts from her institution did not initially have adequate interpretation because of a lack of manpower at her museum during the course of the project. 
(museum curators and librarians pointed out that adding textual data to the database alone could take up to 25 minutes per record.) as a result of time constraints, the museum curator decided to go ahead with entering primary factual "thin" information about the artifacts into the database records but to defer adding interpretive information. she was interested in hearing from the teachers whether this method would affect their use of the database. the teachers responded that interpretations were extremely important because sometimes they or their colleagues did not know the significant details about specific primary source materials. they also did not have the time or facilities to research the details about the materials. furthermore, the teachers identified quality of items and descriptive and interpretive information as being more important for them than having a large quantity of thinly documented information. it was widely agreed by curators, librarians and teachers that the interpretive and descriptive information was critical to the success of teachers and students in forming the correct (historical and factual) interpretation of the use and context of primary source materials. one curator commented that "the interpretation provides the historical context - it is a key factor...without an interpretation, the bed key would just be two pieces of wood." (museum curator) teachers agreed that the database gave them the feeling that the internet was "easy to navigate." (4th grade teacher) they commented that in the past, it was difficult to find historical and social science primary source material using the internet because there was so much information available. they liked the dchc database because it put usable information in one central location and "you could trust the information." (3rd grade teacher) 4. recommendations and conclusions partners felt that the project encouraged museum curators, educators and librarians to take a good look at their collections. it forced them to "dust off some pictures that might otherwise have been forgotten." (librarian) additionally, the project helped the partners see how various parties might relate to artifacts, documents, and their descriptions and interpretations in different ways, depending on their institutional perspective (school, museum, or library) and the intended use of the information. in order to see the dchc database used more widely, it was chosen in late 2000 for inclusion in the alpha-testing phase of the university of illinois implementation of the open archives initiative (oai) metadata harvesting protocol [12]. oai represents an international effort to standardize digital resource discovery across diverse information formats [13]. it was interesting to hear museum curators, librarians and educators report that they still needed to use traditional print means to inform administrators and boards about this new outreach project. one museum director suggested that she could share brief summaries or periodic milestones about the project with the museum board members, institution directors and ceos, but that they would not be interested in reading an entire interim report. the review of the collected evaluation data resulted in five key recommendations for future continuation of the database. (1) continue linking content to statewide learning standards: linking digitized content to these standards was judged to be very valuable by teachers. (2) continue to ensure robustness of the database:
the current database was robust enough to serve searching and deposit of materials from diverse institutions. (3) concentrate more on enabling teachers to utilize images and metadata off-line in the classroom and for assignments: quality, not quantity, of resources was deemed important. (4) continue evaluation of database use: the time line of the dchc project did not allow for adequate use and evaluation of digital resources in the classroom. (5) ensure that information in the database continues to be "trustworthy": educators assigned a high value to the availability of "trustworthy" primary source information via the web. additionally, the successful creation and continued development of a digital community requires a certain amount of ongoing human energy and sustained interest among the partners. framing a project around state learning standards serves as a positive motivation for k-12 teachers, curators and librarians. learning standards are ubiquitous, so this component of the model can be easily replicated in other states, especially those where learning standards exist and teachers are mandated to follow them. in this project, the viability of using curriculum to drive content selection for digitization was tested. the results suggest that this model is successful as long as curriculum is used as one of several driving forces behind content selection. other driving forces include curator knowledge of collections, institutional priorities, and funding. curriculum units and interaction of k-12 teachers provide tangible examples of interest and needs. from this standpoint, the project demonstrated that curriculum goals ought to be reviewed in tandem with content selection wherever possible. further funding was received from the imls for a second project, "teaching with digital content - describing, finding and using digital cultural heritage materials" [14]. the funding is for a two-year period. using the database that the dchc project developed, ten museums and libraries have signed on to include their digitized materials. in addition, fifteen k-12 teachers will be integrating the database into their classroom activities, and a partnership between the teachers and the university of illinois college of education will be maintained. the emphasis of this new grant project is to develop a program that not only supplements the database and enhances the educational programs of museums and libraries, but also helps teachers and pre-service teachers to further integrate digital primary source materials into k-12 curriculum and assignments. acknowledgements the dchc project was funded by the institute of museum and library services under their national leadership grants program, grant no. nl-80113. references [1] digital cultural heritage community project, . [2] illinois state board of education learning standards, . [3] institute of museum and library services, . [4] early american museum, mahomet, illinois, . [5] mclean county museum of history, bloomington, illinois, . [6] illinois heritage association, champaign, illinois, . [7] lincoln trail libraries system, champaign, illinois, . [8] rare book and special collections library of the university of illinois, . [9] lincoln trail elementary school, mahomet, illinois, . oakland elementary school, bloomington, . thomas paine elementary school, . [10] n. bennett, b. sandore, p. miller, a. grunden, "integration of primary resource materials into the primary classroom", proc. museums and the web 2000. available at . [11] n. bennett, b.
sandore, "the illinois digital cultural heritage community: museums and libraries collaborate to build a database for the elementary school classroom," spectra, spring 2001, vol. 28, issue 1, pp. 48-55. [12] the university of illinois at urbana-champaign open archives initiative metadata harvesting project, . [13] x. liu, k. maly, m. zubair and m.l. nelson, "arc an oai service provider for digital library federation", d-lib magazine, vol. 7, no. 4., april 2001. [14] teaching with digital content-describing, finding and using digital cultural heritage materials, . copyright 2002 nuala a. bennett, beth sandore, and evangeline s. pianfetti top | contents search | author index | title index | back issues previous article | conference report home | e-mail the editor d-lib magazine access terms and conditions doi: 10.1045/january2002-bennett   semantic research for digital libraries d-lib magazine october 1999 volume 5 number 10 issn 1082-9873 semantic research for digital libraries hsinchun chen mcclelland professor of management information systems director, artificial intelligence lab management information systems department university of arizona [email protected]   introduction in this era of the internet and distributed, multimedia computing, new and emerging classes of information systems applications have swept into the lives of office workers and people in general. from digital libraries, multimedia systems, geographic information systems, and collaborative computing to electronic commerce, virtual reality, and electronic video arts and games, these applications have created tremendous opportunities for information and computer science researchers and practitioners. as applications become more pervasive, pressing, and diverse, several well-known information retrieval (ir) problems have become even more urgent. information overload, a result of the ease of information creation and transmission via the internet and www, has become more troublesome (e.g., even stockbrokers and elementary school students, heavily exposed to various www search engines, are versed in such ir terminology as recall and precision). significant variations in database formats and structures, the richness of information media (text, audio, and video), and an abundance of multilingual information content also have created severe information interoperability problems -structural interoperability, media interoperability, and multilingual interoperability. federal initiatives: digital libraries and others in may 1995, the information infrastructure technology and applications (iita) working group, which was the highest level of the country's national information infrastructure (nii) technical committee, held an invitational workshop to define a research agenda for digital libraries. (see http://walrus.stanford.edu/diglib/pub/reports/iita-dlw/main.html.) the participants described a shared vision of an entire net of distributed repositories, where objects of any type can be searched within and across different indexed collections [11]. in the short term, technologies must be developed to search across these repositories transparently, handling any variations in protocols and formats (i.e., addressing structural interoperability [8]). in the long term, technologies also must be developed to handle variations in content and meanings transparently. these requirements are steps along the way toward matching the concepts being explored by users with objects indexed in collections [10]. 
the ultimate goal, as described in the iita report, is the grand challenge of digital libraries: deep semantic interoperability - the ability of a user to access, consistently and coherently, similar (though autonomously defined and managed) classes of digital objects and services, distributed across heterogeneous repositories, with federating or mediating software compensating for site-by-site variations...achieving this will require breakthroughs in description as well as retrieval, object interchange and object retrieval protocols. issues here include the definition and use of metadata and its capture or computation from objects (both textual and multimedia), the use of computed descriptions of objects, federation and integration of heterogeneous repositories with disparate semantics, clustering and automatic hierarchical organization of information, and algorithms for automatic rating, ranking, and evaluation of information quality, genre, and other properties. this paper is a short overview of the progress that has been made in the subsequent four years towards meeting this goal of semantic interoperability in digital libraries. in particular, it describes work that was carried out as part of the illinois digital libraries initiative (dli) project, through partnership with the artificial intelligence lab at the university of arizona. attention to semantic interoperability has prompted several projects in the nsf/darpa/nasa funded large-scale digital libraries initiative (dli) to explore various artificial intelligence, statistical, and pattern recognition techniques. examples include concept spaces and category maps in the illinois project [12], word sense disambiguation in the berkeley project [14], voice recognition in the carnegie mellon project [13], and image segmentation and clustering in the project at the university of california at santa barbara [6]. in the nsf workshop on distributed knowledge work environments: digital libraries, held at santa fe in march 1997, a panel of digital library researchers and practitioners suggested three areas of research for the planned digital libraries initiative-2 (dli-2): system-centered issues, collection-centered issues, and user-centered issues. scalability, interoperability, adaptability and durability, and support for collaboration are the four key research directions under system-centered issues. system interoperability, syntactic (structural) interoperability, linguistic interoperability, temporal interoperability, and semantic interoperability are recognized by leading researchers as the most challenging and rewarding research areas. (see http://www.si.umich.edu/santafe/.) the importance of semantic interoperability extends beyond digital libraries. the ubiquity of online information, as perceived by us leaders (e.g., "information president" clinton and "information vice president" gore) as well as the general public, and recognition of the importance of turning information into knowledge have continued to push information and computer science researchers toward developing scalable artificial intelligence techniques for other emerging information systems applications. in a new nsf knowledge networking (kn) initiative, a group of domain scientists and information systems researchers was invited to a workshop on distributed heterogeneous knowledge networks at boulder, colorado, in may 1997.
scalable techniques to improve semantic bandwidth and knowledge bandwidth are considered among the priority research areas, as described in the kn report (see http://www.scd.ucar.edu/info/kdi/): the knowledge networking (kn) initiative focuses on the integration of knowledge from different sources and domains across space and time. modern computing and communications systems provide the infrastructure to send bits anywhere, anytime, in mass quantities - radical connectivity. but connectivity alone cannot assure (1) useful communication across disciplines, languages, cultures; (2) appropriate processing and integration of knowledge from different sources, domains, and non-text media; (3) efficacious activity and arrangements for teams, organizations, classrooms, or communities, working together over distance and time; or (4) deepening understanding of the ethical, legal, and social implications of new developments in connectivity, but not interactivity and integration. kn research aims to move beyond connectivity to achieve new levels of interactivity, increasing the semantic bandwidth, knowledge bandwidth, activity bandwidth, and cultural bandwidth among people, organizations, and communities. semantic research for digital libraries during the dli, which ran from 1994 to 1998, significant research was conducted at all six projects in the area of semantic retrieval and analysis for digital libraries. among the semantic indexing and analysis techniques that are considered scalable and domain independent, the following classes of algorithms and methods have been examined and subjected to experimentation in various digital library, multimedia database, and information science applications: object recognition, segmentation, and indexing: the most fundamental techniques in information retrieval involve identifying key features in objects. for example, automatic indexing and natural language processing (e.g., noun phrase extraction or object type tagging) are frequently used to extract meaningful keywords or phrases from texts automatically [9]. texture, color, or shape-based indexing and segmentation techniques are often used to identify images [6]. for audio and video applications, voice recognition, speech recognition, and scene segmentation techniques can be used to identify meaningful descriptors in audio or video streams [13]. semantic analysis: several classes of techniques have been used for semantic analysis of texts or multimedia objects. symbolic machine learning (e.g., id3 decision tree algorithm, version space), graph-based clustering and classification (e.g., ward's hierarchical clustering), statistics-based multivariate analyses (e.g., latent semantic indexing, multi-dimensional scaling, regressions), artificial neural network-based computing (e.g., backpropagation networks, kohonen self-organizing maps), and evolution-based programming (e.g., genetic algorithms) are among the popular techniques [1]. in this information age, we believe such techniques will serve as good alternatives for processing, analyzing, and summarizing large amounts of diverse and rapidly changing multimedia information. knowledge representations: the results from a semantic analysis process could be represented in the form of semantic networks, decision rules, or predicate logic. many researchers have attempted to integrate such results with existing human-created knowledge structures such as ontologies, subject headings, or thesauri [7].
spreading activation based inferencing methods often are used to traverse various large-scale knowledge structures [3]. human-computer interactions (hci) and information visualization: one of the major trends in almost all emerging information systems applications is the focus on user-friendly, graphical, and seamless human computer interactions. the web-based browsers for texts, images, and videos have raised user expectations of rendering and manipulation of information. recent advances in development languages and platforms such as java, opengl, and vrml and the availability of advanced graphical workstations at affordable prices have also made information visualization a promising area for research [5]. several digital library research teams including arizona/illinois, xerox parc, berkeley, and stanford, are pushing the boundary of visualization techniques for dynamic displays of large-scale information collections. the illinois dli project: federating repositories of scientific literature the artificial intelligence lab at the university of arizona was a major partner in the university of illinois dli project, one of six projects funded by the nsf/darpa/nasa dli (phase 1). the project consisted of two major components: (1) a production testbed based in a real library and (2) fundamental technology research for semantic interoperability (semantic indexes across subjects and media developed at the university of arizona). the testbed this section is a brief summary of the testbed. readers can find more details in [12]. the illinois dli production testbed was developed in the grainger engineering library at the university of illinois at urbana-champaign (uiuc). it supports full federated searching on the structure of journal articles, as specified by sgml markup, using an experimental web-based interface. the initial rollout was available at the uiuc campus in october 1997 and has been integrated with the library information services. the testbed consists of materials from 5 publishers, 55 engineering journals, and 40,000 full-text articles. the primary partners of the project include: american institute of physics, american physical society, american astronomical society, american society of civil engineers, american society of mechanical engineers, american society of agricultural engineers, american institute of aeronautics and astronautics, institute of electrical and electronic engineers, institute of electrical engineers, and ieee computer society. the testbed was implemented using softquad (sgml rendering) and opentext (full-text search), both commercial software. the illinois dli project developers and evaluators have worked together very closely on needs assessment and usability studies. the production testbed has been evaluated since october 1997. six hundred uiuc user subjects enrolled in introductory computer science classes have used the system; and their feedback has been collected and analyzed. we expect to have collected usage data for about 1500 subjects at the end of the study. usage data consists of session observations and transaction logs. after four years of research effort, the testbed successes include: willingness to build custom encoding procedures for sgml. this enabled the testbed to succeed in federating journals from different publishers with differing use of sgml, where elsevier and oclc failed. canonical encoding for structure tags. the testbed can federate across publishers and journals. willingness to build custom software for searching. 
the illinois dli system is able to show multiple views on the web and has advanced structure search capabilities. production repositories for real publishers. the project became the r&d arm of several scientific publishers and generated valuable contributions to the scientific publishing industry. changing the nature of libraries with research. the project enables a research prototype to become a standard service capability for engineering libraries. however, in the nature of research, we also have experienced many testbed difficulties: for the user interface, the project originally planned to modify the mosaic web browser (which was developed at uiuc). when the web became commercial, mosaic was quickly taken out of the control of the developers. custom software is hard to deploy widely: the web browser interface has widespread support, but the functionality was too primitive for professional full-text search and display. several practical difficulties were found in using sgml for this project. plans to use standard brs as full-text backend had to be abandoned. it proved essential to use an sgml-specific search engine (opentext). good-quality sgml simply was not available. we had to work with every publisher, since nothing was ready for sgml publishing. sgml interactive display proved not to be of journal quality. physics requires good-quality display of mathematics, which is very difficult to achieve. as the project comes to its end, several future directions are being explored: technology transfer to publisher partners: several partners have expressed interest in receiving the testbed technologies. a contract has been developed with the american institute of physics to clone the testbed software and processing. a similar contract is under development with american society of civil engineers. testbed continuance by university library: an industrial partnership program has been established between the uiuc library and publishers. a uiuc spin-off company has been formed to provide software and future services. semantic research in illinois dli project the university of illinois dli project, through the partnership with the artificial intelligence lab at the university of arizona, has conducted research in semantic retrieval and analysis. in particular, natural language processing, statistical analysis, neural network clustering, and information visualization techniques have been developed for different subject domains and media types. key results from these semantic research components include: scalable semantics become feasible: statistical and neural network clustering proves useful and feasible interactively and for large-scale collections. specifically, the ai lab has developed a noun phrasing technique for concept extraction, a concept space technique for building automatic thesauri, and a self-organizing map (som) algorithm for building category maps. more details are provided below. semantic indexes for large collections: two large-scale semantic indexing simulations were performed in 1996 and 1997, respectively. we analyzed 400,000 inspec abstracts and 4,000,000 compendex abstracts to generate about 1,000 engineering-specific concept spaces (automatic thesauri) using the ncsa supercomputers (convex exemplar and sgi origin 2000). results of such computations could be used for semantic retrieval and vocabulary switching across domains [4]. 
an example in this section we present an example of selected semantic retrieval and analysis techniques developed by the university of arizona artificial intelligence lab (ai lab) for the illinois dli project. for detailed technical discussion, readers are referred to [2, 4]. a textual semantic analysis pyramid was developed by the ai lab to assist in semantic indexing, analysis, and visualization of textual documents. the pyramid, as depicted in figure 1, consists of 4 layers of techniques, from bottom to top: noun phrase indexing, concept association, automatic categorization, and advanced visualization. figure 1: a textual semantic analysis pyramid   noun phrase indexing: noun phrase indexing aims to identify concepts (grammatically correct noun phrases) from a collection for term indexing. known as the az noun phraser, the program begins with a text tokenization process to separate punctuation and symbols, followed by part-of-speech tagging (post) using variations of the brill tagger and 30-plus grammatical noun phrasing rules. figure 2 shows an example of tagged noun phrases for a simple sentence. for example, "interactive navigation" is a noun phrase that consists of an adjective (a) and a noun (n). figure 2: tagged noun phrases   concept association: concept association attempts to generate weighted, contextual concept (term) associations in a collection to assist in concept-based associative retrieval. it adopts several heuristic term weighting rules and a weighted co-occurrence analysis algorithm. figure 3 shows the associated terms for "information retrieval" in a sample collection of darpa/ito program project reports, including term phrases (tp) such as "ir system," "information retrieval engine," and "speech collection." figure 3: associated terms for "information retrieval"   automatic categorization: a category map is the result of performing a neural network-based clustering (self-organizing) of similar documents followed by automatic category labeling. documents that are similar to each other (in noun phrase terms) are grouped together in a neighborhood on a two-dimensional display. as shown in the colored jigsaw-puzzle display in figure 4, each colored region represents a unique topic that contains similar documents. topics that are more important often occupy larger regions. by clicking on each region, a searcher can browse documents grouped in that region. an alphabetical list that is a summary of the 2d result is also displayed on the left-hand side of figure 4, e.g., adaptive computing system (13 documents), architectural design (9 documents), etc. figure 4: category map   advanced visualization: in addition to the 2d display, the same clustering result can also be displayed in a 3d helicopter fly-through landscape as shown in figure 5, where cylinder height represents the number of documents in each region. similar documents are grouped in a same-colored region. using a vrml plug-in (cosmo player), a searcher is then able to "fly" through the information landscape and explore interesting topics and documents. clicking on a cylinder will display the underlying clustered documents. figure 5: vrml interface for category map   discussions and future directions the techniques discussed above were developed in the context of the illinois dli project, especially for the engineering domain. the techniques appear scalable and promising. we are currently fine-tuning these techniques for collections of different sizes and domains.
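for readers who want to experiment, the following python sketch approximates the first two layers of the pyramid - noun phrase indexing and concept association - with off-the-shelf nltk components and a plain co-occurrence count; it stands in for, but is not, the az noun phraser or the ai lab's weighted co-occurrence algorithm.

```python
# rough approximation of noun phrase indexing and concept association,
# assuming nltk is installed along with its tokenizer and tagger models
# (available via nltk.download()).
from collections import Counter
from itertools import combinations
import nltk

# adjective(s) followed by noun(s), a much simpler rule set than the
# 30-plus rules of the az noun phraser
GRAMMAR = "NP: {<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(GRAMMAR)

def noun_phrases(text):
    """return the noun phrases found in one document."""
    phrases = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(lambda t: t.label() == "NP"):
            phrases.append(" ".join(word for word, tag in subtree.leaves()).lower())
    return phrases

def concept_associations(documents):
    """count how often pairs of noun phrases co-occur in the same document;
    a real concept space would apply weighted co-occurrence analysis on top."""
    pair_counts = Counter()
    for doc in documents:
        for a, b in combinations(sorted(set(noun_phrases(doc))), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

docs = [
    "The information retrieval engine supports interactive navigation.",
    "Interactive navigation helps users explore the speech collection.",
]
for (a, b), n in concept_associations(docs).most_common(5):
    print(f"{a} <-> {b}: {n}")
```

the third layer (category maps) would then cluster the documents, for instance with a self-organizing map trained on vectors of the extracted phrases.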
significant semantic research effort for multimedia content has been funded continuously by a multi-year darpa project (1997-2000). in the new digital libraries initiative phase 2 project entitled: "high-performance digital library classification systems: from information retrieval to knowledge management," we will continue to experiment with various scalable textual analysis, clustering, and visualization techniques to automatically categorize large document collections, e.g., nlm’s medline collection (9 million abstracts) and the indexable web pages (50 million web pages). such system-generated classification systems will be integrated with human-created ontologies (nlm’s unified medical language systems and the yahoo directory). selected semantic analysis techniques discussed above have recently been integrated with several internet spider applications. in ci (competitive intelligence) spider, users supply keywords and starting urls for fetching web pages. the graphical ci spider automatically summarizes and categorizes the content in fetched pages using noun phrases and graphical concept maps. in meta spider, users supply keywords to extract web pages from several major search engines (e.g., alta vista, lycos, goto, snap). the meta spider then summarizes content in all web pages, similar to ci spider. both software programs are available for easy free download at the artificial intelligence lab web site: http://ai.bpa.arizona.edu. acknowledgments this work was funded primarily by: nsf/cise "concept-based categorization and search on internet: a machine learning, parallel computing approach," nsf iri9525790, 1995-1998. nsf/arpa/nasa illinois digital library initiative phase 1 project, "building the interspace: digital library infrastructure for a university engineering community," nsf iri9411318, 1994-1998. national center for supercomputing applications (ncsa), "parallel semantic analysis for spatially-oriented multimedia gis data," high-performance computing resources grants (peer review board), on convex exemplar and sgi origin 2000, june 1996-june 1999 (iri960001n). department of defense, advanced research projects agency (darpa), "the interspace prototype: an analysis environment based on scalable semantics," june 1997-may 2000 (n66001-97-c-8535). bibliography [1] h. chen. machine learning for information retrieval: neural networks, symbolic learning, and genetic algorithms. journal of the american society for information science, 46(3):194-216, april 1995. [2] h. chen, a. l. houston, r. r. sewell, and b. r. schatz. internet browsing and searching: user evaluations of category map and concept space techniques. journal of the american society for information science, 49(7):582-603, may 1998. [3] h. chen and d. t. ng. an algorithmic approach to concept exploration in a large knowledge network (automatic thesaurus consultation): symbolic branch-and-bound vs. connectionist hopfield net activation. journal of the american society for information science, 46(5):348-369, june 1995. [4] h. chen, b. r. schatz, t. d. ng, j. p. martinez, a. j. kirchhoff, and c. lin. a parallel computing approach to creating engineering concept spaces for semantic retrieval: the illinois digital library initiative project. ieee transactions on pattern analysis and machine intelligence, 18(8):771-782, august 1996. [5] t. defanti and m. brown. visualization: expanding scientific and engineering research opportunities. ieee computer society press, ny, ny, 1990. [6] b. s. manjunath and w. y. ma. 
texture features for browsing and retrieval of image data. ieee transactions on pattern analysis and machine intelligence, 18(8):837-841, august 1996. [7] a. t. mccray and w. t. hole. the scope and structure of the first version of the umls semantic network. in proceedings of the fourteenth annual symposium on computer applications in medical care, pages 126-130, los alamitos, ca, november 4-7 1990. institute of electrical and electronics engineers. [8] a. paepcke, s. b. cousins, h. garcia-molina, s. w. hassan, s. p. ketchpel, m. roscheisen, and t. winograd. using distributed objects for digital library interoperability. ieee computer, 29(5):61-69, may 1996. [9] g. salton. automatic text processing. addison-wesley publishing company, inc., reading, ma, 1989. [10] b. r. schatz. information retrieval in digital libraries: bringing search to the net. science, 275:327-334, january 17 1997. [11] b. r. schatz and h. chen. digital libraries: technological advances and social impacts. ieee computer, 32(2):45-50, february 1999. [12] b. r. schatz, b. mischo, t. cole, a. bishop, s. harum, e. johnson, l. neumann, h. chen and t. d. ng. federating search of scientific literature. ieee computer, 32(2):51-59, february 1999. [13] h. d. wactlar, t. kanade, m. a. smith, and s. m. stevens. intelligent access to digital video: informedia project. ieee computer, 29(5):46-53, may 1996. [14] r. wilensky. toward work-centered digital information services. ieee computer, 29(5):37-45, may 1996. copyright © 1999 hsinchun chen. doi: 10.1045/october99-chen
d-lib magazine september 2002 volume 8 number 9 issn 1082-9873 evaluation methodologies for information management systems emile l. morse national institute of standards & technology 100 bureau drive, stop 8940 gaithersburg, md 20899-8940 [email protected] abstract the projects developed under the auspices of the defense advanced research projects agency (darpa) information management (im) program are innovative approaches to tackling the hard problems associated with delivering critical information in a timely fashion to decision makers. to the extent that each of the information management systems interfaces with users, these systems must undergo testing with actual humans. the darpa im evaluation project has developed an evaluation methodology that can assist system developers in assessing the usability and utility of their systems. the key components of an evaluation plan are data, users, tasks and metrics. the darpa im evaluation project involved six im project principal investigators (pi's) who devoted a year's effort toward developing a method for getting beyond exploring and implementing systems to actually planning and performing structured, hypothesis-based evaluations of those systems. five im projects participated in this effort while a sixth im project was integrated into and evaluated within a larger effort. this article describes the component systems and the evaluation effort. introduction the darpa im program was created to "...address the traditional, and still vexing, challenge of getting critical information to those who need it in a sufficiently timely fashion that it can contribute to the quality of the decisions they make.
this problem is made more complex given the accelerating rate of scientific and technical discovery, typified by the ever-shortening time period for the doubling of information (currently estimated at 18 months). the objectives of the im program are intended to explore and evaluate potentially radical alternatives to traditional approaches to information management." — darpa im program description. from the above program description, it is clear that exploration and evaluation are complementary activities. however, it is also a fact that system developers expend more effort on the former than the latter. exploring new im approaches and developing systems to express these new ideas must precede evaluation of the systems, but it is our contention that developing an evaluation plan can and should occur in parallel with system development. this article discusses the experiences of the darpa im evaluation project team in designing and evaluating the systems described below. during the development of information management systems—or any other type of complex application—most of the effort expended is on getting the system to run, incorporating novel features, and allocating resources to accomplish project goals in a timely fashion. many times evaluation of these systems is viewed as something that can be postponed until the end of the process, but all too frequently there is no time to do the needed testing then; at other times, evaluation is not even factored into the goals of the development effort. if convenient, easy to use methods were available in an environment in which evaluation were being fostered, then evaluation might be an activity that serves as an end-point for development. by providing: flexible, well-known data collections, profiles of user populations, a classification scheme for im systems, collections of representative tasks based on system type, and metrics for measuring effectiveness, efficiency, and satisfaction, we envision that evaluation could become as integral to im system development as is documentation or any other well-accepted facet of the software development cycle. im project pi's are experts at developing innovative systems, but they are not necessarily experts at performing usability tests or other types of evaluations. if this project succeeds in identifying and standardizing the components of a good evaluation methodology, investigators in the future will be able to: select appropriate data sets and associated sets of tasks that can be accomplished with the data; determine quickly what user characteristics are important when deciding on a test population; choose metrics that have been shown to have the greatest degree of utility; know how much time has been required in other studies to perform tests similar to the ones they are contemplating. the availability of these resources should make the process of evaluating systems more manageable. the benefit to darpa and other funding agencies is that those systems for which the agencies have contracted will have been evaluated to determine one or more specific benefits of using the system. standard sets of test components have the potential to produce a win-win situation for both developer and funder. of course, there is the potential problem of developers designing to meet the implied criteria, but if the criteria are well chosen, this will add to the quality of the project products. to summarize, the goal of the darpa im evaluation project was not evaluation of systems. 
the goal was to: document the complexities of evaluation for im projects, provide road maps and warnings for future evaluators, and put the evaluation of darpa funded im projects on a sound basis. the im component projects six principal investigators for im projects already in progress were recruited to participate in the darpa im evaluation project. no attempt was made to choose the pis based on particular systems; willingness to participate and interest in the topic of evaluation were the sole criteria for project participant selection. initial brainstorming sessions were dedicated to developing a structure for the evaluation. the participants found that their projects logically fell into three categories: resource location, collaborative filtering, and sense-making. this categorization was purely ad hoc and was not constrained by any pre-existing taxonomy. after the pairings were established, one of the projects was enlisted to participate in another larger evaluation effort, leaving five groups in the evaluation project described here. each of the three category groupings is described below along with a brief description of the component projects of each category. in the remainder of this article, the system names for the component projects and the names of the investigators will be used interchangeably (note 1). resource location information seeking is often viewed as a cyclical process. the first step is to identify which collections are likely to contain the answers to the user's current query. the work of french (1-3) and gey & buckland (4) target this phase of the process. pie (french) "the personalized information environment or pie is a framework within which users may build and conduct highly customized searches on a distributed document collection of their own choosing. there are four driving principles behind the pie: customizability, efficient and effective search, controlled sharability, and privacy and security. in contrast to a typical internet search of multiple information resources, where control of which resources are searched is in the search engine's hands, a pie places the control in the user's hands. in the pie formulation, descriptions of resources are made available to users who decide which resources to include in a search. the process of resource selection is highly interactive and might involve sample searches and then selection or de-selection of resources from the user's current personalized collection. regardless of the degree of interactivity, efficient and effective search is provided within whatever context the current collection of resources defines. since a user may spend considerable effort customizing a personal resource collection, it makes sense to allow sharing of that collection in constrained ways or using pre-defined policies while maintaining whatever privacy or security constraints might be placed on particular resources or users." (2) search support for unfamiliar metadata vocabulary (gey & buckland) fred gey at the university of california at berkeley contributed to this evaluation project using his work with buckland on aligning metadata vocabularies. the basic idea is that different collections use different terms in their indexing schemes even if they refer to items or properties that a searcher would deem to be the same. for example, one source might provide 'car' as a keyword, while another referred to the concept as 'automobile'. 
in addition, the term used to index a document may be different from the term used in the underlying document. the approach taken by this project is to mine existing electronic library catalogs to create statistical mappings between vocabularies. software modules called evm's (entry vocabulary modules) are then used to enhance search by mapping from the users' ordinary language to the metadata of the digital resource. collaborative filtering the systems developed by kantor (5) at rutgers and daily & payton (6) at hrl laboratories depend on other information seekers to enrich the data available to a current questioner. many applications currently being developed have this characteristic. whether the goal is to provide subsequent searchers with the relevance ratings of prior investigators, or to put people in touch with others with similar interests, it seems advantageous to leverage prior information seeking work so that the subsequent searchers have the opportunity to explore enriched environments. antworld (kantor) antworld has been described in a number of reports (7). briefly, if a user of antworld searches the web, the antworld system invites the user to provide judgments on pages he or she finds. the combined collection of those judgments and the text of the pages becomes a representation of the user's quest. the antworld system then computes the similarity between the current user's quest and the stored representations of previous users' quests. in a two-step process, antworld finds quests most similar to the current quest, and then finds pages that were highly scored by the owners of those quests. this information is integrated to provide a composite ranking of candidate pages on the web. using the computed similarity, the antworld system then permits the current user to jump directly to those pages that received the highest collective recommendations from users whose quests were similar. packhunter (daily & payton) the packhunter collaborative tool is based on the idea that people who browse the same information spaces are likely to share common interests. once the trails are captured, they can be analyzed to help potential collaborators find each other. in addition to collaborator discovery, people who already know others with whom they need to collaborate can use a feature called collaborative browsing (cb). cb is mediated through a visualization interface that depicts a user trail as a network of nodes. during a collaborative session, the interface will highlight current user locations on paths, mark pages for others as 'interesting', allow jumping to pages pointed out by others, and enable user paths to overlap at common pages or allow the paths to be viewed independently. although the packhunter system has other features, the investigators decided that during the darpa im evaluation project they wanted to restrict evaluation to the collaborator discovery and collaborative browsing features. sense-making the fifth project, genre, is targeted for people to use during active information seeking. genre relies on both structured and semi-structured data collections. the category into which genre fits was created based on its similarity to other projects from the darpa im program. genre (sankar) "genre supports situation understanding by supporting exploration of the information space that is relevant to the situation being analyzed. genre supports exploration by helping the users relax or refine queries based on user access patterns and based on the task model. 
the query modulation is based on classifications and clusters that are learned by monitoring users' actions in sessions and based on query performance over the wan. furthermore, users can assign their own semantic categorization to these sessions. events that happen within the context of the modulated semantics of the queries are sent to the user by the system. this mode of the system sending events to the user while the user is in the midst of a query modulation session is what we call mixed-initiative exploration." (8) key issues and implementation decisions in devising a structure that could be applied to all the component projects, we started off with the idea that each of the investigators would perform an independent evaluation; each evaluation would entail the use of human subjects; the subjects for each evaluation would be domain experts; the tasks would be realistic in terms of the target users of the system; and the design of each evaluation would be based on hypothesis testing rather than on alternatives that are primarily qualitative. in practice, however, significant deviations from most of these primary goals occurred. for example: while all investigators produced evaluations of their systems, some chose to work in pairs. one pair of testers (resource location) devised an ingenious alternative to human subjects. one set of human subjects did not have domain expertise; they were college students. these and other variations from the overall plan seem not to have had a significant impact on the results of the study. the mere existence of a plan forced the investigators to develop a rationale for their modification(s), and the project group provided a forum where the changes could be debated. the following sections will present the rationale for constraining each of the factors considered in this project. each section will also detail the modifications made by the project participants. experimental designs all experimenters were expected to develop hypotheses and test them. this precluded participants from performing iterative rounds of formative usability testing. the goal was to produce summative data, i.e., data that can be described by a measure of central tendency and variation. the comparisons the teams made differed, based mainly on the system that they were testing and the features of the system the teams deemed most critical to demonstrate. since the resource location groups (gey & french) worked as a team, their hypotheses were tested together. the fundamental question they were addressing was whether augmented queries had utility for collection selection and/or document retrieval. the output from one system provided input to the other, and the ultimate results provided evidence for both information activities. both collaborative filtering groups decided to test whether their collaborative methods were superior to a similar system that did not employ collaboration. essentially, each collaborative filtering group developed a defeatured version of their system to use as a basis of comparison. finally, genre, the sense-making system, was compared with a system that was currently in place. the rest of the design employed actual analysts attached to various defense agencies as subjects, and the tasks were the things that the analysts did routinely. this comparison, on its face, is straightforward. however, working with busy analysts in demanding, real-world environments proved to be a high-risk challenge.
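as a concrete illustration of the kind of summative, hypothesis-based comparison described above (a full system versus a defeatured control), the python sketch below describes each group with a mean and standard deviation and then applies a two-sample t-test. the scores, group labels, and 0.05 threshold are invented for illustration and are not data or code from any of the im projects.

```python
# hypothetical example: comparing task scores for subjects who used the
# full system against subjects who used a "defeatured" control version.
# the numbers below are made up for illustration only.
from statistics import mean, stdev
from scipy import stats

full_system = [24, 27, 22, 26, 25, 28, 23]   # e.g., 30-question test scores
control     = [21, 20, 24, 19, 22, 23, 20]

# summative description: central tendency and variation for each group.
for name, scores in [("full system", full_system), ("control", control)]:
    print(f"{name}: mean={mean(scores):.1f}, sd={stdev(scores):.1f}, n={len(scores)}")

# hypothesis test: is the difference in means unlikely under the null
# hypothesis of no effect? (welch's t-test, unequal variances.)
t_stat, p_value = stats.ttest_ind(full_system, control, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.3f}")
if p_value < 0.05:
    print("reject the null hypothesis at the 0.05 level")
else:
    print("no significant difference detected")
```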
in summary, each participant developed testable hypotheses that addressed the key features of their systems. most of the early group discussions centered on determining how to develop worthy comparisons and the effort resulted in a variety of valid experimental designs. subjects it was assumed that the evaluations would be user-centered. problems related to algorithm implementation, system efficiency, accuracy of data sources, etc., were not what needed to be tested. the desired evaluations would answer questions like: 'can people use this system?' or 'does the system help people do their jobs better?'. 'better' could mean improved in terms of efficiency, productivity, satisfaction, timeliness, or any of a set of similar qualities. information analysts are envisioned as the users of darpa-funded information management systems. an analyst's time is valuable and access to analysts was limited. the teams and individual projects solved this problem in a variety of ways. the french & gey team resorted to an approach that required no subjects at all. they devised a testing paradigm that utilized an 'oracle' which acted as a pseudo-subject (9). the parts of the resource selection activity performed by the system alone were tested in situ, while the parts that would normally be performed by a human user were simulated statistically. the kantor project recruited not only professional information searchers, but also retired nsa analysts. this is the same approach used in the text retrieval conference (trec) studies (10). the daily/payton project used college students. although studies using non-experts might seem to stray significantly from the goals of mimicking the target users, the investigators used the study to determine the utility of various metrics in their collaborative environment. in this case, it appeared less likely that expert/novice differences would exist. lastly, the genre project planned to use actual analysts. these people were made available through personal contact with a darpa customer. a sub-issue with regard to the use of human subjects is the requirement of obtaining institutional review board (irb) permission to perform the studies. each of the teams using human subjects prepared the required documents and received approval. this process was foreign to several of the groups, and the existence of a project group that did have experience in preparing the documents was helpful. the experienced group provided templates that could be used to produce the submission materials for a variety of organizations. in summary, of the five teams that developed experimental designs, two used simulated humans, one used college students, one used not only professional information searchers, but also retired analysts, and one used currently employed analysts. collections one of the goals of this meta-study was to determine if a single data collection could be used for all evaluation efforts. it was hoped such a collection could form a core resource that could be used by later investigators. however, initial discussions showed that this goal would overly constrain the testing of the various systems. no single collection or internet resource appeared to be able to suit the strengths of all of the im projects. therefore, the darpa im evaluation project group considered alternatives that matched the needs of their particular systems. each team kept in mind that small or tailored data sets would be less desirable than larger, more flexible ones. 
the collaborative filtering teams had particularly interesting problems in selecting and conditioning data sets. since both teams intended to compare the effect of prior user interaction with the data vs. no prior value-added activity, it was necessary to consider how such a data set could be created and maintained. since each test subject would need to see the data in precisely the same state as each other test subject, static collections would have to be generated. the situation was like seeing a map of all the paths that people have taken through a landscape; during certain phases the emphasis was on who laid down footprints and where the footprints were placed, while at other times, subjects saw only the final map of tracks. the teams solved the problem of data set/collection in the following ways. the french/gey team used the ohsumed (11) data set. they found advantages to using a relatively large set of collections from this medical literature. the indexing was performed by applying the mesh (medical subject headings) indexing scheme. the systems of the collaborative filtering group most naturally address the web as a whole. however, the need to produce conditioned (pre-tracked) trails required the collaborative filtering groups to select portions of the collected documents. they prepared subsets by using a two-step process; in the first pass, one set of subjects laid down tracks and, in the second, the subjects were restricted to following those paths. the genre project managed to gain access to the actual data used by their intended analyst subjects. in summary, all the projects used large collections appropriate for the tasks the subjects/systems would be required to perform. in all cases, the participants felt that their systems were being overly constrained and that they could deal with significantly larger problem spaces. tasks the a priori vision was that the tasks subjects/systems would be asked to perform would be realistic, both in terms of the data set and in terms of the subjects/target users. the members of the resource location team took advantage of the fact that the ohsumed collection contained queries in addition to documents. once again, it is pertinent to note the similarity with the method used in trec (10). the collaborative filtering teams used two distinctly different approaches, based on their prior choices of subject population and the goals of their experimental designs. kantor's design required subjects to produce summary documents similar in content to actual analyst reports. the subjects were asked to prepare a report on one of the following topics:
anthrax: detection, prophylaxis, treatment and use as a weapon;
terrorism on the web: overtly terrorist sites; sites that covertly support or link to overt sites, under guise of charities; sites that seem to be endorsed by either of the other two kinds of sites; or
development of nuclear weapons by non-governmental organizations: reports of loss of nuclear raw materials; reports on capabilities for making weapons; issues of transporting nuclear weapons to the target locations.
daily & payton were interested in developing valid metrics for evaluating systems. since they were using students as their subjects, they chose to use tasks with which their subjects would feel comfortable (jazz/louis armstrong, sports/babe ruth, french impressionism/edouard manet, film/charlie chaplin). they prepared questionnaires that would probe their subjects' knowledge of collected material.
the methods used by their subjects were enforced by the system and its interface, and would be the same even in the hands of domain experts. it seems reasonable to assume that similar results would be obtained if experts were tested in more realistic environments. finally, the sense-making genre system employed not only real data and real analysts but also used the actual tasks analysts were called upon to perform with their current, non-genre tool. results the results of this meta-study are not the results of the individual projects but rather descriptions of the various studies that were devised and, in 4 out of 5 instances, performed. the following table shows how the five project studies were structured.
table 1: summary of evaluation components
project | experimental design | subjects | collections | tasks | measures
french | retrieval with base and augmented queries | - | ohsumed | ohsumed queries | precision; merit
gey | free text vs. augmented queries | - | ohsumed | ohsumed queries |
kantor | antworld vs. null system | retired intel analysts and reference librarians | web | prepare reports on timely topics | subjects rate reports of others on multiple criteria
daily/payton | packhunter with and without collaboration feature | college students | web | collect documents relevant to a topic area; answer questions in an open-book format | performance on 30-question test
sankar | current system vs. genre | u.s. pacific command (pacom)/joint operation planning and execution system (jopes) personnel | time-phased force and deployment data (tpfdd) | tpfdd query and modification | subject ratings of 'ease of use'; time to completion
the experimental designs, subjects, collections and tasks have been described in the previous sections of this article. the measures shown in the final column of the table above were not controlled by the study's design but were chosen by each investigator based on other variables. the measures were selected to test the specific hypotheses and were shown to be sensitive in the comparisons that were made. french & gey used the classical precision metric for assessing retrieval performance. the use of merit to evaluate collection selection has been discussed previously (12). kantor's goal was to determine if subjects using antworld would produce better reports than those who used the control analogue. he chose to have his subjects rate each other's reports. another alternative would have been to find independent raters. however, the analysts who prepared the reports were undoubtedly more knowledgeable about the topics than a board of independent raters would have been. as mentioned previously, daily & payton used this study as an opportunity to evaluate metrics. they asked their subjects to collect pages that they believed were useful for the topic they had been assigned. they gathered materials using packhunter's collaborative features or using the control interface that did not provide collaboration. later, they tested their subjects' knowledge of the topic by administering a 30-question test; subjects were permitted to refer to their collected materials. subjects who used packhunter scored higher than subjects who did not. the test results showed that the measure was sensitive enough for their comparison. sankar planned to use classic usability metrics — efficiency and satisfaction. of the six projects initially recruited, the five described here developed full-scale evaluation plans. all but the genre project went on to perform the study detailed in the plan. the results of the resource location team have been published (9).
a detailed description of kantor's study is available (13). the genre study was aborted due to the events of sept 11, 2001. the personnel who were scheduled to take part in the study received new, high-priority orders that precluded their participation. the goal of this meta-study was to encourage and support the development of evaluation protocols. we believe that all five projects succeeded with respect to the meta-study. conclusions the key observation of this meta-study is that evaluation of complex information management systems is not only possible but also feasible. further, evaluations can be performed by people with widely divergent backgrounds in designing experimental protocols. the evidence for these observations and the following conclusions is based on information provided by the investigators of the component projects. why was it so easy to get to the evaluation phase? the best explanation is that the investigators were given sufficient resources to devote significant effort to evaluation activities. it is highly likely that unless pressure is brought to bear on system developers to perform serious testing, there will never be enough time or money in the budget to arrive at a final assessment of the usefulness, utility or usability of systems. perhaps the best advice for program managers is that if they are truly interested in having systems evaluated, they should require a plan for system evaluation as a separately funded project stage and then require the proof to be delivered. a new balance needs to be achieved between "explore and evaluate" as stated in the quote in the introduction. contributing to the success of this project were the use of a team approach and the formation of initial pairings. collaboration within the larger group and smaller teams kept the level of discussion high. the investigators shared what they knew to the advantage of all project participants. the project environment was non-threatening, and less experienced members asked questions easily. the groups provided a forum that fostered creativity but could be tough on approving design modifications. my personal impression is that we worked in much the same way that a doctoral seminar group works — by being critical yet supportive. the project groups made progress around obstacles that sometimes can kill the best intentions. one example was in the templates for irb forms. the participants who were accustomed to filling out the many pages of required documentation offered samples for others to tailor. consequently, reviews were handled expeditiously and without delay. networking with the trec researchers at nist to access the retired analysts used in the kantor study provides another example of how an obstacle was overcome. although the individual projects were successful, we did not discover a magic bullet that will solve all the problems in 'getting to evaluation'. somewhat contrary to expectations, toolkits of interchangeable data sets, user profiles, study designs, task collections, and metrics were not developed. instances of each of these are available by contacting the author and/or the individual pi's. the interdependence of the factors makes it hard to envision the performance of only 'clean' studies, i.e., designs composed of large, recent, well defined data sets tested with highly motivated domain experts using timely, significant tasks and measured with numerous, high-quality metrics. 
what we know from doing this study is: it isn't necessary to perform a perfect test, and high-quality testing is within the capabilities of the research teams who develop the systems. with proper management, motivation, and support, program managers can ensure that effective evaluation will be a part of any project for which it is appropriate. acknowledgments this work was supported by darpa agreement #k928. the conclusions are not necessarily those of darpa. the author is indebted to the investigators of the im projects tested in this study. my thanks to michael buckland, mike daily, jim french, fred gey, paul kantor, dave payton, and sankar virdhagriswaran for thoughtful discussions, their boundless enthusiasm and lots of hard work. references [1] personalized information environments. [2] personalized information environments, explanation. [3] j. c. french and c. l. viles. "personalized information environments: an architecture for customizable access to distributed digital libraries," d-lib magazine 5(6), june 1999. [4] sims metadata research program. [5] antworld papers. [6] packhunter. [7] how the antworld works. [8] genre presentation. [9] j. c. french, a. l. powell, f. gey, and n. perelman, "exploiting a controlled vocabulary to improve collection selection and retrieval effectiveness," tenth international conference on information and knowledge management (cikm 2001), nov. 2001, pp. 199-206. [10] e. m. voorhees and d. harman, overview of trec 2001. [11] w. hersh, c. buckley, t. j. leone, and d. hickam. "ohsumed: an interactive retrieval evaluation and new large test collection for research." in proc. acm sigir '94, pp. 192-201, 1994. [12] j. c. french and a. l. powell. "metrics for evaluating database selection techniques." world wide web, 3(3), 2000. [13] p. b. kantor, y. sun, and r. rittman. prototype for evaluating a complex collaborative information finding system for the world-wide web: evaluation of the antworld system, final report. note [note 1] pie = french; search support for unfamiliar metadata vocabulary = gey; antworld = kantor; packhunter = daily & payton; genre = sankar. (30 september 2002, the following corrections have been made to this article: a url for reference 3 was added, and in reference 12, the first initial (j) was added to j.c. french's name.) doi: 10.1045/september2002-morse
d-lib magazine march/april 2014 volume 20, number 3/4 digital preservation file format policies of arl member libraries: an analysis kyle rimkus, thomas padilla, tracy popp and greer martin, university of illinois at urbana-champaign. point of contact for this article: kyle rimkus, [email protected] doi:10.1045/march2014-rimkus abstract whether overseeing institutional repositories, digital library collections, or digital preservation services, repository managers often establish file format policies intended to extend the longevity of collections under their care. while concerted efforts have been made in the library community to encourage common standards, digital preservation policies regularly vary from one digital library service to another.
in the interest of gaining a broad view of contemporary digital preservation practice in north american research libraries, this paper presents the findings of a study of file format policies at association of research libraries (arl) member institutions. it is intended to present the digital preservation community with an assessment of the level of trust currently placed in common file formats in digital library collections and institutional repositories. beginning with a summary of file format research to date, the authors describe the research methodology they used to collect and analyze data from the file format policies of arl library repositories and digital library services. the paper concludes with a presentation and analysis of findings that explore levels of confidence placed in image, text, audio, video, tabular data, software application, presentation, geospatial, and computer program file formats. the data show that file format policies have evolved little beyond the document and image digitization standards of traditional library reformatting programs, and that current approaches to file format policymaking must evolve to meet the challenges of research libraries' expanding digital repository services.   background nearly twenty years ago in their seminal publication preserving digital information: report of the task force on archiving of digital information, waters and garret wrote on the important role trusted file formats would soon go on to play in the burgeoning field of digital preservation: another migration strategy for digital archives with large, complex, and diverse collections of digital materials is to migrate digital objects from the great multiplicity of formats used to create digital materials to a smaller, more manageable number of standard formats that can still encode the complexity of structure and form of the original (waters and garret, 1996, 28). indeed, the identification of such "standard formats" would soon begin to occupy the attention of many information professionals working in digital libraries. risk management of digital information: a file format investigation, for example, details cornell university library's efforts, in the late 1990s, to develop file format migration policies based on principles of risk management. while the report's authors note that at the time of the study itself, few cultural memory organizations were even willing to risk endorsing specific file formats (lawrence, et al., 2000, 1), many of the file formats singled out for preservation purposes during this period, particularly those intended for use in digital reformatting efforts, are still held in high regard today. as a case in point, the united states national archives and records administration (nara) published a best practices document in 1998 for its electronic access project to digitize selected archival materials for online access that endorsed the tagged image file format (tiff) for production master files—a recommendation echoed to this day by many practitioners in the field of digital library imaging (rieger, 2008). similarly, the standard generalized markup language (sgml) and its successor the extensible markup language (xml) began to garner trust in text encoding circles (cohen and rozenzweig, n.d.), while waveform audio, or the wave file format gained traction for use in digital audio preservation (bamberger and brylawski, 2010). why then are some file formats considered better-suited to preservation than others? 
open file formats are generally preferred to closed, proprietary formats because the way they encode content is transparent. on the other hand, adoption of a proprietary file format by a broad community of content creators, disseminators and users, is often considered a reliable indicator of that format's longevity. additional qualities such as complexity, the presence of digital rights management controls, and external dependencies are also seen as relevant factors to consider when assessing file formats for preservation (rog and van wijk, 2008, 3-4). there is, however, no failsafe formula for file format policy decisions. while stanford university prototyped an empirical walker to combine machine-automated and human assessments of file formats in use in their own digital preservation repository (anderson, 2005), and the online computer library center (oclc) developed the inform methodology to assess the long-term reliability of file formats considered for use in digital preservation environments (stanescu, 2005), how to weigh the relative value of the preservation qualities of file formats often differs from one institution to another. these considerations came to the fore with the advent of institutional repositories in the early 2000s. as institutional repository managers sought to strike a balance between lowering barriers to deposit and acquiring content that would stand the test of time, they often expressed their file format policies, in contrast to the prescriptive requirements of digitization guidelines, as recommendations. the implementation of the dspace institutional repository software platform at its original home institution, the massachusetts institute of technology (mit), is an excellent example. its policy differentiates file formats by the categories of "supported," "known," and "unsupported" (mit libraries, 2013). likewise, the illinois digital environment for access to learning and scholarship (ideals) at the university of illinois at urbana-champaign categorizes file formats as "highest confidence-full support," "moderate confidence-intermediate support," and "low confidence-basic preservation only" (illinois digital environment for access to learning and scholarship, 2013). elaborating the specific terms of preservation services to a designated community of users is a key concept in the open archival information system (oais) specification (consultative committee for space data systems, 2002) and the framework for a trusted digital repository (tdr) (consultative committee for space data systems, 2011) and its predecessor trusted repository audit certification (trac) (dale and ambacher, 2007). widespread knowledge of these and similar frameworks and models has spurred the development, in certain quarters, of repository services built expressly for the digital preservation function. of all digital library services, these repositories generally feature the most carefully conceived file format policies of all. the florida digital archive (fda), a digital preservation service available to all libraries affiliated with the state university system in florida, provides its users with detailed action plans for specific file formats, as well as assessments of high, medium, or low file format confidence levels. in addition, the fda developed extensive guidelines related to preservation risks such as encryption, password protection, compression, proprietary fonts, and digital rights management controls (florida virtual campus, 2013). 
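the factors just discussed (openness, breadth of adoption, complexity, digital rights management controls, external dependencies) are typically combined into an overall judgment, and, as noted above, institutions weigh them differently. the toy python sketch below shows one way such a weighted assessment could be expressed; the weights, ratings, threshold idea, and function name are invented for illustration and do not come from inform, the empirical walker, or any institution's actual policy.

```python
# toy weighted file-format assessment -- the factors come from the discussion
# above, but the weights and ratings below are invented for illustration.
WEIGHTS = {
    "openness": 0.30,          # transparent, documented encoding
    "adoption": 0.30,          # breadth of use by creators and users
    "no_drm": 0.20,            # absence of digital rights management controls
    "low_complexity": 0.10,
    "no_external_deps": 0.10,
}

def assessment_score(ratings):
    """combine per-factor ratings (0.0-1.0) into a single weighted score."""
    return sum(WEIGHTS[factor] * ratings.get(factor, 0.0) for factor in WEIGHTS)

hypothetical_tiff = {"openness": 1.0, "adoption": 1.0, "no_drm": 1.0,
                     "low_complexity": 0.8, "no_external_deps": 1.0}
print(f"hypothetical tiff score: {assessment_score(hypothetical_tiff):.2f}")
# a repository might map scores above some locally chosen threshold to
# full ("high confidence") support and the rest to bit-level support only.
```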
this trajectory would suggest that digital library file format policies have become more expansive over time to meet the changing needs of evolving repository services. but is this borne out by today's digital preservation file format policies in research libraries? this question, informed by the trends in file format development summarized above, furnished the focal point of this study.   methodology by gathering and assessing data on the level of confidence currently placed in file formats by member libraries of the arl, a non-profit organization of north american academic research libraries whose membership requirements include institutional commitment to sustaining significant research collections, including those in digital format (association of research libraries, 2013), this study seeks to contribute evidence of value to the profession's evolving discussion of best practices in digital preservation. the paper's authors collected data from october 2012-june 2013, and began by identifying a data model to reflect the terms and relationships designated below and fully defined in appendix i:
each arl library has zero or more instances of a repository or digital library service.
each repository or digital library service may enforce no more than one file format policy.
each file format policy must include one or more file format(s).
each file format must belong to a file format type of the category application, audio, computer program, geospatial, image, presentation, spreadsheet/database, text/document, or video.
each file format in a file format policy is supported at a confidence level of high confidence or moderate confidence.
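the statements above translate almost directly into a small schema. the python sketch below restates that data model with dataclasses; the class and field names follow the paper's terms, but the code itself is illustrative and is not the locally developed database the authors describe.

```python
# illustrative sketch of the data model described above -- not the authors'
# locally developed database.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class FormatType(Enum):
    APPLICATION = "application"
    AUDIO = "audio"
    COMPUTER_PROGRAM = "computer program"
    GEOSPATIAL = "geospatial"
    IMAGE = "image"
    PRESENTATION = "presentation"
    SPREADSHEET_DATABASE = "spreadsheet/database"
    TEXT_DOCUMENT = "text/document"
    VIDEO = "video"

class Confidence(Enum):
    HIGH = "high confidence"
    MODERATE = "moderate confidence"

@dataclass
class FileFormat:
    name: str
    extensions: List[str]
    format_type: FormatType
    confidence: Confidence              # level assigned within one policy

@dataclass
class FileFormatPolicy:
    formats: List[FileFormat]           # one or more formats per policy

@dataclass
class RepositoryService:
    name: str
    policy: Optional[FileFormatPolicy] = None   # at most one policy

@dataclass
class ARLLibrary:
    name: str
    services: List[RepositoryService] = field(default_factory=list)  # zero or more
```

note that in this sketch the confidence level attaches to a format entry within a particular policy rather than to the format itself, which is what allows the same format to receive different levels of support from different repositories.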
using a locally developed database designed to reflect the data model described above, the authors followed this process:
for each arl library in the official list of 175 institutions, they browsed websites to identify every repository or digital library service with an online presence;
next, they browsed the websites of each repository or digital library service in search of public documentation on file format policies;
finally, for each accepted file format identified in a file format policy, they assigned a confidence level based on the policy's wording: high confidence for file formats whose encoded content was guaranteed functional preservation, and medium confidence for file formats only guaranteed bit-level preservation or designated as acceptable but not preferred (a full explanation of these distinctions is available in appendix i: definitions of terms).
for those repository or digital library services without readily available public documentation, the authors requested information via email from a service manager identified on the library website. this approach afforded the authors a comprehensive view of exactly how much digital preservation policy information institutions are making available on their websites; their findings are summarized below.   findings a spreadsheet of the file format policy data collected for this paper is available at appendix ii of this paper. the data were drawn from the file format policies of 118, or 51%, of the 253 arl repository or digital library services identified by the authors following the methodology described above. they discovered 73 of these file format policies on publicly available websites, whereas 45 were provided to them by repository managers in response to direct email queries. 174 file formats appear in these 118 policies. by type, they break down into the categories application (14), audio (19), computer programs (17), geospatial (6), image (28), presentation (10), spreadsheet/database (28), text/document (36), and video (15). the five most commonly occurring file formats in all policies (see table 1 for more information) are the tagged image file format (extension tiff, or tif) (115), the waveform audio file format (wav) (80), the portable document format (pdf) (74), jpeg (jpg, jpeg) (70), and plain text document (txt, asc) (69). the five most frequently occurring file formats given high confidence in all policies are the tagged image file format (tiff, tif) (88), plain text document (txt, asc) (52), the portable document format (pdf) (49), the waveform audio file format (wav) (47), and the extensible markup language (xml) (47). the five most frequently occurring file formats given medium confidence in all policies are quicktime (mov, qt) (47), microsoft excel (xls) (39), microsoft word (doc) (38), microsoft powerpoint (ppt) (38), and realaudio (ram, ra, rm) (35).
table 1. top 15 file formats listed by occurrence
format name | extension | format type | medium confidence | high confidence | total occurrence in policies
tagged image file format | tiff, tif | image | 27 | 88 | 115
waveform audio file format | wav | audio | 33 | 47 | 80
portable document format | pdf | document | 25 | 49 | 74
jpeg | jpg, jpeg | image | 28 | 42 | 70
plain text document | txt, asc | document | 17 | 52 | 69
jpeg 2000 | jp2 | image | 23 | 45 | 68
quicktime | mov, qt | video | 47 | 20 | 67
extensible markup language | xml | document | 16 | 47 | 63
audio interchange file format | aif, aifc, aiff | audio | 14 | 45 | 59
portable network graphics | png | image | 14 | 39 | 53
graphics interchange format | gif | image | 22 | 31 | 53
hypertext markup language | html, htm | document | 18 | 33 | 51
microsoft word | doc | document | 38 | 8 | 46
microsoft powerpoint | ppt | presentation | 38 | 6 | 44
microsoft excel | xls | spreadsheet/database | 39 | 5 | 44
using the data referenced above, the authors used a simple calculation to assign levels of relative confidence to file formats. this number, expressed as a percentage, was arrived at by subtracting the number of moderate confidence recommendations from the number of high confidence recommendations for a particular file format, and then dividing the difference by the total number of recommendations for that format. if the resultant percentage is positive, it indicates a greater proportion of high confidence recommendations relative to moderate confidence recommendations for a given file format. to weed out false positives, this percentage was only calculated for file formats that appear in at least 10 policies. the five file formats with the highest relative confidence values (table 2) are comma separated values (csv) (73%), the machine readable cataloging record (marc) (68%), the tagged image file format (tiff, tif) (53%), the audio interchange file format (aif, aifc, aiff) (53%), and plain text document (txt, asc) (51%).
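the relative-confidence calculation described above is simple enough to restate in a few lines of python. the function name and the small dictionary of counts are ours; the counts are taken from table 1, and the printed values match the corresponding entries in tables 2, 3, and 5.

```python
# relative confidence = (high - medium) / total occurrences, expressed as a
# percentage and only reported for formats appearing in at least 10 policies.
# counts below are a few rows from table 1; the function name is ours.
def relative_confidence(high: int, medium: int, min_policies: int = 10):
    total = high + medium
    if total < min_policies:
        return None  # weed out formats with too few recommendations
    return round(100 * (high - medium) / total)

counts = {
    "tagged image file format": (88, 27),
    "quicktime": (20, 47),
    "microsoft word": (8, 38),
}
for fmt, (high, medium) in counts.items():
    print(f"{fmt}: {relative_confidence(high, medium)}%")
# tagged image file format: 53%, quicktime: -40%, microsoft word: -65%
```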
table 2. file formats identified with positive relative confidence values
file format name | occurrence in policies expressing medium confidence | occurrence in policies expressing high confidence | relative confidence
comma separated values | 3 | 19 | 73%
machine readable cataloging record | 3 | 16 | 68%
audio interchange file format | 14 | 45 | 53%
tagged image file format | 27 | 88 | 53%
plain text document | 17 | 52 | 51%
extensible markup language | 16 | 47 | 49%
portable document format/a | 9 | 25 | 47%
portable network graphics | 14 | 39 | 47%
motion jpeg 2000 | 4 | 10 | 43%
portable document format | 25 | 49 | 32%
jpeg 2000 | 23 | 45 | 32%
hypertext markup language | 18 | 33 | 29%
postscript | 11 | 19 | 27%
jpeg | 28 | 42 | 20%
waveform audio file format | 33 | 47 | 18%
graphics interchange format | 22 | 31 | 17%
rich text format | 18 | 24 | 14%
regarding this paper's central question of the effect expanding digital library services have had on digital preservation file format policies, the data show that practitioners place high levels of confidence in trusted formats for documents and images with origins in library reformatting programs. these categories feature eight and five formats with positive relative confidence values, respectively (tables 3 and 4).
table 3. document file formats with more than 10 occurrences
file format name | extension | relative confidence
machine readable cataloging | marc | 68%
plain text document | txt, asc | 51%
extensible markup language | xml | 49%
portable document format/a | pdf/a | 47%
portable document format | pdf | 32%
hypertext markup language | html, htm | 29%
postscript | ps, eps, ai | 27%
rich text format | rtf | 14%
microsoft word xml | docx | -9%
microsoft word | doc | -65%
standard generalized markup language | sgml, sgm | -67%
tex dvi | dvi | -92%
latex | latex | -100%
tex | tex | -100%
wordperfect | wpd | -100%
microsoft visio | vsd | -100%
table 4. image formats with more than 10 occurrences
file format name | extension | relative confidence
tagged image file format | tiff, tif | 53%
portable network graphics | png | 47%
jpeg 2000 | jp2 | 32%
jpeg | jpg, jpeg | 20%
graphics interchange format | gif | 17%
bitmap image file | bmp | -88%
photo cd | cd | -100%
by contrast, digital preservation managers appear to take a less generous view of file format types that do not have their roots in longstanding library digitization efforts. the categories of application, computer program, geospatial, and presentation files do not count a single format among them with a positive relative confidence value. the spreadsheet/database and video categories (table 5) have one positively ranked file format each. for audio formats, there are only two.
table 5. video formats with more than 10 occurrences
file format name | extension | relative confidence
motion jpeg 2000 | mj2 | 43%
avi | avi | 0%
mpeg | mpeg, mpg, mpe | -20%
mpeg-2 | mp2 | -37%
quicktime | mov, qt | -40%
mpeg-4 part 14 with h.264 codec | mp4/h.264 | -50%
mpeg-1 | mp1 | -67%
flash video | flv, f4v | -86%
windows media video | wmv | -88%
these results point to a common compromise repository managers make for file formats they are not accustomed to managing within internal digital production workflows, namely, guaranteeing them "bit-level" preservation storage without implying that the content their files encode will stand the test of time.
table 6. top 15 file formats listed by order of occurrence in policies expressing medium confidence
file format name | extension | file format type | occurrence in policies expressing medium confidence
quicktime | mov, qt | video | 47
microsoft excel | xls | spreadsheet/database | 39
microsoft word | doc | document | 38
microsoft powerpoint | ppt | presentation | 38
realaudio | ram, ra, rm | audio | 35
waveform audio file format | wav | audio | 33
mpeg-3 | mp3 | audio | 31
bitmap image file | bmp | image | 31
adobe photoshop | psd, pdd | image | 29
jpeg | jpg, jpeg | image | 28
au | au, snd | audio | 28
tagged image file format | tiff, tif | image | 27
latex | latex | document | 26
tex | tex | document | 26
wordperfect | wpd | document | 26
the implications of these findings are explored in the next section of this paper.   conclusions the data gathered in the course of this study would suggest that, as of mid-2013, research library professionals in north america appear to trust only 18 file formats in all (table 2). the numbers, however, only tell part of the story. in the course of their research, the authors learned as much from the data gathering process—reviewing the way file format policies were expressed online or the way that repository managers described their approach to file format management in emails—as from the data themselves. despite the intense focus on digital preservation in recent years, for instance, only a meager number of repositories have taken the step of formulating thorough file format policies. in addition to the example furnished by the florida digital archive cited above, deep blue at the university of michigan ("deep blue preservation and format support", 2013), boston university's digital preservation policy (boston university, 2013), and the university of minnesota's digital conservancy ("university of minnesota digital conservancy", 2013) furnish examples of thoughtfully conceived approaches to file format policymaking. it is also clear that many institutions are relying on the judgment of perceived experts to inform their own file format policy decisions, and that they are looking in particular to the creators of broadly adopted repository management software platforms for guidance. the massachusetts institute of technology (mit), home to development of the open source dspace institutional repository software prior to the establishment of the duraspace not-for-profit, is a case in point. numerous repository managers identified in this study either referred to mit's file format policies (mit libraries, 2013) as those they had adopted for their own use, or presented actual charts and terminology breaking down file format policies in a manner nearly identical to the mit dspace model. comments made by repository managers during the data gathering period would imply that archivematica is poised to play a similar role for the growing number of institutions that deploy it. archivematica is an open source suite of digital preservation "microservices" that enables collection managers to oversee such digital preservation actions as file format normalization and the management of content in accordance with the oais concept of submission, archival, and dissemination information packages (archivematica, 2013). several digital preservation managers referred to archivematica's ongoing file format policy registry and associated migration paths as the policies they intended to adopt at their own institutions.
as far as the future of digital preservation policy management is concerned, it bears emphasis that contemporary file format policies are very much rooted in relatively small-scale data management practices: stewarding files through digitization workflows, for example, or curating a university's research publications. in many cases, bit-level preservation services are offered to obviate the need to make hard decisions about unappealing file formats. the realaudio format, for example, appears 35 times in all identified format policies, but is promised exclusively moderate confidence, or bit-level support, without a single high confidence rating. this is not to be read as an endorsement of realaudio as a preservation file format so much as an acknowledgment that realaudio files exist within many academic libraries' designated communities of users, and that 35 repositories have taken it upon themselves to preserve them as-is.

bit-level support, however, is not necessarily a vote of confidence in the preservation characteristics of a file format. especially in the case of institutional repositories, the provision of a storage service for all commonly encountered file types is more often than not a recognition that file format use frequently extends well beyond a short list of preferred archival formats. in this respect, bit-level support for everything that comes into a given repository implies a compromise with a social reality rather than a hard-line application of digital preservation format assessment methodologies.

it is instructive to view these trends in light of recent research from the world of large-scale, long-term web archiving. in "formats over time: exploring uk web history", andrew n. jackson presents a file format analysis of 2.5 billion resources harvested in the .uk domain for the archives of the united kingdom, concluding that "most file formats last much longer than five years, that network effects appear to stabilise formats, and that new formats appear at a stable, manageable rate" (jackson, 2012, 4). in particular, jackson's research highlights the persistence on the web of image formats such as jpeg, tiff, png, and gif (all of which rank highly in this study), while pointing to the decline and near disappearance over time of the once common x bitmap (xbm) format, which, interestingly, does not figure at all in any known arl policies.

despite the web's importance as an indicator of file formats trusted for sharing access to digital information, it also conceals an entire world of digital content production from view. to stay with image formats, few photographers or graphic designers begin their work in the gif, png, or jpeg format, even if these are what they eventually use for the web distribution of their images. rather, their files more frequently begin their lives in proprietary production master formats such as raw, digital negative (dng), or photoshop document (psd). as libraries and archives acquire more born-digital materials, such production file formats are likely to be found in collections of electronic records, along with a broad variety of other file formats saved on donors' hard drives. this is similar to the challenges libraries and their collaborators in information technology face as they articulate strategies to effectively steward scientific data and the broad variety of files produced throughout the research process across disciplines.
the way that managers of these emergent services craft their own file format policies will certainly have a significant influence on the future of digital preservation planning. these looming frontiers notwithstanding, traditional notions of file format recommendations in libraries are already coming under scrutiny. de vorsey and mckinney, writing about the digital collections stewarded by the national library of new zealand, take issue with efforts to anoint certain file formats as "archival." in practice, they observed considerable variance between specimens of even the most common "preservation" file formats, variance most often resulting from differing interpretations of format standards by the software that encoded them. as a result, they advocate shifting the focus away from file formats per se, and instead matching file profiles against application profiles to determine an institution's ability to provide access to content:

our experience with new zealand's documentary heritage is that files contain multifarious properties. these are based on the world of possibilities that the format standard describes, but can also include non-standard properties. the range of possibilities and relationships between them is such that it is quite meaningless to purely measure a file's adherence to a format standard (de vorsey and mckinney, 2010, 43).

such developments would suggest that the already challenging prospect of file format policymaking for research library collections is about to become even more daunting. at present, arl member file format policies largely reflect a high level of confidence in a limited number of file formats used in library digitization programs and the web transmission of scholarly communication. outside of these file formats, however, policies indicate a much lower level of confidence in their respective repositories' abilities to provide adequate preservation services for file formats in the categories of application, computer program, geospatial, and presentation, and, to a lesser extent, audio, tabular data, and video. as libraries and archives set their sights on collections of heterogeneous files such as born-digital electronic records and research data, this shift is expected to spur further evolution not only in the file formats that appear in digital preservation policies, but also in the way file format policies are articulated and implemented.

acknowledgements

the authors wish to acknowledge the research and publication committee of the university of illinois at urbana-champaign library, which provided support for the completion of this research.

references

[1] anderson, richard, hannah frost, nancy hoebelheinrich, and keith johnson. 2005. "the aiht at stanford university: automated preservation assessment of heterogeneous digital collections." d-lib magazine 11 (12) (december): 10. http://doi.org/10.1045/december2005-johnson
[2] archivematica. 2013. "format policy registry requirements." archivematica wiki.
[3] association of research libraries. 2013. "membership." arl.
[4] bamberger, rob, and sam brylawski. 2010. "the state of recorded sound preservation in the united states: a national legacy at risk in the digital age". council on library and information resources and the library of congress.
[5] boston university. 2013. "bu libraries digital preservation policy."
[6] cohen, daniel j., and roy rosenzweig. 2013. "digital history: a guide to gathering, preserving and presenting the past on the web."
[7] consultative committee for space data systems. 2002. "reference model for an open archival information system (oais)". ccsds secretariat.
[8] consultative committee for space data systems. 2011. audit and certification of trustworthy digital repositories: recommended practice. recommended practice issue 1. washington, dc: ccsds secretariat.
[9] dale, robin l., and bruce ambacher. 2007. "trusted repositories audit & certification: criteria and checklist". chicago: online computer library center and the center for research libraries.
[10] de vorsey, kevin, and peter mckinney. 2010. "digital preservation in capable hands: taking control of risk assessment at the national library of new zealand." information standards quarterly 22 (2): 41-44.
[11] "deep blue preservation and format support." 2013.
[12] derrot, sophie, louise fauduet, clément oury, and sébastien peyrard. 2013. "preservation is knowledge: a community-driven preservation approach." in ipres2012: proceedings of the 9th international conference on preservation of digital objects, 11-18. toronto, on, canada: university of toronto faculty of information.
[13] florida libraries virtual campus. 2013. "florida digital archive: fda file preservation strategies by format."
[14] "global digital format registry (gdfr)." 2013. sourceforge.
[15] illinois digital environment for access to learning and scholarship. 2013. "file format recommendations." ideals wiki.
[16] jackson, andrew n. 2012. "formats over time: exploring uk web history". arxiv e-print 1210.1714.
[17] kenney, anne r. 1996. digital imaging for libraries and archives. ithaca, n.y.: dept. of preservation and conservation, cornell university library.
[18] lawrence, gregory w., et al. 2000. risk management of digital information: a file format investigation. washington, d.c.: council on library and information resources.
[19] mit libraries. 2013. "format support." dspace: mit libraries.
[20] pearson, d., and c. webb. 2008. "defining file format obsolescence: a risky journey." national library of australia staff papers.
[21] premis editorial committee. 2012. "premis data dictionary for preservation metadata: version 2.2."
[22] rieger, oya y. 2008. "preservation in the age of large-scale digitization: a white paper". washington, d.c.: council on library and information resources.
[23] rog, j., and c. van wijk. 2008. "evaluating file formats for long-term preservation." koninklijke bibliotheek 2: 12-14.
[24] stanescu, andreas. 2005. "assessing the durability of formats in a digital preservation environment: the inform methodology." oclc systems & services 21 (1) (march 1): 61-81. http://doi.org/10.1108/10650750510578163
[25] "unified digital format registry (udfr)." 2012.
[26] united kingdom national archives. 2013. "pronom." the technical registry pronom.
[27] "university of minnesota digital conservancy." 2013.
[28] waters, donald, and john garrett. 1996. "preserving digital information: report of the task force on archiving of digital information". the commission on preservation and access and the research libraries group.
[29] wheatley, paul. 2004. "institutional repositories in the context of digital preservation." microform & imaging review 33 (3) (january): 135-146. http://doi.org/10.1515/mfir.2004.135

appendix i: definitions of terms

arl library
one of the 125 member libraries listed in the online membership directory of the association of research libraries (association of research libraries, 2013) during this study's data collection period of october 2012-june 2013.
repository or digital library service
any digital library repository or production unit that serves the preservation planning function of recommending file formats for the long-term viability of digital content. this includes institutional repositories that manage digital items submitted by a community of external users, often research faculty, for long-term access; digital production units that generate content for digital library collections, often through the digital reformatting of analog materials in libraries and archives; and digital preservation repositories with a clear charge to maintain enduring access to digital content.

file format policy
an official statement of preference for specific file formats over others, sometimes expressed as a recommendation, other times as a set of requirements for deposit into a digital library collection or repository.

file format
a standardized way to structure the data stored in a computer file, or a self-contained data-stream or package of related data-streams made available as a discrete entity to a computer's operating system and its programs. in this study, the term file format is used in a broad sense to encompass discrete data packages that store homogeneous content (e.g. a text file), as well as complex digital objects composed of several file or bitstream objects encased in a wrapper or bundling file format (e.g. h.264-encoded video stored within a quicktime file wrapper). essentially, anything that appears to a modern operating system's file browser as a packet of information represented by a character string, a dot, and an extension is considered a representative example of a file format. many file format policies contain stipulations that go beyond the file format level. a repository may accept jpeg 2000 files, for example, but only on the condition that they were created with a lossless compression algorithm; tiff files only if they conform to version 6.0 of the standard; or document formats such as the portable document format (pdf) only if they do not contain embedded media content. as important as these distinctions are, this study focuses on file format designations understood in a very broad sense (a packet of data represented on the file system level by a dot and an extension) and considers such refinements out of scope.

file format type
a categorization of file formats based on common use categories, as defined below:
application: file formats used to encode executable files that launch applications. for the sake of convenience, this category also comprises "container" file formats like tar and zip.
audio: file formats used to encode sound data.
computer program: file formats used to encode computer programming data.
geospatial: file formats used to represent geographic or map data.
image: file formats used to encode still images.
presentation: file formats used to encode presentations or slideshows.
spreadsheet/database: file formats used to encode tabular data, such as spreadsheets and databases.
text/document: file formats used to encode textual information or documents.
video: file formats used to encode moving images.

confidence level
few repository or digital library policies designate confidence levels in file formats in quite the same way. many differentiate between service levels, guaranteeing, for example, content migration for trusted file formats but only bit-level preservation services for others. others rank file formats by levels of confidence in their long-term accessibility.
as such, it is not uncommon to encounter repository policies that categorize file formats using subjective terms. in seeking to find common ground across so much variety, the authors of this study settled on two categories, high confidence and moderate confidence, and created the guidelines below to differentiate between them. an illustrative sketch of how these guidelines might be applied follows the two definitions.

high confidence
any file format guaranteed functional preservation by virtue of the anticipated ability to preserve its content over time, including formats designated for normalization or eventual migration to a secondary trusted file format.
in lieu of a guarantee of functional preservation, any file format designated using language like "highly recommended," "high trust," or "high probability for digital preservation" in a file format policy that differentiates between high, moderate, and/or low levels of confidence.
any file format specifically required by the digitization guidelines of a digital library program.

moderate confidence
any file format guaranteed bit-level but not functional preservation.
any file format designated using language like "weak" or "low trust" for digital preservation in a policy that differentiates between high, moderate, and/or low levels of confidence.
any file format listed as "accepted by" a repository or digital library service without any specific language designating the services implied by this acceptance.
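the sketch below is a rough, hypothetical illustration of how a policy statement's wording could be mapped onto these two categories. the phrase lists and the classify_confidence function are assumptions made for demonstration only; they are not the coding instrument the authors used.

    # A hypothetical coding aid for the confidence guidelines above.
    # The phrase lists and classification logic are illustrative assumptions,
    # not the instrument actually used in this study.

    HIGH_CONFIDENCE_PHRASES = [
        "functional preservation",
        "highly recommended",
        "high trust",
        "high probability for digital preservation",
    ]

    MODERATE_CONFIDENCE_PHRASES = [
        "bit-level",
        "weak",
        "low trust",
        "accepted",
    ]

    def classify_confidence(policy_statement: str) -> str:
        """Map a policy statement onto the study's two confidence categories."""
        text = policy_statement.lower()
        if any(phrase in text for phrase in HIGH_CONFIDENCE_PHRASES):
            return "high confidence"
        if any(phrase in text for phrase in MODERATE_CONFIDENCE_PHRASES):
            return "moderate confidence"
        return "unclassified"

    print(classify_confidence("tiff 6.0: highly recommended for deposit"))       # high confidence
    print(classify_confidence("wmv: accepted for bit-level preservation only"))  # moderate confidence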
appendix ii

file format policy data collected for this article and submitted to d-lib magazine is available here in pdf. the data is also downloadable in pdf, xls and csv formats from the ideals institutional repository at http://hdl.handle.net/2142/47421.

about the authors

kyle rimkus is the preservation librarian and an assistant professor of library administration at the university of illinois at urbana-champaign (uiuc), where he leads the library's digital preservation program. before joining the uiuc library in 2012, he was head of digital scholarship and programs at the university of miami libraries and project coordinator for the digital library of the caribbean at florida international university. kyle has an m.s. in library and information science and an m.a. in french studies from uiuc, and a b.a. in germanic studies from the university of illinois at chicago.

thomas padilla is digital humanities librarian at michigan state university, where he works to augment digital collections, foster digital humanities instruction and collaboration, and expand data curation services. while earning his mslis, he worked in the preservation unit and the scholarly commons of the university of illinois at urbana-champaign libraries. before attending uiuc he worked at the library of congress, helping to extend digital preservation professional development opportunities throughout the united states. he holds an mslis from the university of illinois at urbana-champaign and an ma in world history from san francisco state university.

tracy popp is digital preservation coordinator at the university of illinois at urbana-champaign library, where her work focuses on the preservation of and access to born-digital library collections. she manages the library's born digital content preservation service and its web archiving program. currently, she is investigating practical and scalable access and preservation methods for complex and obsolete file types. she has an m.s. and certificate of advanced study in library and information science from the university of illinois at urbana-champaign.

greer martin is the digital public library of america metadata librarian/archivist at the university of georgia. she describes digitized cultural heritage materials from local institutions for inclusion in the digital public library of america (dpla) and enhances metadata for aggregation by the dpla. she completed a master of science in library and information science in 2013 from the graduate school of library and information science at the university of illinois at urbana-champaign.

copyright © 2014 kyle rimkus, thomas padilla, tracy popp and greer martin

search web services: the oasis sws technical committee work
the abstract protocol definition, opensearch binding, and sru/cql 2.0

d-lib magazine, january/february 2009, volume 15 number 1/2, issn 1082-9873

ray denenberg
library of congress

abstract

the oasis search web services technical committee is developing search and retrieval web services, integrating various approaches under a unifying model, an abstract protocol definition. sru/cql and opensearch are the two approaches featured by the current work, and we hope that additional protocols will be similarly integrated into this model. the model provides for the development of bindings. three bindings will be developed by the committee: sru 1.2, opensearch, and sru 2.0. these three are so-called "static" bindings; they are human-readable documents. the first two are simply renderings of the respective existing specifications. the sru 2.0 binding, however, is a major new version of sru, and there will also be a new version of the companion query language, cql 2.0. the model also defines the concept of a "dynamic" binding, a machine-readable description file that a server provides for retrieval by a client, which may then dynamically configure itself to access that server. the premise of the dynamic binding concept is that any server – even one that pre-dated the concept – need only provide a self-description in order to be accessible. a client will be able to access the server simply by reading and interpreting the description and, based on that description, formulating a request (including a query) and interpreting the response. of course, the premise behind this concept is a standard description language, and that will also be part of the oasis work.

1. introduction

in 2007 oasis [1] created the search web services technical committee. its charge is to define search and retrieval web services based on various current technologies, most notably search and retrieval via url (sru) [2] – along with its companion contextual query language (cql) [3] – and opensearch [4]. sru (together with cql) and opensearch are at different ends of the search/retrieval complexity spectrum, and these two protocols are the focus of the oasis work. (a major part of the work will be revisions to sru and cql.) the foundation of the work will be an abstract protocol definition (apd), a reference model by which these and other protocols can be described.

1.1 overview of this article

section 1.2 provides some brief historical notes about sru and cql. section 2 describes the abstract protocol definition and how concrete specifications, called bindings, are derived. a binding may be static or dynamic. the committee's focus is on static bindings, but the apd lays the foundation for dynamic bindings.
the nature of the description language used for dynamic bindings is briefly described, and an example is provided in section 2.2. the opensearch binding is described in section 3 and the sru 2.0 binding in section 4; new features of sru and cql are described in detail. finally, the projected schedule of work is outlined in section 5, followed by a concluding section.

1.2 historical notes on sru and cql

sru was originally conceived as one of two companion protocols, srw [5] and sru. in srw (search and retrieve web service), messages are conveyed as xml over http via soap. with sru, clients send messages to servers by url. srw is no longer presented as a separate protocol but rather as a variation of sru, referred to as "sru via http soap" [6]. collectively, this suite (the two protocols and cql) was originally called "z39.50 next generation", and subsequently "z39.50 international next generation" ("zing"). these names are no longer used.

development of srw/sru/cql began in 2000. an experimental version, 1.0, was released in november 2002, and the first official version, 1.1, in february 2004. a minor revision, 1.2, was released in 2007 in preparation for the work that was soon to begin within oasis, including development of sru/cql version 2.0.

2. the abstract protocol definition and its bindings

although version 2.0 of sru and cql will be the most visible specifications produced by the committee, the foundation of the work will be the abstract protocol definition, which specifies abstract search request parameters and abstract search response elements. these abstractions provide the framework for the definitions of application protocol bindings.

2.1 bindings

a binding is a realization of the apd: it is a concrete protocol specification derived by mapping abstractions within the apd to real objects and concepts. the first such binding developed by the committee will be for sru version 1.2. the sru 1.2 specification already exists, and the binding is being developed in part as proof-of-concept. it describes how the abstractions defined in the apd are realized in sru 1.2, and it supplies the abstract-to-concrete mappings of request parameters and response elements.

2.1.1 example of an abstraction and its realization

as an example, the apd describes the following abstraction: a server exposes a datastore for access by a remote client for purposes of search and retrieval. the datastore is a collection of units of data, and such a unit is referred to as an item. thus the apd introduces the abstractions datastore and item. the sru binding explains that a 'datastore' as described in the apd is realized as a database in the binding, and that an 'item' is realized as a database record. corresponding to the item abstraction, the apd defines the abstract parameter 'maximumitems' (the number of items requested to be returned); the analogous parameter in the sru 1.2 binding is 'maximumrecords'. as another example, the apd defines the abstract parameter 'responseitemtype', and the corresponding sru 1.2 parameter is 'recordschema'. (a sample request url using these concrete parameter names appears after section 2.1.2 below.)

2.1.2 static and dynamic bindings

a binding may be static or dynamic. the sru 1.2 binding is a static binding: it is specified by a human-readable document. the concept of a static binding isn't very interesting for sru 1.2, because an sru 1.2 specification already exists. an sru 2.0 binding will be more interesting, since no sru 2.0 specification yet exists. (the oasis sru 2.0 binding will be the first.)
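to make the abstract-to-concrete mapping of section 2.1.1 tangible, the sketch below assembles an sru 1.2 searchretrieve url. the parameter names are those of the published sru 1.2 specification; the server address and the query value are hypothetical placeholders.

    # A minimal sketch of an SRU 1.2 searchRetrieve request by URL,
    # showing how APD abstractions map onto SRU 1.2 parameter names:
    #   maximumItems     -> maximumRecords
    #   responseItemType -> recordSchema
    # The base URL and the CQL query value are hypothetical placeholders.
    from urllib.parse import urlencode

    base_url = "http://example.org/sru"  # hypothetical SRU server

    params = {
        "operation": "searchRetrieve",     # SRU 1.2 operation name
        "version": "1.2",
        "query": 'dc.title = "dinosaur"',  # a simple CQL query
        "startRecord": "1",
        "maximumRecords": "10",            # realizes the abstract 'maximumItems'
        "recordSchema": "dc",              # realizes the abstract 'responseItemType'
    }

    print(base_url + "?" + urlencode(params))

a server answering such a request returns an xml searchretrieveresponse whose elements include numberofrecords and the requested records, the same elements addressed by the xpath expressions in the description file example of section 2.2.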
in contrast to a static binding, which takes the form of a human-readable document, a dynamic binding takes the form of a machine-readable description file that a server provides for retrieval by a client, which may then dynamically configure itself to access that server. it is not a specification or standard; its only manifestation is a file on a server. the premise of the dynamic binding concept is that any server – even one that pre-dated the concept – need only provide a self-description. it need make no other change in order to be accessible. a client will be able to access a server that provides a description simply by reading and interpreting the description and, based on that description, formulating a request (including a query) and interpreting the response. of course, the premise behind this concept is a standard description language, and that will be part of the oasis work. in section 2.2, there is an example description file.

2.2 a description file example

before looking at the example below, a word of caution: the description language has not yet been drafted, so not only is this example hypothetical, but the syntax is as well. the hypothetical description file below contains a general description element, a request-formulation element, and a response-interpretation element:

general description: science fiction database; short name "scifi"; maintained by ralph levan, [email protected]
request formulation: http://orlabs.oclc.org/srw/search/scifi?query= with a sample query "it's a good life" and the fixed parameters &version=1.2&operation=searchretrieve&maximumrecords=12&startrecord=1
response interpretation:
/srw:searchretrieveresponse/numberofrecords (number of results)
/srw:searchretrieveresponse/srw:records/srw:record/srw:recorddata (the returned records)
/srw:searchretrieveresponse/srw:diagnostics (diagnostics)

2.2.1 request formulation

within the request-formulation element, the