Category Archives: Beyond Citation

Announcing the launch of Beyond Citation

The Beyond Citation project team is thrilled to announce the public launch of our website at BeyondCitation.org. Even though scholars use academic databases every day, it is difficult to find information about how the databases work and what is in them. Beyond Citation gathers information about academic databases in one place to enable traditional humanities scholars and digital humanists to get a better sense of the content and searching mechanisms in databases. The goal of Beyond Citation is to make academic databases more transparent to users and to encourage critical thinking about academic databases.

The audience for the site is scholars, librarians, research enthusiasts or anyone who:

  • Uses academic databases and wants to learn more about what is in them
  • Is frustrated with academic databases and wants tips about how to more effectively search them
  • Wants to share their knowledge or experiences of academic databases with others

We invite you to participate in Beyond Citation by:

  • Starting or adding to a thread in the Community Forum
  • Proposing an article or blog post that you would like to write
  • Offering to write an entire entry for a new database

Please visit BeyondCitation.org. Follow us on Twitter @beyondcitation.

Academic Databases: Beyond Digital Literacy

Basic digital literacy for scholarly research includes knowing how to access digital archives, search them, and interpret their results.

Another component of digital literacy is familiarity with the semiotics of the interface: knowing how to “read” the instructions and symbols that give the user an idea of what invisible material lives in a database. These interfaces make the contents accessible and also convey, before a search is even conducted, a range of search possibilities. The interface suggests something about the most useful metadata that the archive contains and the way the data can be accessed.

A user, then, can glean understanding about the mechanics of the database through the interface alone. This additional level of digital literacy is helpful, but still represents a limited understanding of databases. Many of the commonly used archives that humanities scholars, librarians, and historians use are proprietary, and even with some information and educated guesses about these archives’ metadata structures, it’s difficult or impossible to go a step deeper and discern exactly how the search algorithms work and how the database is designed.

This is an issue of emerging importance for digital scholars, and is prompting historians and others to think about what appears in search results and what doesn’t. But even if researchers knew how every database and its search algorithms worked, that wouldn’t resolve all the issues and theoretical implications of digital research and scholarship. As Ben Schmidt has pointed out, “database design constrains the ways historians can use digital sources.”

The limits of database design are an important window into the computational disciplines that enable information science in the first place. Programming machines to search a hybrid of digitized source materials is of course a broad problem, involving a myriad of methods that are constantly evolving and becoming more powerful. Therefore, it’s interesting to ask: When are the issues associated with digital research contingent on computational science and when are they contingent on the way that proprietary archives and databases choose to implement the latest algorithms?

One way to approach this question is to start with a distinction that William J. Turkel makes between scholars who use subscription archives and those who write code to mine massive data sets themselves. The literary scholar Ted Underwood has also discussed searching academic databases and data mining in parallel, commenting, “I suspect that many humanists who think they don’t need ‘big data’ approaches are actually using those approaches every day when they run Google searches . . . Search is already a form of data mining. It’s just not a very rigorous form: it’s guaranteed only to produce confirmation of the theses you bring to it.”

Thinking about the distinction between proprietary database engineers and dataset hackers might foster the assumption that those two parties have radically different agendas or methods for searching born-digital and digitized archive material. But while independent programmers represent a new frontier of sorts (scholars willing to learn the methods needed to do their own research and retrieve information from their own source material), they aren’t necessarily confronted by any fewer database design limitations than the engineers who work at Gale. This gets at the heart of what’s at stake for researchers in a digital age, and why this is an apt time to explore the way digital archives work on a computational level.

Many automated, machine-driven search techniques are a set of instructions that don’t always produce predictable results, and can be difficult to reverse engineer even when bugs are discovered. Corporate engineers don’t have full control over the results they get, and neither do hackers or the authors of open-source software.

Why is that important? One goal of Beyond Citation is to explore and provide information on how databases work, so that scholars can better understand their research results. One could argue that scholars require so-called “neutral” technology: systems that don’t favor any one type or set of results over another. And it’s easier to understand and confirm search neutrality if algorithms and source code are publicly available. But exactly what is such neutrality, and would we know it if we saw it? Any algorithm, secret or otherwise, is a product of disciplinary constraints and intersections, and reveals the boundaries of what’s computationally possible. In short, the “correctness” of any algorithm is hard to nail down.

When we look more closely at the concept of neutrality, we see that both the user and the engineer are implicated in algorithmic design choices. James Grimmelmann, a lawyer, has made the compelling argument that “Search is inherently subjective: it always involves guessing the diverse and unknown intentions of users.” Code that’s written as a service to users is written with an interaction already in mind. Evaluating the nuances of search algorithms and determining the impact they make on the integrity of one’s research involves acknowledging these kinds of imagined dialogues.

These are just some exploratory thoughts, as none of these questions about database design and search can be taken in isolation. Beyond Citation, then, is a starting point for going beyond digital literacy in multiple directions. We are gathering and presenting the kinds of knowledge that might allow scholars to distinguish between computational limitations, the limits of metadata and the ways it’s structured, and the agendas of a proprietary company. As the project evolves, we ourselves hope to deepen the kinds of skills and knowledge that allow us to present such information in the most meaningful and usable ways.

Thinking About Authority and Academic Databases

Beyond Citation hopes to encourage critical thinking by scholars about academic databases. But what do we mean by critical thinking? Media culture scholar Wendy Hui Kyong Chun has defined critique as “not attacking what you think is false, but thinking through the limitations and possibilities of what you think is true.”

One question that the Beyond Citation team is considering is the scholarly authority of a database. Yale University Library addresses the question of scholarly authority in a handout entitled the “Web vs. Library Databases,” a guide for undergraduates. The online PDF states that information on the web is “seldom regulated, which means the authority is often in doubt.” By contrast, “authority and trustworthiness are virtually guaranteed” to the user of library databases.

Let’s leave aside for the moment the question of whether scholars should always prefer the “regulated” information of databases to the unruly data found on the Internet. While Yale Library may simply be using shorthand to explain academic databases to undergraduates, to the extent that they are equating databases and trustworthiness, I think they may be ceding authority to databases too readily and missing some of the complexity of the current digital information landscape.

Yale Library cites Academic Search and Lexis-Nexis as examples of databases. Lexis-Nexis is a compendium of news articles, broadcast transcripts, press releases, and law cases, as well as Internet miscellany. Lexis-Nexis is probably authoritative in the sense that one can be comfortable that the items accessed are the actual articles obtained directly from publishers and thus contain the complete texts of articles (with images removed). In that limited sense, items in Lexis-Nexis are certainly more reliable than results obtained from a web search. (This isn’t true, however, for media historians who want to see the entire page with pictures and advertisements included; for that, try the web or another newspaper database.) Despite Lexis-Nexis’s relatively long pedigree for an electronic database, careful scrutiny of results is just as crucial when searching it as it is for an Internet search.

In some instances, especially when seeking information about non-mainstream topics, searching the Internet may be a better option. Composition and rhetoric scholar Janine Solberg has written about her experience of research in digital environments, in particular how full-text searches on Amazon, Google Books, the Internet Archive and HathiTrust enabled her to locate information that she was unable to find in conventional library catalogs. She says, “Web-based searching allowed me not only to thicken my rhetorical scene more quickly but also to rapidly test and refine questions and hypotheses.” In the same article, Solberg calls for “more explicit reflection and discipline-specific conversation around the uses and shaping effects of these [digital] technologies” and recommends as a method “sharing and circulating research narratives that make the processes of historical research visible to a wider audience . . . with particular attention to the mediating role of technologies.”

Adding to the challenge of thinking critically about academic databases is their dynamic nature. The terrain of library databases is changing as more libraries adopt proprietary “discovery” systems that search across the entire set of databases to which libraries subscribe. For example, the number of JSTOR users has dropped “as much as 50%” with installations of discovery systems and changes in Google’s algorithms. Shifts in discovery have led to pointed discussions between associations of librarians and database publishers about the lack of transparency of search mechanisms. In 2012, Tim Collins, the president of EBSCO, a major database and discovery system vendor, found it necessary to address the question of whether vendors of discovery systems favor their own content in searches, denying that they do. There is, however, no way for anyone outside the companies to verify his statement because the vendors will not reveal their search algorithms.

While understanding the ranking of search results in academic databases is an open question, a recent study comparing research in databases, Google Scholar and library discovery systems by Asher et al. found that “students imbued the search tools themselves with a great deal of authority,” often by relying on the brand name of the database. More than 90% of students in the study never went past the first page of search results. As the study notes, “students are de facto outsourcing much of the evaluation process to the search algorithm itself.”

In addition, lest one imagine that scholars are immune to an uncritical perspective on digital sources, historian Ian Milligan, in his study of the citation of newspaper databases in Canadian dissertations, says that scholars have adopted these databases without achieving a concomitant perspective on their shortcomings. Echoing what Asher et al. found among undergraduate students, Milligan says, “Researchers cite what they find online.”

If critique is, as Chun says, thinking through the limitations and possibilities of what we think is true, then perhaps by encouraging reflective conversations among scholars about how these ubiquitous digital tools shape research and the production of knowledge, Beyond Citation’s efforts will be another step toward that critique.

We are at blog.beyondcitation.org. Email us at BeyondCitation [at] gmail [dot] com or follow us on Twitter @beyondcitation as we get ready for the launch in May.

Beyond Citation: Critical thinking about academic databases

During the Fall 2013 semester, I started reading, thinking and writing about the impact of academic databases such as JSTOR and Gale Artemis: Primary Sources on research and scholarship. I learned that databases shape the questions that can be asked and the arguments that can be made by scholars through search interfaces, algorithms, and the items that are contained in or absent from their collections. Although algorithms in databases have been found to have an “epistemological power” through their ranking of search results, understanding why certain search results appear is very difficult even for the team that engineered the algorithms. Moreover, knowledge of how databases work is extremely limited because information about database structures is scanty or unavailable and constantly changing.

Despite the ubiquity of databases, academics are often unaware of the constraints that databases place on their research. Lack of information about the impact of database structures and content on research is an obstacle to scholarly inquiry because it means that scholars may not be aware of and cannot account for how databases affect their interpretations of search results or text analysis.

Digital humanists have examined both the benefits and perils of research in academic databases. The introduction of digital tools for text analysis to identify patterns common to large numbers of documents has added to the complexity of scholars’ tasks. Historian Jo Guldi writes that “Keyword searching [in databases] . . . allows the historian to propose longer questions, bigger questions”; yet in an earlier article she also remarks on the challenges posed by search, saying that “Each digital database has constraints that render historiographical interventions based upon scholars’ queries initially suspect.” Scholars such as Caleb McDaniel, Miriam Posner, James Mussell, Bob Nicholson and Ian Milligan have written about the skewed search results of databases of historical newspapers, the impossibility of finding provenance information to contextualize what database users are seeing, and the lack of information about OCR accuracy. Besides these issues, scholars should also have an understanding of errors in digital collections. For example, scholars using Google Books would probably want to know that thirty-six percent of Google Books have errors in either author, title, publisher, or year of publication metadata.

Historian Tim Hitchcock talks about the importance of understanding the types of items in digital collections, saying, “Until we get around to including the non-canonical, the non-Western, the non-textual and the non-elite, we are unlikely to be very surprised.” Because they can contain what seems to be an almost infinite number of documents, archival databases offer an appearance of exhaustiveness that does not yield easily to a scholar’s probing. But while a gestalt understanding of a primary source database is crucial to determining the representation of items in the collection, the limited bibliographic information that is available about academic databases is scattered or unknown to most scholars.

As one step toward overcoming scholars’ lack of knowledge about the biases inherent in databases, I am working with a team of other students in the DH Praxis Seminar at the CUNY Graduate Center to create Beyond Citation, a website to aggregate bibliographic information about major humanities databases so that scholars can understand the significance of the material they have gleaned. Beyond Citation will help humanities scholars to practice critical thinking about research in databases.

The benefit of encouraging critical thinking about databases is more than merely facilitating research. Critical thinking about databases counters scholars’ “tendency to consider the archive as a hermetically-sealed space in which historical material can be preserved untouched,” and “[forces] a recognition of the constructed nature of evidence and its relation to the absent past.”

The Beyond Citation team has selected a set of humanities databases for the initial site launch and is working out the nitty-gritty of platform and server-side database functionality as well as completing research about the databases that we have chosen to cover on the site.

By providing structured information about databases and articles about research strategies, Beyond Citation will frame the common problems that scholars face when evaluating the results of their work in databases. Scholars will be able to enrich the data on the site with their own contributions, participate in reflective conversations and share highly situated stories about their experiences of working in databases. While an early version of the website to be launched in May 2014 will have a limited scope, the idea is that the site will eventually become a research workshop.

As information scientist Ryan Shaw observes, “In an era of vast digital archives and powerful search algorithms, the key challenge of organizing information is to construct systems that aid understanding, contextualizing, and orienting oneself within a mass of resources.” By making essential bibliographic information about the structures and content of academic databases accessible to scholars, Beyond Citation will take an important step to updating the scholarly apparatus to encourage critical thinking about databases and their effect on research and scholarship.

Acknowledgments

The idea for Beyond Citation originated from my encounter with a blog post by Caleb McDaniel about historians’ research practices suggesting the creation of an “online repository” of information about proprietary databases.

Beyond Citation: Building digital tools to explain digital tools

Over the last couple of weeks, the Beyond Citation team has transformed into a web production team of sorts, focused on making key decisions about platform, site architecture, user interaction, design, and communication.

Beyond Citation—a project to build a website that aggregates accessible, structured information about scholarly databases—has the potential to enhance how scholars approach, use, and interpret resources from some of today’s most widely used digital collections. While it would be straightforward for our team to simply gather and publish information about those resources, our challenge is to build a digital tool that supports meaningful interaction with that information, one that can also scale in the future and cater to a community of contributors.

In the project’s nascent stages, the tactical concerns before us are familiar: we’re taking on the common challenge of building and launching a website or web app. Thrust into the very practical realm of software, decisions, and constraints, we find that discussions of critical theory get put off in favor of debating the merits of WordPress and Drupal. These powerful tools place the project in a digital ecosystem much wider than academia. The platform we have chosen, WordPress, pushes us deeper still into the wide worlds of relational databases, server-side scripting, and content management: the digital tools that will allow us to explain other digital tools.

As we construct the basic building blocks for the site, we find that the best way to focus our approach is by seeking the advice of experts, reading blogs about WordPress customization, and learning more about MySQL and WordPress taxonomies. The robust open source community behind WordPress has enabled us to confirm that the technical requirements for the Beyond Citation website can be met many times over through combinations of WordPress plugins.

Something to consider while building this tool with WordPress is that we are seeking to publish data about proprietary tools by using open source technology. Perhaps this isn’t really so unusual: we see it in a similar vein to the increasingly popular APIs that allow for easier data aggregation or configuration from multiple sources. And toolsets that are hybrids of proprietary and open source systems are extremely common.

But there’s an important depth to explore when thinking about Beyond Citation as a bridge between proprietary and open source systems. The idea of “exposed” information, built on “hidden” information, represents a direction that the project can try to push technically. For instance, if in a future iteration the team can uncover information about scholarly databases that’s not just hard to find, but not openly available (such as how search algorithms work, or the criteria behind publisher contracts), then I think the value of Beyond Citation increases in a direction most closely aligned with its original ambition. This would also allow the project to explore the similarities and differences in how scholarly databases work in more meaningful ways.

Before we can do that, everyone on the team is doing their part to fill in knowledge gaps, and discovering “how technology works” on multiple levels. Just as we are researching the types of information about scholarly databases that we want the project to highlight, we are also researching the types of data-driven web frameworks that could easily support such information. Like many Digital Humanities projects, Beyond Citation is about knowledge acquisition and aggregation for both developers and researchers. We are challenging ourselves to learn as much as we can about one set of digital tools before we can communicate new information about other sets of digital tools—both of which are moving targets, evolving in their own realms of authorship.

As we work towards a May launch date for an early version of the site, we realize that the authors of digital projects need a constant appetite for more knowledge—technical knowledge and subject-matter knowledge—in order to create and maintain an authoritative tool.

Beyond Citation: Understanding Databases

Every year, more and more research is done by scholars online via academic databases. Print journals, scholarly monographs, newspapers, periodical indexes, and even ephemera and image collections are steadily transitioning from print to electronic.

Historically, research using print collections took place in library reading rooms with material owned by the library. Increasingly, research using electronic collections takes place outside of the library using proprietary digital platforms subscribed to by libraries. This change greatly affects how libraries function — an ownership model morphs into an access model — and how research is done. Database searches are crucial to uncovering information, but little is known about how these searches work. Additionally, it’s not always easy to find what full text content is covered in these database titles.

The goal of Beyond Citation is to help the researcher to better understand how academic databases work, and to provide easier access to the databases’ holdings information. For the CUNY Digital Praxis Seminar, the Beyond Citation team needed to determine which databases to feature in its initial launch, and what information to gather about each title.

First, we wanted to feature humanities databases and steer away from STEM (science, technology, engineering, and mathematics) titles. Second, we ideally wanted to cover titles that were available at the CUNY Graduate Center’s Mina Rees Library, and we wanted representation from the big three “e” vendors: EBSCO, Gale, and ProQuest. Additionally, we wanted to cover different kinds of content, including historical newspapers, scholarly journals, and historical e-books from both non-profit and for-profit companies.
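
For readers who think in code, the selection criteria above can be sketched as a small filter plus a coverage check. This is only an illustration in Python: the record fields, the helper names `shortlist` and `coverage_gaps`, and the placeholder titles ("Title A" and so on) are all assumptions made for the example, not the team's actual process or the site's data model.

```python
from dataclasses import dataclass

# Hypothetical record; the field names are illustrative, not the site's schema.
@dataclass(frozen=True)
class DatabaseTitle:
    name: str
    vendor: str
    content_type: str      # e.g. "historical newspapers"
    discipline: str        # "humanities" or "stem"
    held_by_library: bool  # available at the home library?
    nonprofit: bool        # is the provider a non-profit?

BIG_THREE = {"EBSCO", "Gale", "ProQuest"}
WANTED_TYPES = {"historical newspapers", "scholarly journals", "historical e-books"}

def shortlist(candidates):
    """Keep humanities titles that the home library holds."""
    return [t for t in candidates
            if t.discipline == "humanities" and t.held_by_library]

def coverage_gaps(selected):
    """List the desired vendors, content types, and sectors not yet represented."""
    vendors = {t.vendor for t in selected}
    types = {t.content_type for t in selected}
    sectors = {"non-profit" if t.nonprofit else "for-profit" for t in selected}
    return {
        "vendors": sorted(BIG_THREE - vendors),
        "content_types": sorted(WANTED_TYPES - types),
        "sectors": sorted({"non-profit", "for-profit"} - sectors),
    }

# Placeholder candidates, invented purely to exercise the functions.
candidates = [
    DatabaseTitle("Title A", "Gale", "historical newspapers", "humanities", True, False),
    DatabaseTitle("Title B", "ProQuest", "historical e-books", "humanities", True, False),
    DatabaseTitle("Title C", "Example Nonprofit", "scholarly journals", "humanities", True, True),
    DatabaseTitle("Title D", "EBSCO", "scholarly journals", "stem", True, False),
    DatabaseTitle("Title E", "EBSCO", "scholarly journals", "humanities", False, False),
]

selected = shortlist(candidates)
gaps = coverage_gaps(selected)
print([t.name for t in selected])
print(gaps)
```

With these placeholder records, the STEM title and the title the library does not hold drop out, and the coverage check reports that EBSCO is still unrepresented, which is the kind of gap that would send a team back to look for another title.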

After much discussion, the Beyond Citation team has decided to focus on the following databases and collections for its initial launch.

Google Books

HathiTrust

ArtStor

ProQuest Historical Newspapers

19th Century U.S. Newspapers (Gale)

Early English Books Online (EEBO) with TCP (Text Creation Partnership) (ProQuest)

Gale Artemis: Primary Sources – Nineteenth Century Collections Online (NCCO) and Eighteenth Century Collections Online (ECCO)

JSTOR

Project Muse (Johns Hopkins University Press)

Artemis Literature Resources (Gale)

EBSCO Humanities Source

We are open to and eager for feedback from users of these titles, or from any other researchers and librarians who use databases in their research. More to come in future posts on what information we hope to gather from each title, and how that information will be displayed. You can reach us at BeyondCitation [at] gmail.com