During the Fall 2013 semester, I started reading, thinking, and writing about the impact of academic databases such as JSTOR and Gale: Artemis Primary Sources on research and scholarship. I learned that databases shape the questions that scholars can ask and the arguments that they can make through their search interfaces, algorithms, and the items that are contained in or absent from their collections. Algorithms in databases have been found to have an “epistemological power” through their ranking of search results, yet understanding why certain search results appear is very difficult even for the team that engineered the algorithms. Knowledge of how databases work is also extremely limited because information about database structures is scanty or unavailable and constantly changing.
Despite the ubiquity of databases, academics are often unaware of the constraints that databases place on their research. Lack of information about the impact of database structures and content on research is an obstacle to scholarly inquiry because it means that scholars may not be aware of and cannot account for how databases affect their interpretations of search results or text analysis.
Digital humanists have examined both the benefits and perils of research in academic databases. The introduction of digital tools for text analysis to identify patterns common to large numbers of documents has added to the complexity of scholars’ tasks. Historian Jo Guldi writes that “Keyword searching [in databases] . . . allows the historian to propose longer questions, bigger questions”; yet in an earlier article she also remarks on the challenges posed by search, saying that “Each digital database has constraints that render historiographical interventions based upon scholars’ queries initially suspect.” Scholars such as Caleb McDaniel, Miriam Posner, James Mussell, Bob Nicholson, and Ian Milligan have written about the skewed search results of databases of historical newspapers, the impossibility of finding provenance information to contextualize what database users are seeing, and the lack of information about OCR accuracy. Besides these issues, scholars should also have an understanding of errors in digital collections. For example, scholars using Google Books would probably want to know that thirty-six percent of Google Books have errors in either author, title, publisher, or year of publication metadata.
Historian Tim Hitchcock talks about the importance of understanding the types of items in digital collections, saying, “Until we get around to including the non-canonical, the non-Western, the non-textual and the non-elite, we are unlikely to be very surprised.” Because they can contain what seems to be an almost infinite number of documents, archival databases offer an appearance of exhaustiveness that does not yield easily to a scholar’s probing. But while a gestalt understanding of a primary source database is crucial to determining the representation of items in the collection, the limited bibliographic information that is available about academic databases is scattered or unknown to most scholars.
As one step toward overcoming scholars’ lack of knowledge about the biases inherent in databases, I am working with a team of other students in the DH Praxis Seminar at the CUNY Graduate Center to create Beyond Citation, a website to aggregate bibliographic information about major humanities databases so that scholars can understand the significance of the material they have gleaned. Beyond Citation will help humanities scholars to practice critical thinking about research in databases.
The benefit of encouraging critical thinking about databases is more than merely facilitating research. Critical thinking about databases counters scholars’ “tendency to consider the archive as a hermetically-sealed space in which historical material can be preserved untouched,” and “[forces] a recognition of the constructed nature of evidence and its relation to the absent past.”
The Beyond Citation team has selected a set of humanities databases for the initial site launch and is working out the nitty-gritty of platform and server-side database functionality as well as completing research about the databases that we have chosen to cover on the site.
By providing structured information about databases and articles about research strategies, Beyond Citation will frame the common problems that scholars face when evaluating the results of their work in databases. Scholars will be able to enrich the data on the site with their own contributions, participate in reflective conversations, and share highly situated stories about their experiences of working in databases. While the early version of the website, to be launched in May 2014, will have a limited scope, the idea is that the site will eventually become a research workshop.
As information scientist Ryan Shaw observes, “In an era of vast digital archives and powerful search algorithms, the key challenge of organizing information is to construct systems that aid understanding, contextualizing, and orienting oneself within a mass of resources.” By making essential bibliographic information about the structures and content of academic databases accessible to scholars, Beyond Citation will take an important step toward updating the scholarly apparatus to encourage critical thinking about databases and their effect on research and scholarship.
The idea for Beyond Citation originated in my encounter with a blog post by Caleb McDaniel about historians’ research practices, which suggested the creation of an “online repository” of information about proprietary databases.