In Jefferson Bailey’s brilliant article on digital archives, he writes, “Digital objects will have an identifier, yes, but where they ‘rest’ in intellectual space is contingent, mutable. The key point is that, even at the level of representation, arrangement is dynamic . . . Arrangement, as we think of it, is no longer a process of imposing intellectualized hierarchies or physical relocation; instead, it becomes largely automated, algorithmic, and batch processed.”
Digital humanists have increasingly embraced text mining and other techniques of data manipulation both within bodies of texts that they control and in proprietary databases. When presenting their findings, they must also consider how to represent their methodology; to describe the construction of the databases and search mechanisms used in their work; and to make available the data itself.
Several writers (Schmidt, Bauer, and Gibbs and Owens) have addressed the responsibility of digital scholars to make their methods transparent and their data publicly available, and others (Norwood and Gregg) have addressed the need to understand how databases differ from one another.
Reading narratives of the research methods of David Mimno and Matt Jockers (as well as listening to Mimno’s recent lecture) has been useful for me in my ongoing thinking about the issues of how digital humanists use data and how they report on their findings. Mimno and Jockers are exemplars of transparency in their recitation of methods and in the provision of access to their datasets so that other scholars might be able to explore their work.
While not every digital humanist will use topic modeling to the extent that Mimno and Jockers do, it is fair to say that, in the future, almost all scholars will be using commercial databases to access documents and that this access will come with some version of text analysis. But what do search and text analysis mean in commercial databases? And how should they be described? In relation to keyword searching in proprietary databases, the historian Caleb McDaniel has pointed out that historians do not have codified practices for the use and citation of databases of primary materials. He says that to evaluate proprietary databases correctly, scholars should know whether the databases are created by OCR, what the default search conventions are, whether the databases use fuzzy hits, and when they are updated, among other issues. At this time, commercial database publishers disclose little of this information about how their products are constructed. McDaniel recommends the creation of an “online repository” of information about commercial databases and also suggests that historians develop a stylesheet for database citation practices.
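To make McDaniel’s list concrete, here is a minimal sketch of what a single entry in such a repository, or a note attached to a citation, might record. The class name, the field names, and the `undisclosed` helper are all hypothetical; they simply restate the questions listed above and are not drawn from McDaniel’s proposal or from any existing stylesheet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatabaseSearchRecord:
    """Hypothetical record of how a database search was run.

    A value of None means the publisher has not disclosed the information.
    """
    database: str
    query: str
    created_by_ocr: Optional[bool] = None
    default_search_conventions: Optional[str] = None
    uses_fuzzy_matching: Optional[bool] = None
    last_updated: Optional[str] = None

    def undisclosed(self):
        """Return the names of the fields that are still unknown."""
        return [name for name, value in vars(self).items() if value is None]
```

A scholar could fill in what is known and list the rest as undisclosed, which makes the gaps in a vendor’s documentation visible in the citation itself.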
Why is this lack of information about the mechanisms of commercial databases important? Because, as Bailey says, the arrangement of digital objects in archives (and databases) is automated, algorithmic, and batch processed. Yet, as historian Ben Schmidt has noted, “database design constrains the ways historians can use digital sources” and proprietary databases “force” “syntax” on searches. Since database search results are contingent upon database structures, scholars who make claims related to the frequency of search terms must, at a minimum, understand those structures in order to reckon with the methodological arguments that might be raised against their conclusions.
I recently attended a presentation about a commercial database company’s venture into what I call “text mining lite.” What I learned has only bolstered my ideas about the importance of understanding the practices of proprietary database publishing and the necessity of scholars having access to that information. The company, Gale, one of the larger database publishers, seems to be courting the digital humanities community (or at least their idea of the digital humanities community). Gale is combining access to multiple databases of primary eighteenth- and nineteenth-century sources through an interface called Artemis, which allows the creation of “term clusters.” These are clusters of words and phrases that occur a statistically relevant number of times within the user’s search results. One of the crucial things to know about the algorithms used is that Artemis term clusters are based on the first 100 words of the first 100 search results per content type. In practice, for search results that might include monographs, manuscripts, and newspapers as types, this means that the algorithm runs only within the first one hundred words of the first one hundred monographs, the first one hundred words of the first one hundred manuscripts, and the first one hundred words of the first one hundred newspaper articles. [I will describe Artemis at more length in Part Two of this blog post.] Clearly, any conclusions drawn by scholars and others using term clusters in Artemis should include information about the construction of the database and the limitations of the search mechanisms and text analysis tools.
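To see how much that sampling rule leaves out, it may help to sketch it in a few lines of Python. This is only an illustration of the “first 100 words of the first 100 results per content type” window as described in the presentation; it is not Gale’s code, the function names are my own, and the plain word count at the end is a stand-in for whatever statistic Artemis actually uses to form clusters, which I do not know.

```python
from collections import Counter, defaultdict


def sample_window(results, max_results=100, max_words=100):
    """Keep the first `max_words` words of the first `max_results`
    results for each content type.

    `results` is a list of (content_type, text) pairs in ranked order.
    """
    kept = defaultdict(list)
    for content_type, text in results:
        if len(kept[content_type]) >= max_results:
            continue  # this content type has already reached its limit
        kept[content_type].append(text.split()[:max_words])
    return kept


def term_counts(window):
    """Count words inside the sampled window only (a stand-in statistic)."""
    counts = Counter()
    for documents in window.values():
        for words in documents:
            counts.update(word.lower() for word in words)
    return counts
```

Under this sketch, a term that appears only on the second page of a long monograph, or only in the 101st newspaper article, never reaches the counting step at all, however often it occurs in the full set of search results.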
As a final project for the Digital Praxis Seminar, I am thinking about writing a grant proposal for the planning stages of a project that would consider possible means of gathering and making available information about the practices of commercial database publishers. I would appreciate any thoughts or comments people have about this.
* The title is taken from “The Hermeneutics of Data and Historical Writing” by Fred Gibbs and Trevor Owens. “As it becomes easier and easier for historians to explore and play with data it becomes essential for us to reflect on how we should incorporate this as part of our research and writing practices. Is there a better way than to simply provide the raw data and an explanation of how to witness the same phenomenon? Is this the twenty-first century footnote?”