TermClusters | Digital Praxis Seminar Fall 2013

In Part One of this blog post, I wrote about scholars’ reliance on proprietary databases for research and the importance of understanding the constraints which database structures place on the outcomes of their efforts. Unfortunately, generally speaking, information about the structures of proprietary databases is not easily accessible. To remedy this, Caleb McDaniel has talked about the need to create an online resource to collate information about the construction of proprietary databases.

As an exploration of the structure of a proprietary database, I will look at one commercial database’s search and text analysis tools and touch on their handling of content. My goal is to demonstrate some of the complexity of these systems and to parse out the types of information that scholars would want to know and should consider sharing when writing up their research findings.

Artemis – Text mining lite

I recently attended a presentation about a commercial database company’s venture into what I call “text mining lite.” The company, Gale, has just started to offer text analysis and other tools that are aimed squarely at the field (or set of methods) of digital humanities. The tools are available through Artemis, an interface that allows searches across multiple collections of eighteenth-century (ECCO) and nineteenth-century (NCCO) primary sources. There is a separate Artemis platform for literary material with the same analytic features. By 2015, Gale humanities collections, running the gamut from 19th Century U.S. Newspapers to the Declassified Documents Reference System, will migrate into Artemis. Artemis is available CUNY-wide.

Parameters of search

To access Artemis’s textual analysis capabilities, the user first sets the parameters that select the materials to be analyzed. The options are extensive: date ranges, content type (e.g. manuscript, map, photograph), document type (e.g. manifesto, telegram, back matter), title, and source library. For example, one could search only letters from the Smith College archives or manuscripts from the Library of Congress in particular years.

Context

Discussing the use of Google’s Ngram to find themes in large bodies of texts, Matt Jockers advises caution: “When it comes to drawing semantic meaning from a word, we require more than a count of that word’s occurrence in the corpus. A word’s meaning is derived through context” (120). In his CUNY DHI and Digital Praxis Seminar lecture, David Mimno addressed the necessity of understanding the context of words in large corpora, saying, “We simply cannot trust that those words that we are counting mean what we think they mean. That’s the fundamental problem.”

One way that Artemis deals with this is by offering a view into the context of the documents in search results. For each result, clicking on “Keywords in Context” brings up a window showing the words surrounding the keyword in the actual (digital facsimile) document. This makes it relatively simple to identify if the document is actually relevant to your research, as long as the number of documents being examined is not too large.
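For readers who want to see what a keywords-in-context view amounts to, here is a minimal sketch in Python. It is not Gale’s implementation, which is not public; the function name, the sample sentence and the window size are all hypothetical, chosen only to illustrate the idea of showing the words on either side of each hit.

```python
import re

def keywords_in_context(text, keyword, window=5):
    """Return (left, match, right) for each occurrence of keyword.

    Hypothetical helper for illustration; `window` is the number of
    words shown on each side of the match.
    """
    words = text.split()
    target = keyword.lower()
    hits = []
    for i, word in enumerate(words):
        # Strip leading/trailing punctuation before comparing.
        if re.sub(r"^\W+|\W+$", "", word).lower() == target:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, word, right))
    return hits

# Made-up sample text, not drawn from any Gale collection.
sample = ("The committee sent a telegram to the governor. "
          "No telegram was ever answered.")
hits = keywords_in_context(sample, "telegram", window=3)
for left, match, right in hits:
    print(f"{left} [{match}] {right}")
```

Reading a handful of such lines is quick; reading thousands is not, which is why this kind of check only works on small result sets.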

Refining results

While the categories of search that Artemis allows are quite flexible, it is also possible to enter proximity operators to find co-located words. This means that, in many situations, it will be possible to further refine results through iterative searching to locate smaller batches of relevant documents on which to run the text analysis tools.
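The presentation did not cover the exact syntax of Artemis’s proximity operators, so the sketch below only illustrates the underlying idea: testing whether two words occur within a given number of words of each other. The function name and the sample sentence are hypothetical.

```python
import re

def within_n_words(text, first, second, n):
    """True if `first` and `second` occur within n words of each other.

    Illustrative only; a real database would use an index rather than
    scanning the text.
    """
    tokens = re.findall(r"\w+", text.lower())
    pos_first = [i for i, t in enumerate(tokens) if t == first.lower()]
    pos_second = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(a - b) <= n for a in pos_first for b in pos_second)

# Made-up sentence: "women" and "strike" are six words apart.
sentence = "The women of the mill organized a strike in Lowell"
near = within_n_words(sentence, "women", "strike", 6)
far = within_n_words(sentence, "women", "strike", 3)
print(near, far)
```

Widening or narrowing the distance changes which documents count as hits, which is one more search setting worth recording when writing up results.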

Ngram viewer

Artemis features a visualization tool that shows the frequency of terms over time and offers some improvements over Google’s Ngram. The term frequency graph is created from the search results. Clicking and dragging on the graph modifies the date range, and the graph can zoom to the one-year level. It is possible to retrieve a particular document by clicking on a point on the graph. The visualization also displays term popularity: the documents containing the term as a percentage of the total documents for each year. Because it is a percentage rather than a raw count, term popularity normalizes for the differing amounts of content available from year to year.
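The difference between raw frequency and term popularity is easy to show with a little arithmetic. The sketch below assumes that popularity is simply matching documents divided by all documents for a year, times 100; that is my reading of the description, not a formula published by Gale, and the counts are invented.

```python
def term_popularity(matches_by_year, totals_by_year):
    """Percent of each year's documents that contain the term.

    Assumed formula: 100 * matching documents / total documents.
    Years with no documents at all are skipped.
    """
    popularity = {}
    for year, total in totals_by_year.items():
        if total > 0:
            popularity[year] = 100.0 * matches_by_year.get(year, 0) / total
    return popularity

# Invented counts for illustration only.
matches = {1850: 12, 1851: 30}
totals = {1850: 400, 1851: 1500}
popularity = term_popularity(matches, totals)
print(popularity)
```

In this made-up case the raw count rises from 12 to 30 documents while popularity falls from 3 percent to 2 percent, because the collection holds far more material for the second year. A claim about a term becoming "more common" would depend on which of the two measures was used.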

Term clusters visualization

For larger sets of documents, or to look at entire collections, researchers might want to use term clusters. Term clusters use algorithms to group words and phrases that occur a statistically relevant number of times within the search results.

The visualization of term clusters is based on the first 100 words of the first 100 search results per content type. This means that the algorithm would run only within, for example, the first one hundred words of the first one hundred monographs, the first one hundred words of the first one hundred manuscripts, and the first one hundred words of the first one hundred newspaper articles. These size limits exist because the text analysis tools are bandwidth intensive. Searches of larger numbers of documents take longer to return results and also slow down the system for other users. By clicking on the clusters, it is possible to drill down into the search results to the level of individual documents and their metadata.
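To make the effect of that sampling rule concrete, here is a rough sketch of what "the first 100 words of the first 100 results per content type" selects from a ranked result list. It models only the sampling step as described in the presentation, not the clustering algorithm itself; the function name, the data layout and the toy result set are assumptions.

```python
from collections import defaultdict

def cluster_input_sample(results, max_docs=100, max_words=100):
    """Keep the first max_words words of the first max_docs results
    for each content type.

    `results` is a ranked list of (content_type, text) pairs; this
    layout is hypothetical.
    """
    sample = defaultdict(list)
    for content_type, text in results:
        if len(sample[content_type]) < max_docs:
            sample[content_type].append(" ".join(text.split()[:max_words]))
    return dict(sample)

# Toy result set: 150 long monographs and 20 short newspaper articles.
results = [("monograph", "word " * 500)] * 150 + [("newspaper", "word " * 40)] * 20
sample = cluster_input_sample(results)

words_seen = sum(len(t.split()) for texts in sample.values() for t in texts)
words_total = sum(len(text.split()) for _, text in results)
print(words_seen, "of", words_total, "words reach the clustering step")
```

In this toy case only 10,800 of 75,800 words would ever be looked at, and everything past the hundredth word of a long monograph is invisible to the tool.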

Legibility of documents

Scholars should understand the process by which database publishers have transformed documents into digital objects because it affects the accuracy of searches and text analysis. In Gale’s collections, printed materials are OCR’d. For nonprint materials, such as manuscripts, ephemera and photograph captions, the metadata of names, places and dates are entered by hand. By providing improved metadata for nonprint materials, Gale has increased the discoverability of these types of documents. This is particularly important for those studying women and marginalized groups whose records are more likely to be found in ephemeral materials.

Collection descriptions

Understanding the types of materials contained within a proprietary database can be difficult. The Eighteenth Century Collections Online (ECCO) is based on the English Short Title Catalogue from the British Library and is familiar to many scholars of the eighteenth century. The Nineteenth Century Collections Online (NCCO) is a newer grouping of collections that is being continually updated. To see a detailed description of the collections in NCCO, go to the NCCO standalone database, not the Artemis platform, and click Explore Collections.

Data for research

Generally, scholars can download PDFs of documents from Artemis only one document at a time (up to 50 pages per download). When I asked about access to large amounts of data for use by digital humanists, the Gale representative said that while their databases are not built to be looked at on a machine level (because of the aforementioned bandwidth issues), Gale is beginning to provide data separately to scholars. They have a pilot program to provide datasets to Davidson College and the British Library, among others. Gale is also looking into setting up a new capability to share data that would be based outside their current system. The impression that I got was that they would be receptive to scholars who are interested in obtaining large amounts of data for research.

Bonus tip: direct (public) link to documents

Even though it doesn’t have anything to do with standards for presenting scholarship, I thought people might want to know about this handy feature. Artemis users have the ability to bookmark search results and save the URL for future reference. The link to the document(s) can then be shared with anyone, even those without logins to the database. To be clear, anyone who clicks on the link is taken directly to the document(s), although they won’t be able to extend the search. This makes it easy to share documents with students and through social media.

In this post, I have sought to shed some light on the usually opaque construction of proprietary databases. If people start “playing” with Artemis’ text mining lite capabilities, I would be interested in hearing about their perceptions of its usefulness for research.

Works cited

Jockers, Matthew L. “Theme.” Macroanalysis: Digital Methods and Literary History. Urbana: University of Illinois Press. Print.

In Jefferson Bailey’s brilliant article on digital archives, he writes, “Digital objects will have an identifier, yes, but where they ‘rest’ in intellectual space is contingent, mutable. The key point is that, even at the level of representation, arrangement is dynamic . . . Arrangement, as we think of it, is no longer a process of imposing intellectualized hierarchies or physical relocation; instead, it becomes largely automated, algorithmic, and batch processed.”

Digital humanists have increasingly embraced text mining and other techniques of data manipulation both within bodies of texts that they control and in proprietary databases. When presenting their findings, they must also consider how to represent their methodology; to describe the construction of the databases and search mechanisms used in their work; and to make available the data itself.

Many people (Schmidt, Bauer, and Gibbs and Owens) have written about the responsibility of digital scholars to make their methods transparent and data publicly available as well as the need to understand how databases differ from one another (Norwood and Gregg).

Reading narratives of the research methods of David Mimno and Matt Jockers (as well as listening to Mimno’s recent lecture) has been useful for me in my ongoing thinking about the issues of how digital humanists use data and how they report on their findings. Mimno and Jockers are exemplars of transparency in their recitation of methods and in the provision of access to their datasets so that other scholars might be able to explore their work.

While not every digital humanist will use topic modeling to the extent that Mimno and Jockers do, it is fair to say that, in the future, almost all scholars will be using commercial databases to access documents and that this access will come with some version of text analysis. But what do search and text analysis mean in commercial databases? And how should they be described? In relation to keyword searching in proprietary databases, the historian Caleb McDaniel has pointed out that historians do not have codified practices for the use and citation of databases of primary materials. He says that to correctly evaluate proprietary databases scholars should know whether the databases are created by OCR, what the default search conventions are, whether the databases use fuzzy hits, when they are updated, and other details. At this time, much of the information about how commercial databases are constructed is hidden from their users. McDaniel recommends the creation of an “online repository” of information about commercial databases and also suggests that historians develop a stylesheet for database citation practices.

Why is this lack of information about the mechanisms of commercial databases important? Because, as Bailey says, the arrangement of digital objects in archives (and databases) is automated, algorithmic, and batch processed. Yet, as historian Ben Schmidt has noted, “database design constrains the ways historians can use digital sources” and proprietary databases “force” “syntax” on searches. Since database search results are contingent upon database structures, scholars who make claims related to the frequency of search terms must, at a minimum, understand those structures well enough to answer the methodological objections that might be raised against their conclusions.

I recently attended a presentation about a commercial database company’s venture into what I call “text mining lite.” What I learned has only bolstered my ideas about the importance of understanding the practices of proprietary database publishing and the necessity of scholars having access to that information. The company, Gale, one of the larger database publishers, seems to be courting the digital humanities community (or at least their idea of the digital humanities community). Gale is combining access to multiple databases of primary eighteenth and nineteenth century sources through an interface called Artemis which allows the creation of “term clusters.” These are clusters of words and phrases that occur a statistically relevant number of times within the user’s search results. One of the crucial things to know about the algorithms used is that Artemis term clusters are based on the first 100 words of the first 100 search results per content type. In practice, for search results that might include monographs, manuscripts and newspapers as types, this means that the algorithm runs only within the first one hundred words of the first one hundred monographs, the first one hundred words of the first one hundred manuscripts, and the first one hundred words of the first one hundred newspaper articles. [I will describe Artemis at more length in Part Two of this blog post.] Clearly, any conclusions drawn by scholars and others using term clusters in Artemis should include information about the construction of the database and the limitations of the search mechanisms and text analysis tools.

As a final project for the Digital Praxis Seminar, I am thinking about writing a grant proposal for the planning stages of a project that would consider possible means of gathering and making available information about the practices of commercial database publishers. I would appreciate any thoughts or comments people have about this.

* The title is taken from The Hermeneutics of Data and Historical Writing by Fred Gibbs and Trevor Owens. “As it becomes easier and easier for historians to explore and play with data it becomes essential for us to reflect on how we should incorporate this as part of our research and writing practices. Is there a better way than to simply provide the raw data and an explanation of how to witness the same phenomenon? Is this the twenty-first century footnote?”

Digital Praxis Seminar Fall 2013 – Spring 2014


The Twenty-First Century Footnote, Part Two

The Twenty-First Century Footnote*
