ngram | Digital Praxis Seminar Fall 2013

When Steve Brier pointed the class to yet another piece in the news about the “crisis in the humanities,” I joked out loud to a colleague about whether the headline was from today or from thirty years ago, because the humanities have a reputation for crisis that won’t quit. In 1980, Newsweek ran a story about the “sorry state” of the humanities, based on a Rockefeller Foundation report. This was perhaps more apt at that time because, according to historian Ben Schmidt, who has written a series of blog posts on the subject of humanities enrollments, “the real collapse of humanities enrollments happened in the 1970s.” Humanities enrollments have recovered and leveled off since that time. Working with data that he hand-transcribed from paper printouts, Schmidt argues convincingly that “long term results actually show that since 1950, only women have shown a major drop in the percentage of humanities majors.” And, “Before co-education, only about a tenth of pre-professional degrees went to women: after 1985, they were half. And since the whole puzzle is how women’s behavior changed, not how men’s majors changed, this tells you most of what you need to know.” Schmidt also refers to an Atlantic article showing that humanities majors have the same employment level as computer science majors.

Instead of repeating shibboleths about the crisis in humanities enrollments, journalists should examine the data.

Addendum: After I posted this, Ben Schmidt tweeted, “History majors are up 18% the last 25 years. Math and CS are down 40%. Can we put this media narrative to rest?” The tweet includes a link to his dynamic graph from 1986 to 2011 to view majors from all disciplines (group together your definition of humanities fields), by gender, and by institution. In Schmidt’s guest post at the Chronicle of Higher Education earlier this year, there is a link to a fun Google Books Ngram of the “crisis in the humanities.”

In Part One of this blog post, I wrote about scholars’ reliance on proprietary databases for research and the importance of understanding the constraints which database structures place on the outcomes of their efforts. Unfortunately, generally speaking, information about the structures of proprietary databases is not easily accessible. To remedy this, Caleb McDaniel has talked about the need to create an online resource to collate information about the construction of proprietary databases.

As an exploration of the structure of a proprietary database, I will look at one commercial database’s search and text analysis tools and touch on their handling of content. My goal is to demonstrate some of the complexity of these systems and to parse out the types of information that scholars would want to know and should consider sharing when writing up their research findings.

Artemis – Text mining lite

I recently attended a presentation about a commercial database company’s venture into what I call “text mining lite.” The company, Gale, has just started to offer text analysis and other tools that are squarely aimed at the field/set of methods of digital humanities. The tools are available through Artemis, an interface that allows searches across multiple collections of primary eighteenth century (ECCO) and nineteenth century sources (NCCO). There is a separate Artemis platform for literary material with the same analytic features. By 2015 Gale humanities collections running the gamut from the 19^th Century U.S. Newspapers to the Declassified Documents Reference System and many others will migrate into Artemis. Artemis is available CUNY-wide.

Parameters of search

To access Artemis’s textual analysis capabilities the user first determines the parameters of selection of the materials. The options are extensive: date ranges, content type (e.g. manuscript, map, photograph), document type (e.g. manifesto, telegram, back matter), title, and source library. For example, one could search only letters from the Smith College archives or manuscripts from the Library of Congress in particular years.

Context

Discussing the use of Google’s Ngram to find themes in large bodies of texts, Matt Jockers advises caution, “When it comes to drawing semantic meaning from a word, we require more than a count of that word’s occurrence in the corpus. A word’s meaning is derived through context” (120). In his CUNY DHI and Digital Praxis Seminar lecture, David Mimno addressed the necessity of understanding the context of words in large corpora saying, “We simply cannot trust that those words that we are counting mean what we think they mean. That’s the fundamental problem.”

One way that Artemis deals with this is by offering a view into the context of the documents in search results. For each result, clicking on “Keywords in Context” brings up a window showing the words surrounding the keyword in the actual (digital facsimile) document. This makes it relatively simple to identify if the document is actually relevant to your research, as long as the number of documents being examined is not too large.

Refining results

While the categories of search that Artemis allows are quite flexible, it is also possible to enter proximity operators to find co-located words. This means that, in many situations, it will be possible to further refine results through iterative searching to locate smaller batches of relevant documents on which to run the text analysis tools.

Ngram viewer

Artemis features a visualization tool that offers some improvements over Google’s Ngram to show frequency of terms over time. The term frequency ngram is created from the search results. Click and drag on the term frequency graph to modify the date range. The graph can zoom to the one-year level. It is possible to retrieve a particular document by clicking on the point on the graph. The visualization also displays term popularity, the percent of the total documents each year. Term popularity normalizes the number of documents based on the percentage of the content.

Term clusters visualization

For larger sets of documents, or to look at entire collections, researchers might want to use term clusters. Term clusters use algorithms to group words and phrases that occur a statistically relevant number of times within the search results.

The visualization of term clusters are based on the first 100 words of the first 100 search results per content type. This means that the algorithm would run only within, for example, the first one hundred words of the first one hundred monographs, the first one hundred words of the first one hundred manuscripts, and the first one hundred words of the first one hundred newspaper articles. The size limitations are because the text analysis tools are bandwidth intensive. Searches of larger numbers of documents take longer to return results and also slow down the system for other users. By clicking on the clusters, it is possible to drill down into the search results to the level of individual documents and their metadata.

Legibility of documents

Scholars should have an understanding of the process by which database publishers have transformed documents into digital objects because it affects the accuracy of searches and text analysis. In Gales’ collections, printed materials are OCR’d. For nonprint materials, such as manuscripts, ephemera and photograph captions, the metadata of names, places and dates are entered by hand. By providing improved metadata for nonprint materials, Gale has increased the discoverability of these types of documents. This is particularly important for those studying women and marginalized groups whose records are more likely to be found in ephemeral materials.

Collection descriptions

Understanding the types of materials contained within a proprietary database can be difficult. The Eighteenth Century Collections Online (ECCO) is based on the English Short Title Catalogue from the British Library and is familiar to many scholars of the eighteenth century. The Nineteenth Century Collections Online (NCCO) is a newer grouping of collections that is being continually updated. To see a detailed description of the collections in NCCO, go to the NCCO standalone database, not the Artemis platform, and click Explore Collections.

Data for research

Generally, scholars can download PDFs of documents from Artemis only one document at a time (up to 50 pages per download). When I asked about access to large amounts of data for use by digital humanists, the Gale representative said that while their databases are not built to be looked at on a machine level (because of the aforementioned bandwidth issues), Gale is beginning to provide data separately to scholars. They have a pilot program to provide datasets to Davidson College and the British Library, among others. Gale is also looking into setting up a new capability to share data that would be based outside their current system. The impression that I got was that they would be receptive to scholars who are interested in obtaining large amounts of data for research.

Bonus tip: direct (public) link to documents

Even though it doesn’t have anything to do with standards for presenting scholarship, I thought people might want to know about this handy feature. Artemis users have the ability to bookmark search results and save the URL for future reference. The link to the document(s) can then be shared with anyone, even those without logins to the database. To be clear, anyone that clicks on the link is taken directly to the document(s) although they won’t have the capability to extend the search. This makes it easy to share documents with students and through social media.

In this post, I have sought to shed some light on the usually opaque construction of proprietary databases. If people start “playing” with Artemis’ text mining lite capabilities, I would be interested in hearing about their perceptions of its usefulness for research.

Works cited

Jockers, Matthew L. “Theme.” Macroanalysis Digital Methods and Literary History. Urbana: University of Illinois Press. Print.

Digital Praxis Seminar Fall 2013 – Spring 2014

Tag Archives: ngram

The News about the Humanities

The Twenty-First Century Footnote, Part Two

Need help with the Commons?