DH Box: Tackling Project Scope

We have this great Digital Humanities project idea, but what happens between now and launch time?

With an idea like DH Box (a customized Linux OS with preinstalled DH Tools and the flexibility to run on a computer as cheap and portable as the Raspberry Pi), there are a number of directions we could take, and we will certainly consider them for further iterations of DH Box beyond the Spring term. (This blog currently documents the experiences of a project team enrolled in a graduate course in Digital Humanities Praxis at the Graduate Center, CUNY.)

In order to refine the scope of our tool, we asked ourselves some questions:

  • What approach will we take to educating users about coding and about the infrastructure behind DH Box: its software, hardware, and operating system?
  • Which DH Tools should we include? (See Alan Liu’s curated list for a sense of the range of DH tools out there.)
  • Which users are we building this for?

The success of our project hinges on our ability to carefully model the scope of the tool by shaping the answers to these questions . . . all by May 12th (public launch date)!

Educational Value

Beyond providing a collection of accessible DH Tools, we want DH Box to help bridge knowledge gaps by delivering a strong educational component. We’d like, for instance, for undergraduate English students to gain exposure to Digital Humanities inquiry and develop proficiency in it through the kind of guidance and practical experience DH Box will offer. To that end, we will begin an interactive textbook providing instruction on the specific tools included in this first iteration of DH Box. We are most inspired by Zed Shaw’s Learn Code the Hard Way interactive textbook series.


We are gearing this version of DH Box to bring Topic Modeling and Text Analysis to Humanities students!

We began by considering the most popular DH Tools out there and quickly realized it made a lot of sense to whittle the list down for this current project phase. We’ve made choices based on which software performs best on the Raspberry Pi. We also want to provide DH Tools that haven’t yet proliferated the way some of the more popular content management systems, such as WordPress, have.


Undergraduate Humanities students currently have little familiarity with terms like tokenization, sentiment analysis, etc., and with how these components of text analysis can open expansive modes of textual inquiry. As part of its mission, DH Box will work to make these methods accessible to a broad audience!
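To give a feel for what these terms mean in practice, here is a minimal sketch of tokenization and a crude sentiment score in Python. The function names and the tiny word lists are illustrative only; real tools use much larger lexicons and smarter tokenizers.

```python
import re
from collections import Counter

def tokenize(text):
    """Split raw text into lowercase word tokens (a deliberately naive approach)."""
    return re.findall(r"[a-z']+", text.lower())

# Toy sentiment lexicon; actual sentiment analyzers use far larger, weighted lists.
POSITIVE = {"bright", "love", "happy"}
NEGATIVE = {"dreary", "hate", "sad"}

def sentiment_score(tokens):
    """Count positive minus negative tokens: a crude polarity measure."""
    counts = Counter(tokens)
    return sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)

tokens = tokenize("It was a bright cold day in April, and the clocks were striking thirteen.")
print(tokens[:4])               # ['it', 'was', 'a', 'bright']
print(sentiment_score(tokens))  # 1
```

Even this toy version shows why tokenization matters: every downstream analysis, from word counts to topic models, operates on these tokens rather than on the raw text.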

Stay tuned for exciting updates on implementing the install scripts, using IPython Notebook, and more!


Questions? Comments? Tweet us!

David Mimno and fatty tuna

David Mimno made an important distinction about theory vs. practice when he pointed out that MALLET (or any DH tool) is a method, not a methodology.  MALLET can uncover thematic patterns in massive digital collections, but it is up to the researcher using the tool to evaluate the results, pose new questions, and think of possible new uses for the tool.  In our class discussion, Mimno compared different roles in topic modeling to Iron Chef:  he makes the knives (MALLET), librarians dump a lot of fatty tuna (the corpus of text) on the table, and the humanists are the chefs who need to make the meal (interpreting and drawing new conclusions from the results).
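For readers curious what “the knives” look like in use, a typical MALLET run from the command line takes two steps: importing a folder of plain-text files into MALLET’s binary format, then training a topic model. The paths, topic count, and output filenames below are examples; see the MALLET documentation for the full set of options.

```shell
# Import a directory of plain-text files, keeping word order
# and removing common English stopwords.
bin/mallet import-dir --input my-corpus/ --output corpus.mallet \
    --keep-sequence --remove-stopwords

# Train a 20-topic model; write the top words per topic and the
# per-document topic proportions to text files for inspection.
bin/mallet train-topics --input corpus.mallet --num-topics 20 \
    --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt
```

The `topic-keys.txt` output is where the humanist’s work begins: MALLET labels topics only by number, and deciding what a cluster of co-occurring words actually means is the chef’s job, not the knife’s.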

As a librarian, I have never thought of myself as a provider of fatty tuna, but I get the general point. What role do librarians and other alt-academics play in DH? Can a librarian be a tool maker, a chef, a sous-chef, a waitress, or something else entirely?  What does it mean to curate content and devise valuable ways to access that content?  Is it scholarship? I am not sure if I can answer that question, but I do see many new ways to apply MALLET as a search and discovery tool which would be very useful for scholarship.

Can we do better than keyword search to find relevant information in huge collections of digital text? Would search terms derived from the body of the text itself be more accurate than hand-coding with the dated and narrow Library of Congress subject headings? The DH literature on topic modeling doesn’t have much on libraries, but I did find the following. Yale, the University of Michigan, and UC Irvine received an Institute of Museum and Library Services grant to study Improving Search and Discovery of Digital Resources Using Topic Modeling. See also an interesting D-Lib Magazine article on using topic modeling in HathiTrust, A New Way to Find: Testing the Use of Clustering Topics in Digital Libraries.

The Twenty-First Century Footnote*

In Jefferson Bailey’s brilliant article on digital archives, he writes, “Digital objects will have an identifier, yes, but where they ‘rest’ in intellectual space is contingent, mutable. The key point is that, even at the level of representation, arrangement is dynamic . . . Arrangement, as we think of it, is no longer a process of imposing intellectualized hierarchies or physical relocation; instead, it becomes largely automated, algorithmic, and batch processed.”

Digital humanists have increasingly embraced text mining and other techniques of data manipulation both within bodies of texts that they control and in proprietary databases. When presenting their findings, they must also consider how to represent their methodology; to describe the construction of the databases and search mechanisms used in their work; and to make available the data itself.

Many people (Schmidt, Bauer, and Gibbs and Owens) have written about the responsibility of digital scholars to make their methods transparent and data publicly available as well as the need to understand how databases differ from one another (Norwood and Gregg).

Reading narratives of the research methods of David Mimno and Matt Jockers (as well as listening to Mimno’s recent lecture) has been useful for me in my ongoing thinking about the issues of how digital humanists use data and how they report on their findings. Mimno and Jockers are exemplars of transparency in their recitation of methods and in the provision of access to their datasets so that other scholars might be able to explore their work.

While every digital humanist may not use topic modeling to the extent that Mimno and Jockers do, it is fair to say that, in the future, almost all scholars will be using commercial databases to access documents and that that access will come with some version of text analysis. But what do search and text analysis mean in commercial databases? And how should they be described? In relation to keyword searching in proprietary databases, the historian Caleb McDaniel has pointed out that historians do not have codified practices for the use and citation of databases of primary materials. He argues that to evaluate proprietary databases properly, scholars should know whether the databases are created by OCR, what the default search conventions are, whether the databases use fuzzy hits, when they are updated, and other such details. At this time, much of the information about how commercial databases are constructed is occluded. McDaniel recommends the creation of an “online repository” of information about commercial databases and also suggests that historians develop a stylesheet for database citation practices.

Why is this lack of information about the mechanisms of commercial databases important? Because, as Bailey says, the arrangement of digital objects in archives (and databases) is automated, algorithmic, and batch processed. Yet, as the historian Ben Schmidt has noted, “database design constrains the ways historians can use digital sources,” and proprietary databases “force” syntax on searches. Since database search results are contingent upon database structures, scholars making claims about the frequency of search terms must, at a minimum, understand those structures in order to reckon with the methodological arguments that might be raised against their conclusions.

I recently attended a presentation about a commercial database company’s venture into what I call “text mining lite.” What I learned has only bolstered my ideas about the importance of understanding the practices of proprietary database publishing and the necessity of scholars having access to that information. The company, Gale, one of the larger database publishers, seems to be courting the digital humanities community (or at least their idea of the digital humanities community). Gale is combining access to multiple databases of primary eighteenth- and nineteenth-century sources through an interface called Artemis, which allows the creation of “term clusters.” These are clusters of words and phrases that occur a statistically relevant number of times within the user’s search results. One of the crucial things to know about the algorithms used is that Artemis term clusters are based on the first 100 words of the first 100 search results per content type. In practice, for search results that might include monographs, manuscripts, and newspapers as types, this means that the algorithm runs only within the first one hundred words of the first one hundred monographs, the first one hundred words of the first one hundred manuscripts, and the first one hundred words of the first one hundred newspaper articles. [I will describe Artemis at more length in Part Two of this blog post.] Clearly, any conclusions drawn by scholars and others using term clusters in Artemis should include information about the construction of the database and the limitations of the search mechanisms and text analysis tools.
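To make the consequences of that sampling rule concrete, here is a small Python sketch of the behavior described above. This is my own approximation of the described limitation, not Gale’s actual implementation; the function name and parameters are invented for illustration.

```python
from collections import Counter

def term_clusters(results_by_type, n_results=100, n_words=100, top_k=5):
    """Approximate the sampling described above: for each content type,
    look only at the first n_words words of the first n_results search
    results, then report the most frequent terms in that sample."""
    clusters = {}
    for content_type, results in results_by_type.items():
        sample = []
        for doc in results[:n_results]:           # first 100 results only
            sample.extend(doc.split()[:n_words])  # first 100 words of each
        clusters[content_type] = Counter(sample).most_common(top_k)
    return clusters

# Toy corpus: a term that appears only after the first 100 words of a long
# document never reaches the cluster, however frequent it is overall.
doc = "commerce " * 100 + "abolition " * 500
print(term_clusters({"monographs": [doc]}, top_k=1))
# {'monographs': [('commerce', 100)]}
```

The toy example shows the distortion at its starkest: “abolition” dominates the document five to one, yet the cluster reports only “commerce,” because the sampling window never reaches it. Any frequency claim built on such clusters inherits this bias.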

As a final project for the Digital Praxis Seminar, I am thinking about writing a grant proposal for the planning stages of a project that would consider possible means of gathering and making available information about the practices of commercial database publishers. I would appreciate any thoughts or comments people have about this.

* The title is taken from The Hermeneutics of Data and Historical Writing by Fred Gibbs and Trevor Owens. “As it becomes easier and easier for historians to explore and play with data it becomes essential for us to reflect on how we should incorporate this as part of our research and writing practices. Is there a better way than to simply provide the raw data and an explanation of how to witness the same phenomenon? Is this the twenty-first century footnote?”