Category Archives: Fall 2013

Posts done in Fall 2013

Wishful Processing

King’s “Word Processor of the Gods” was an amusing read.  It’s a bit dated (1983) but still captures what we all probably thought when our fingers first touched keyboards… that the word processor could be construed as a magic machine with the ability to delete or change our personal histories.

I suppose King, writing from the perspective of an author, was hypothesizing that the arrival of the word processor made text extremely pliable (movable type) and that this transformed the way people write and think (similar to Derrida’s pen vs. machine).  I think that with each communication method we adopt or adapt to the technology (be it a chisel, plume, chalk, pen, tablet), and these tools also change the ways in which we interact.

Doug Eyman and Collin Brooke discuss writing studies and DH

On October 8, CUNY DHI and the Graduate Center Composition and Rhetoric Community (GCCRC) hosted a conversation about the intersection of writing studies and digital humanities with Doug Eyman and Collin Brooke. These two innovative scholars shared in an important discussion concerning the future of digital rhetoric. Doug Eyman is a professor of digital rhetoric, technical and scientific communication, and professional writing at George Mason University and the senior editor of Kairos: A Journal of Rhetoric, Technology, and Pedagogy; Collin Brooke is a professor of Rhetoric and Writing at Syracuse University and is the author of Lingua Fracta: Towards a Rhetoric of New Media.

Theory As A Tool

When it comes to hacking and coding, one rolls up one’s sleeves to build models and prototypes that engage visually, open debate, and uncover new meanings.  Theory as applied in methodologies leads us away from the mundane and toward bold ways of assessing existing humanist issues that are embedded in abundance in big data through literature, history, and sociology.  The work of the digital humanist asserts that what are regarded as traditional narrative notions might gain new meaning or insight through further research and closer inspection.  The question “How does theory support the digital humanities?” is critical because theory compels consideration.

Drucker raises the notion of “creating computational protocols that are grounded in humanistic theory and methods” and suggests this “is essential if we are to assert the cultural authority of the humanities in a world whose fundamental medium is digital” (3).  The term “cultural authority” suggests epistemological knowledge that is central to creating new digital approaches to engage critical thinking.  These new digital approaches would assist in revisiting unresolved concerns, in observing thought processes to determine outcomes around current critical issues, and in creating models with the digital humanist’s toolbox to reflect these findings.  For instance, the digital humanist can explore myriad issues on the political or social worldwide human landscape and derive appropriately useful outcomes.  Prototypes then aid in assessing which digital tools best assist and inform this work.

Ramsay and Rockwell put forth the idea that “prototypes are theories” (4).  These prototypes aid in the ability to create, to do, and to build, yet the “guidelines for evaluation of digital work” (3) may restrict recognition of prototypes as scholarship. The argument can be made that such restriction could ultimately work against the investment of skill and time over the course of the digital humanist’s workflow.  As Drucker noted, “more is at stake than just the technical problems of projection” (7).  It is the potential of the prototype to assist workflow and to aid thoughtful response to humanist issues.  The efficient use of mechanisms to devise tools in the digital realm assists the user in multitasking and aids in the completion of data-rich and/or quantitative digital tasks.  Theory, then, is a tool that aids the work of the digital humanist to build and create.

19th Century Scholarship For the 21st Century

When David Mimno came to class to discuss topic modeling and MALLET, he first showed an image of the Perseus Digital Library, referring to it as ’19th century scholarship’. Now, Professor Mimno had a hand in the creation of that website, so I wouldn’t think he meant that as an insult. But he did go on to say that technology offers ‘more’ for the humanities than what the Perseus Project has done.

This made me wonder about the implicit criticism of ’19th century scholarship’ versus new computational humanities research. My understanding of the value of the humanities has everything to do with enrichment — that is, personal growth engendered by reading, understanding, and discussing the thoughts of other people exploring what it is to be human. Put another way: increasing wisdom through study. I accept that not everyone holds this view.

If we use MALLET to determine the difference in word use by male and female authors, we have certainly learned something about humanity. But it seems like a different project from the one I understand to be that of the humanities. Does the new, computational approach ‘engender personal growth’? I am ready to believe that it can, but not nearly as obviously as, say, studying Shakespeare’s Sonnets would. So far, the current approach seems to be more concerned with studying humans and human texts in a ‘scientific’, fact-oriented manner.

So that may be ’21st century humanities scholarship’, as opposed to that of the 19th century. But it needn’t be ‘either, or’. We can use Digital Humanities tools and methods to enrich the experience of students who are reading humanistic texts, much in the way done by the Perseus Digital Library, for instance. We can, as my colleague Gioia Stevens points out, use topic modeling to improve discovery of digital texts, which would unquestionably help in the individual pursuit of self-improvement.

The Twenty-First Century Footnote, Part Two

In Part One of this blog post, I wrote about scholars’ reliance on proprietary databases for research and the importance of understanding the constraints which database structures place on the outcomes of their efforts. Unfortunately, generally speaking, information about the structures of proprietary databases is not easily accessible. To remedy this, Caleb McDaniel has talked about the need to create an online resource to collate information about the construction of proprietary databases.

As an exploration of the structure of a proprietary database, I will look at one commercial database’s search and text analysis tools and touch on their handling of content. My goal is to demonstrate some of the complexity of these systems and to parse out the types of information that scholars would want to know and should consider sharing when writing up their research findings.

Artemis – Text mining lite

I recently attended a presentation about a commercial database company’s venture into what I call “text mining lite.” The company, Gale, has just started to offer text analysis and other tools that are squarely aimed at the field/set of methods of digital humanities. The tools are available through Artemis, an interface that allows searches across multiple collections of primary eighteenth-century (ECCO) and nineteenth-century (NCCO) sources. There is a separate Artemis platform for literary material with the same analytic features. By 2015, Gale humanities collections running the gamut from 19th Century U.S. Newspapers to the Declassified Documents Reference System, among many others, will migrate into Artemis. Artemis is available CUNY-wide.

Parameters of search

To access Artemis’s textual analysis capabilities the user first determines the parameters of selection of the materials. The options are extensive: date ranges, content type (e.g. manuscript, map, photograph), document type (e.g. manifesto, telegram, back matter), title, and source library. For example, one could search only letters from the Smith College archives or manuscripts from the Library of Congress in particular years.

Context

Discussing the use of Google’s Ngram to find themes in large bodies of texts, Matt Jockers advises caution, “When it comes to drawing semantic meaning from a word, we require more than a count of that word’s occurrence in the corpus. A word’s meaning is derived through context” (120). In his CUNY DHI and Digital Praxis Seminar lecture, David Mimno addressed the necessity of understanding the context of words in large corpora saying, “We simply cannot trust that those words that we are counting mean what we think they mean. That’s the fundamental problem.”

One way that Artemis deals with this is by offering a view into the context of the documents in search results. For each result, clicking on “Keywords in Context” brings up a window showing the words surrounding the keyword in the actual (digital facsimile) document. This makes it relatively simple to identify if the document is actually relevant to your research, as long as the number of documents being examined is not too large.
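A keywords-in-context (KWIC) view like the one described above can be sketched in a few lines. This is a hypothetical illustration of the general technique, not Gale’s actual implementation; the function name and window size are my own.

```python
# Minimal keywords-in-context sketch: for each occurrence of the keyword,
# return the surrounding words so a researcher can judge relevance.
def keywords_in_context(text, keyword, window=5):
    """Return snippets of `window` words on either side of each keyword hit."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        # Normalize case and strip trailing punctuation before comparing.
        if w.lower().strip('.,;:!?') == keyword.lower():
            start = max(0, i - window)
            hits.append(" ".join(words[start:i + window + 1]))
    return hits

doc = "The cotton trade declined while the wool trade grew in the north."
print(keywords_in_context(doc, "trade", window=2))
```

Even this toy version shows why the feature scales poorly to large result sets: a human still has to read each snippet to judge relevance.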

Refining results

While the categories of search that Artemis allows are quite flexible, it is also possible to enter proximity operators to find co-located words. This means that, in many situations, it will be possible to further refine results through iterative searching to locate smaller batches of relevant documents on which to run the text analysis tools.

Ngram viewer

Artemis features a visualization tool that offers some improvements over Google’s Ngram Viewer for showing the frequency of terms over time. The term frequency ngram is created from the search results. Clicking and dragging on the term frequency graph modifies the date range, and the graph can zoom down to the one-year level. It is also possible to retrieve a particular document by clicking on a point on the graph. In addition, the visualization displays term popularity: the percentage of the total documents each year that match the term, which normalizes the raw number of matching documents against the size of the collection for that year.
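The difference between raw frequency and normalized popularity matters because collections grow unevenly over time. A minimal sketch (the numbers and variable names are invented for illustration, not drawn from Artemis):

```python
# Raw hits vs. normalized "term popularity" (hits as a percentage of all
# documents in the collection that year). Figures are hypothetical.
hits_per_year  = {1840: 120,  1850: 300}     # documents matching the term
total_per_year = {1840: 2000, 1850: 10000}   # all documents that year

for year in sorted(hits_per_year):
    popularity = 100 * hits_per_year[year] / total_per_year[year]
    print(year, hits_per_year[year], f"{popularity:.1f}%")
```

Here the raw count rises from 120 to 300, yet popularity falls from 6.0% to 3.0%, because the collection itself grew fivefold: exactly the distortion normalization is meant to correct.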

Term clusters visualization

For larger sets of documents, or to look at entire collections, researchers might want to use term clusters. Term clusters use algorithms to group words and phrases that occur a statistically relevant number of times within the search results.

The visualization of term clusters is based on the first 100 words of the first 100 search results per content type. This means that the algorithm would run only within, for example, the first one hundred words of the first one hundred monographs, the first one hundred words of the first one hundred manuscripts, and the first one hundred words of the first one hundred newspaper articles. The size limitations exist because the text analysis tools are bandwidth intensive: searches of larger numbers of documents take longer to return results and also slow down the system for other users. By clicking on the clusters, it is possible to drill down into the search results to the level of individual documents and their metadata.
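To make the sampling limit concrete, here is a sketch of the truncation as I understand it from the presentation; the data structures and function are hypothetical, not Gale’s code:

```python
# Sketch of the sampling limit: only the first `max_words` words of the
# first `max_docs` results per content type feed the clustering algorithm.
def sample_for_clustering(results_by_type, max_docs=100, max_words=100):
    """Truncate search results to the slice the term-cluster algorithm sees."""
    sample = {}
    for content_type, docs in results_by_type.items():
        sample[content_type] = [
            " ".join(doc.split()[:max_words]) for doc in docs[:max_docs]
        ]
    return sample
```

The point for scholars is that a cluster built this way reflects a small, front-loaded slice of the results, not the full corpus, and any conclusions should say so.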

Legibility of documents

Scholars should have an understanding of the process by which database publishers have transformed documents into digital objects because it affects the accuracy of searches and text analysis. In Gale’s collections, printed materials are OCR’d. For nonprint materials, such as manuscripts, ephemera, and photograph captions, the metadata of names, places, and dates is entered by hand. By providing improved metadata for nonprint materials, Gale has increased the discoverability of these types of documents. This is particularly important for those studying women and marginalized groups, whose records are more likely to be found in ephemeral materials.

Collection descriptions

Understanding the types of materials contained within a proprietary database can be difficult. The Eighteenth Century Collections Online (ECCO) is based on the English Short Title Catalogue from the British Library and is familiar to many scholars of the eighteenth century. The Nineteenth Century Collections Online (NCCO) is a newer grouping of collections that is being continually updated. To see a detailed description of the collections in NCCO, go to the NCCO standalone database, not the Artemis platform, and click Explore Collections.

Data for research

Generally, scholars can download PDFs of documents from Artemis only one document at a time (up to 50 pages per download). When I asked about access to large amounts of data for use by digital humanists, the Gale representative said that while their databases are not built to be looked at on a machine level (because of the aforementioned bandwidth issues), Gale is beginning to provide data separately to scholars. They have a pilot program to provide datasets to Davidson College and the British Library, among others. Gale is also looking into setting up a new capability to share data that would be based outside their current system. The impression that I got was that they would be receptive to scholars who are interested in obtaining large amounts of data for research.

Bonus tip: direct (public) link to documents

Even though it doesn’t have anything to do with standards for presenting scholarship, I thought people might want to know about this handy feature. Artemis users have the ability to bookmark search results and save the URL for future reference. The link to the document(s) can then be shared with anyone, even those without logins to the database. To be clear, anyone that clicks on the link is taken directly to the document(s) although they won’t have the capability to extend the search. This makes it easy to share documents with students and through social media.

In this post, I have sought to shed some light on the usually opaque construction of proprietary databases. If people start “playing” with Artemis’ text mining lite capabilities, I would be interested in hearing about their perceptions of its usefulness for research.

Works cited

Jockers, Matthew L. “Theme.” Macroanalysis: Digital Methods and Literary History. Urbana: University of Illinois Press, 2013. Print.

Thing Theory and Interactivity

It is striking that discussions of theory in DH seem primarily focused on how DH projects themselves provide theory rather than actually theorizing about the nature of DH as an academic discipline. The latter ostensibly belongs more appropriately to a discussion on defining DH (as we have discussed in week 2), but I find it productive and relevant to discuss here. Looking first at what Ramsay and Rockwell refer to as “thing theory” then noting the importance of interactivity in digital scholarship, I will attempt to broadly approach these two issues—i.e., locating theory in DH and literally defining a theory of DH—to substantiate DH as a theoretical undertaking but more importantly to illustrate how DH differs from the traditional humanities.

Regarding the hack vs. yack debate, it seems clear that even the strongest proponents of methodology over theory would agree that there is no strict dichotomy between the two. As Natalia Cecire notes, “the two are not antithetical” (56). In fact, hack and yack share essential qualities, namely the overall goal of humanistic inquiry. The only apparent differences involve the tools and media utilized. But throughout history humans have used a variety of tools and media to externalize thought (from Paleolithic cave paintings to film and new media). Simply put, humanities scholarship has long suffered from the tyranny of oral and written discourse as its primary media. DH utilizes digital tools as its media to externalize thought and humanistic inquiry. The digital product itself possesses (or should possess) the essential qualities of a written piece of scholarship, i.e., theory (notably, theory with a lower case “t”).

Ramsay and Rockwell refer to this as thing theory: “Prototypes are theories, which is to say they already contain or somehow embody that type of discourse that is most valued—namely, the theoretical” (3), and later more pointedly claim, “To ask whether coding is a scholarly act is like asking whether writing is a scholarly act” (8). I would perhaps add that coding itself is a form of writing, just as, for instance, filmmaking or other media creation are forms of writing, insofar as a communicable textual entity is created. As Drucker notes, such forms of scholarship involve “an analysis of ways representational systems produce a spoken subject” (8).

Can a film not act as a form of scholarship? Interestingly, tenure-track faculty in film production departments (though not necessarily a humanities discipline) are assessed purely on their body of film work. And it seems equally valid for a traditional humanist to produce a provocative film in lieu of a formal essay. Additionally, an inherent rule in filmmaking (though often broken) involves concealing the process (the tools). Regarding digital scholarship, Patrick Murray-John notes, “A good user interface is designed specifically so that you don’t have to deal with the inner workings of the application” (76). This is becoming a bit of a digression, but I would at least like to pose the question: is this an important rule for DH?

Equally striking, however, is Gary Hall’s take in “There are No Digital Humanities.” Hall questions the computational turn in humanities as a movement, stressing the notion that it appears to be a reverse of Lyotard’s Postmodern Condition, allowing science and quantitative information to dominate the humanities. This is an important point that deserves deeper investigation, particularly as DH continually evolves.

Ben Schmidt’s thesis is particularly useful here: “The answer, I am convinced, is that we should have prior beliefs about the ways the world is structured, and only ever use digital methods to try to create works which let us watch those structures in operation” (61). The individual subject, the human, is key in interpreting even the most empirical humanistic inquiry. Furthermore, DH fundamentally advocates open-access and, more importantly, interactivity. The ability of the user (scholar or non-scholar) to experience a DH work and interpolate his/her experiences and thoughts seems to allow DH to evade a reversal of postmodernism. Whether via data visualization, topic models, or simply blogs and open-access texts, which allow peer review/critique and interactivity with the text, the foundation of DH as a discipline appears firmly rooted in subjective humanistic inquiry in a manner that is unique and potentially more effective than traditional scholarship.

In this sense, DH can and should innately contain both theory (generally speaking) as well as a theory of itself, i.e., promoting subjective interactivity with relatively objective knowledge.

David Mimno and fatty tuna

David Mimno made an important distinction about theory vs. practice when he pointed out that MALLET (or any DH tool) is a method, not a methodology.  MALLET can uncover thematic patterns in massive digital collections, but it is up to the researcher using the tool to evaluate the results, pose new questions, and think of possible new uses for the tool.  In our class discussion, Mimno compared different roles in topic modeling to Iron Chef:  he makes the knives (MALLET), librarians dump a lot of fatty tuna (the corpus of text) on the table, and the humanists are the chefs who need to make the meal (interpreting and drawing new conclusions from the results).

As a librarian, I have never thought of myself as a provider of fatty tuna, but I get the general point. What role do librarians and other alt-academics play in DH? Can a librarian be a tool maker, a chef, a sous-chef, a waitress, or something else entirely?  What does it mean to curate content and devise valuable ways to access that content?  Is it scholarship? I am not sure if I can answer that question, but I do see many new ways to apply MALLET as a search and discovery tool which would be very useful for scholarship.

Can we do better than keyword search to find relevant information in huge collections of digital text? Would search terms created from the body of the text itself be more accurate than hand-coding using the very dated and narrow Library of Congress subject headings? The DH literature on topic modeling doesn’t have much on libraries, but I did find the following: Yale, the University of Michigan, and UC Irvine received an Institute of Museum and Library Services grant to study Improving Search and Discovery of Digital Resources Using Topic Modeling. See also an interesting D-Lib Magazine article on using topic modeling in HathiTrust, A New Way to Find: Testing the Use of Clustering Topics in Digital Libraries.
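One naive way to picture topic modeling as a discovery tool: take the top words per topic (the kind of output MALLET’s train-topics step produces as “topic keys”) and tag each document with its best-matching topic so a collection can be browsed by theme. The topics, documents, and function below are invented for illustration; real topics would come from a trained model, not be hand-written.

```python
# Toy discovery index: assign each document to the topic whose top words
# overlap most with the document's vocabulary. Topics are hypothetical
# stand-ins for a topic model's "topic keys" output.
topics = {
    "seafaring":   {"ship", "sea", "voyage", "harbor"},
    "agriculture": {"farm", "crop", "harvest", "soil"},
}

def best_topic(document):
    """Pick the topic with the largest word overlap with the document."""
    words = set(document.lower().split())
    return max(topics, key=lambda t: len(topics[t] & words))

catalog = {}
for doc in ["The ship left the harbor at dawn",
            "The harvest failed and the soil was poor"]:
    catalog.setdefault(best_topic(doc), []).append(doc)
```

A real system would use the full topic-proportion vector per document rather than a single label, but even this sketch suggests how model-derived terms could supplement hand-assigned subject headings.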

NYPL Labs-Turning Physical Data into Digital Knowledge

Hello,

Last month I attended a Hacks/Hackers http://hackshackers.com event that the NYPL hosted. Here is a blog post from the NYPL Labs team discussing what was covered http://hackshackers.com/blog/2013/09/16/nypl-labs-turns-physical-data-digital-knowledge-hhnyc/  and there is a link to the slides that were used during the presentation: https://dl.dropboxusercontent.com/u/5949624/NYPL-Labs-9-10-13-HacksHackers.pdf

When we were asked to define Digital Humanities a few classes ago, my definition was a clunky explanation about accessibility to information.  I think the projects that the NYPL Labs team are working on exemplify what I understand the digital humanities to be.  Also, I think it’s revolutionizing the access to information that public libraries can provide…which is very exciting.

-Melanie Locay

Kirschenbaum’s “The Book-Writing Machine”

Warning: tangents ahead….

What I found interesting about Kirschenbaum’s article “The Book-Writing Machine” (aside from the window being removed and the weight of the computer) was the absurd number of coincidences surrounding Len Deighton’s novel Bomber and the MTST.  It seems like kismet that Len Deighton was told about IBM’s MTST and that he used it to write Bomber.  Was it happenstance that his assistant, Ellenor Handley, would be complaining to a typewriter technician, and that the technician was aware of the latest “machine” that could possibly aid her in writing, or rather rewriting?  It makes you wonder about what we lose when we rely solely on computer-mediated communication: here we see how ideas were shared face-to-face and a solution was produced.  Social media/collaboration back in the ’70s.

I read more on the MTST; apparently Jim Henson was commissioned by IBM to produce a PR film, “Paperwork Explosion,” extolling the benefits of the MTST. (And now completely off track… the man at the end of the film looks like the inspiration for Henson’s muppets Statler and Waldorf, of the stage-left balcony box.)

I also read that Bomber was the first novel to be written via word processing….

I did a quick search and found an article from 2007 (ancient) that states “In Japan, half of the top ten selling works of fiction in the first six months of 2007 were composed on mobile phones.”

any clues as to what could be next…..

The Twenty-First Century Footnote*

In Jefferson Bailey’s brilliant article on digital archives, he writes, “Digital objects will have an identifier, yes, but where they ‘rest’ in intellectual space is contingent, mutable. The key point is that, even at the level of representation, arrangement is dynamic . . . Arrangement, as we think of it, is no longer a process of imposing intellectualized hierarchies or physical relocation; instead, it becomes largely automated, algorithmic, and batch processed.”

Digital humanists have increasingly embraced text mining and other techniques of data manipulation both within bodies of texts that they control and in proprietary databases. When presenting their findings, they must also consider how to represent their methodology; to describe the construction of the databases and search mechanisms used in their work; and to make available the data itself.

Many people (Schmidt, Bauer, and Gibbs and Owens) have written about the responsibility of digital scholars to make their methods transparent and data publicly available as well as the need to understand how databases differ from one another (Norwood and Gregg).

Reading narratives of the research methods of David Mimno and Matt Jockers (as well as listening to Mimno’s recent lecture) has been useful for me in my ongoing thinking about the issues of how digital humanists use data and how they report on their findings. Mimno and Jockers are exemplars of transparency in their recitation of methods and in the provision of access to their datasets so that other scholars might be able to explore their work.

While every digital humanist may not use topic modeling to the extent that Mimno and Jockers do, it is fair to say that, in the future, almost all scholars will be using commercial databases to access documents and that that access will come with some version of  text analysis. But what do search and text analysis mean in commercial databases? And how should they be described? In relation to keyword searching in proprietary databases, the historian Caleb McDaniel has pointed out that historians do not have codified practices for the use and citation of databases of primary materials. He says that to correctly evaluate proprietary databases scholars should know whether the databases are created by OCR, what the default search conventions are, if the databases use fuzzy hits, when they are updated and other issues. At this time, much of the information about how they are constructed is occluded in commercial databases. McDaniel recommends the creation of an “online repository” of information about commercial databases and also suggests that historians develop a stylesheet for database citation practices.

Why is this lack of information about the mechanisms of commercial databases important? Because, as Bailey says, the arrangement of digital objects in archives (and databases) is automated, algorithmic, and batch processed. Yet, as historian Ben Schmidt has noted, “database design constrains the ways historians can use digital sources” and proprietary databases “force” syntax on searches. Since database search results are contingent upon database structures, if scholars are making claims related to the frequency of search terms, at a minimum they must understand those structures in order to reckon with the arguments that might be raised against their conclusions based on methodology.

I recently attended a presentation about a commercial database company’s venture into what I call “text mining lite.” What I learned has only bolstered my ideas about the importance of understanding the practices of proprietary database publishing and the necessity of scholars having access to that information. The company, Gale, one of the larger database publishers, seems to be courting the digital humanities community (or at least their idea of the digital humanities community). Gale is combining access to multiple databases of primary eighteenth and nineteenth century sources through an interface called Artemis which allows the creation of “term clusters.” These are clusters of words and phrases that occur a statistically relevant number of times within the user’s search results. One of the crucial things to know about the algorithms used is that Artemis term clusters are based on the first 100 words of the first 100 search results per content type. In practice, for search results that might include monographs, manuscripts and newspapers as types, this means that the algorithm runs only within the first one hundred words of the first one hundred monographs, the first one hundred words of the first one hundred manuscripts, and the first one hundred words of the first one hundred newspaper articles. [I will describe Artemis at more length in Part Two of this blog post.] Clearly, any conclusions drawn by scholars and others using term clusters in Artemis should include information about the construction of the database and the limitations of the search mechanisms and text analysis tools.

As a final project for the Digital Praxis Seminar, I am thinking about writing a grant proposal for the planning stages of a project that would consider possible means of gathering and making available information about the practices of commercial database publishers. I would appreciate any thoughts or comments people have about this.

* The title is taken from The Hermeneutics of Data and Historical Writing by Fred Gibbs and Trevor Owens. “As it becomes easier and easier for historians to explore and play with data it becomes essential for us to reflect on how we should incorporate this as part of our research and writing practices. Is there a better way than to simply provide the raw data and an explanation of how to witness the same phenomenon? Is this the twenty-first century footnote?”