NYPL Labs: Turning Physical Data into Digital Knowledge

Hello,

Last month I attended a Hacks/Hackers (http://hackshackers.com) event hosted by the NYPL. Here is a blog post from the NYPL Labs team discussing what was covered: http://hackshackers.com/blog/2013/09/16/nypl-labs-turns-physical-data-digital-knowledge-hhnyc/. There is also a link to the slides used during the presentation: https://dl.dropboxusercontent.com/u/5949624/NYPL-Labs-9-10-13-HacksHackers.pdf

When we were asked to define Digital Humanities a few classes ago, my definition was a clunky explanation about accessibility to information. I think the projects that the NYPL Labs team is working on exemplify what I understand the digital humanities to be. I also think they are revolutionizing the access to information that public libraries can provide, which is very exciting.

-Melanie Locay

Kirschenbaum’s “The Book-Writing Machine”

Warning: tangents ahead….

What I found interesting about Kirschenbaum’s article “The Book-Writing Machine” (aside from the detail that a window had to be removed to bring in the machine, given the sheer weight of the computer) was the remarkable number of coincidences connecting Len Deighton’s novel Bomber and IBM’s MTST. It seems like kismet that Deighton was told about the MTST and then used it to write Bomber. Was it happenstance that his assistant, Ellenor Handley, complained to a typewriter technician about retyping drafts, and that the technician happened to know of the latest “machine” that could aid her in writing, or rather rewriting? It makes you wonder what we lose when we rely solely on computer-mediated communication: here, ideas were shared face-to-face and a solution was produced. Social media/collaboration, back in the ’70s.

Reading more on the MTST, I learned that IBM commissioned Jim Henson to produce a PR film, “Paperwork Explosion,” extolling its benefits. (And now, completely off track: the man at the end of the film looks like the inspiration for Henson’s Muppets Statler and Waldorf, the hecklers in the stage-left balcony box.)

I also read that Deighton’s Bomber is considered the first novel written with a word processor.

I did a quick search and found an article from 2007 (ancient) that states “In Japan, half of the top ten selling works of fiction in the first six months of 2007 were composed on mobile phones.”

Any clues as to what could be next…

The Twenty-First Century Footnote*

In Jefferson Bailey’s brilliant article on digital archives, he writes, “Digital objects will have an identifier, yes, but where they ‘rest’ in intellectual space is contingent, mutable. The key point is that, even at the level of representation, arrangement is dynamic . . . Arrangement, as we think of it, is no longer a process of imposing intellectualized hierarchies or physical relocation; instead, it becomes largely automated, algorithmic, and batch processed.”

Digital humanists have increasingly embraced text mining and other techniques of data manipulation both within bodies of texts that they control and in proprietary databases. When presenting their findings, they must also consider how to represent their methodology; to describe the construction of the databases and search mechanisms used in their work; and to make available the data itself.

Many people (Schmidt, Bauer, and Gibbs and Owens) have written about the responsibility of digital scholars to make their methods transparent and data publicly available as well as the need to understand how databases differ from one another (Norwood and Gregg).

Reading narratives of the research methods of David Mimno and Matt Jockers (as well as listening to Mimno’s recent lecture) has been useful for me in my ongoing thinking about the issues of how digital humanists use data and how they report on their findings. Mimno and Jockers are exemplars of transparency in their recitation of methods and in the provision of access to their datasets so that other scholars might be able to explore their work.

While every digital humanist may not use topic modeling to the extent that Mimno and Jockers do, it is fair to say that, in the future, almost all scholars will use commercial databases to access documents and that that access will come with some version of text analysis. But what do search and text analysis mean in commercial databases? And how should they be described? In relation to keyword searching in proprietary databases, the historian Caleb McDaniel has pointed out that historians have no codified practices for the use and citation of databases of primary materials. He says that to properly evaluate proprietary databases, scholars should know whether the texts were produced by OCR, what the default search conventions are, whether the databases return fuzzy hits, when they are updated, and other issues. At this time, much of the information about how commercial databases are constructed is occluded. McDaniel recommends the creation of an “online repository” of information about commercial databases and also suggests that historians develop a stylesheet for database citation practices.
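
To make concrete why a detail like fuzzy matching matters, here is a toy sketch in Python. This is my own illustration, not the search logic of any actual vendor; the words and threshold are invented for the example. It shows how OCR errors make exact matching undercount, while fuzzy matching quietly redefines what counts as a “hit”:

    # Toy illustration (not any vendor's actual search code) of "fuzzy hits."
    from difflib import SequenceMatcher

    # Three OCR renderings of the same word; the scanner confused l/I and i/l.
    ocr_words = ["abolitionist", "aboIitionist", "abolitionlst"]

    def exact_hits(term, words):
        # Count only character-for-character matches.
        return sum(w == term for w in words)

    def fuzzy_hits(term, words, threshold=0.9):
        # Count words whose similarity ratio to the term clears the threshold.
        return sum(SequenceMatcher(None, term, w).ratio() >= threshold
                   for w in words)

    print(exact_hits("abolitionist", ocr_words))   # 1
    print(fuzzy_hits("abolitionist", ocr_words))   # 3

A scholar counting occurrences of a term could see a threefold difference depending on a setting the interface may never disclose, which is exactly the kind of information McDaniel wants documented.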

Why does this lack of information about the mechanisms of commercial databases matter? Because, as Bailey says, the arrangement of digital objects in archives (and databases) is automated, algorithmic, and batch processed. Yet, as historian Ben Schmidt has noted, “database design constrains the ways historians can use digital sources,” and proprietary databases “force” a particular “syntax” on searches. Since search results are contingent upon database structures, scholars making claims about the frequency of search terms must, at a minimum, understand those structures in order to answer the methodological objections that might be raised against their conclusions.

I recently attended a presentation about a commercial database company’s venture into what I call “text mining lite.” What I learned has only bolstered my ideas about the importance of understanding the practices of proprietary database publishing and the necessity of scholars having access to that information. The company, Gale, one of the larger database publishers, seems to be courting the digital humanities community (or at least its idea of the digital humanities community). Gale is combining access to multiple databases of primary eighteenth- and nineteenth-century sources through an interface called Artemis, which allows the creation of “term clusters.” These are clusters of words and phrases that occur a statistically relevant number of times within the user’s search results. One crucial thing to know about the algorithms used is that Artemis term clusters are based on the first 100 words of the first 100 search results per content type. In practice, for search results that might include monographs, manuscripts, and newspapers as types, this means that the algorithm runs only within the first one hundred words of the first one hundred monographs, the first one hundred words of the first one hundred manuscripts, and the first one hundred words of the first one hundred newspaper articles. [I will describe Artemis at more length in Part Two of this blog post.] Clearly, any conclusions drawn by scholars and others using term clusters in Artemis should include information about the construction of the database and the limitations of the search mechanisms and text analysis tools.
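
To make the limitation concrete, here is a minimal sketch in Python of how that per-type truncation constrains what the clustering step ever sees. This is my rendering of the behavior described above, not Gale’s code; the function and variable names are hypothetical:

    from collections import Counter

    # Constants reflecting the documented Artemis behavior.
    MAX_DOCS_PER_TYPE = 100   # only the first 100 results per content type
    MAX_WORDS_PER_DOC = 100   # only the first 100 words of each of those

    def truncated_term_counts(search_results):
        """search_results: (content_type, full_text) pairs in the order the
        database returned them. Returns term frequencies over the truncated
        corpus that the clustering step would actually see."""
        docs_seen = Counter()     # documents kept so far, per content type
        term_counts = Counter()   # term frequencies over the truncated corpus
        for content_type, text in search_results:
            if docs_seen[content_type] >= MAX_DOCS_PER_TYPE:
                continue  # result #101 onward never influences the clusters
            docs_seen[content_type] += 1
            words = text.lower().split()[:MAX_WORDS_PER_DOC]
            term_counts.update(words)
        return term_counts

Everything past the hundredth result in each content type, and past the hundredth word in each document, is invisible to the term clusters, which is why the construction of the corpus needs to be reported alongside any conclusions drawn from it.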

As a final project for the Digital Praxis Seminar, I am thinking about writing a grant proposal for the planning stages of a project that would consider possible means of gathering and making available information about the practices of commercial database publishers. I would appreciate any thoughts or comments people have about this.

* The title is taken from The Hermeneutics of Data and Historical Writing by Fred Gibbs and Trevor Owens. “As it becomes easier and easier for historians to explore and play with data it becomes essential for us to reflect on how we should incorporate this as part of our research and writing practices. Is there a better way than to simply provide the raw data and an explanation of how to witness the same phenomenon? Is this the twenty-first century footnote?”

Dialogues on Feminism and Technology

The experimental online class Dialogues on Feminism and Technology led by Anne Balsamo and Alexandra Juhasz has begun posting weekly video conversations. Sixteen colleges are participating, including CUNY’s Graduate Center and Macaulay Honors College, as well as thousands of learners outside of formal educational settings. The class is intended to be a model for open access pedagogy in a collaborative environment, in contrast to MOOC-style learning. For those without institutional logins, there are suggested readings to accompany the class.

As an example of the ideas about women and technology in popular culture that they are seeking to counter, class organizers point to a June 2012 New York Times article about Silicon Valley that opened with “Men invented the Internet.” Partly to address that type of thinking, students will participate in Storming Wikipedia, an exercise in writing and editing to include women and feminist scholarship in Wikipedia. HASTAC has a wiki page about the Wikistorming that took place earlier this year.

9/23/13 Cultural analytics: A guest lecture by Lev Manovich

Lev Manovich—a Computer Science professor and practitioner at the Grad Center who writes extensively on new media theory—delivered a guest lecture on visualization and its role in cultural analytics and computing on 9/23.

Basing his discussion on a range of visualization examples from the last decade or so, Lev highlighted how the rapid emergence of tools for collecting data and writing software has allowed artists, social scientists, and others to investigate and question:

  • the role of algorithms in determining how technology mediates our cultural and social experiences,
  • how to work with very large datasets to identify social and cultural patterns worth exploring,
  • the role of aesthetics and interpretation in data visualization projects,
  • and how visualization projects can put forth reusable tools and software for working with cultural artifacts.

He also discussed previous and future projects undertaken by his lab, which he developed at the University of California, San Diego, and which is now migrating to the CUNY Graduate Center.

Class discussion following the lecture highlighted the value of transparency in Lev’s work and processes—a value he affirmed has always defined his own publishing philosophy, even before he began writing software.

Another line of inquiry was based on how machines can be programmed to automatically “understand” content. A current challenge lies in developing computational methods that can make meaningful assessments of complex, contextualized objects. For instance, how do we train machines to go beyond simply recording strings of characters or groups of pixels (the kinds of data computers are fundamentally good at collecting), and instead write programs that have the potential to generate insights about types of sentences or faces? What is the role of visualization in meeting this challenge, and how is it different from other scientific methods, like applying statistics to big data?

Privacy and Recording Lectures

Like Sarah, I take for granted that public universities and classes promote information sharing. The articles and cases I’ve read are very blurry: according to New York recording law, recording in a public university may fall under the category of recording a public meeting, which means no consent is required.

This article raised some great points, focused on Livescribe recordings.

Here is the direct link to NY’s recording laws.


Defining DH

Initial:
The digital humanities is an academic community, its members united by interest in (and use of) digital tools to 1) redefine their research and analytical practices and/or 2) cultivate new forms of academic collaboration and dialogue. Its sustainability, emergent goals, and politics are reactive to wider industry developments and economic forces.

Secondary:
The Digital Humanities is an emerging discipline within the field of information science.

Reflection:
I have found it exciting to follow DH issues and studies as they continually emerge in networked spaces, shaped by a dynamic community that seems to self-identify and develop new ideas at a rapid pace. Discovering the latest questions and critiques within the discipline has piqued my interest in understanding the trajectories of those discussions. But while I think the world of DH introspection enriches the discipline and fosters its growth, it at times seems to veil the field’s natural growth and potential. Given the rapid adoption of digital tools, computational frameworks, and data mining in most areas of contemporary scholarship (and industry, government, etc.), I’m interested in why proving the value and relevance of digital methods within the humanities is a different process than it is in other realms.

As a result, I’d like to further explore how DH, as a practice, resonates with larger trends in digital practice, and how deeply interdisciplinary projects demonstrate the value of DH methodologies, perhaps in ways that transcend semantic qualms and blunt the force of an Analog vs. Digital duality.

But if DH is contingent on dualism (at least for the time being), perhaps alternative definitions of DH arise when one of its basic value propositions (that digital tools will deliver new value to the humanities) is inverted. For instance, there are computer and social scientists interested in questions of language, interpretation, expression, and philosophy. So how can longstanding lines of inquiry in the humanities bring new dimensionality to the methods those social and computational scientists use? I am hoping to arrive at a third definition of DH that fortifies the idea (in practice as well as theory) that the digital needs the humanities just as much as the humanities need the digital.

Defining DH

Before our class conversation, I defined Digital Humanities as the practice of accessing and exchanging ideas and scholarship through interconnected digital platforms. I understood the Digital Humanities as the field that merges scholarship from across the wide spectrum of academic disciplines with modern technological advances to realize ideas in a more effective and relevant context.

Following our discussion, I would redefine Digital Humanities as a practice of accessing, developing, communicating, and exchanging ideas and scholarship through digital mediums. I still hold that the field aims to fuse modern scholarship with modern technological capabilities in order to produce relevant, effective idea exchange. This growing field recognizes that in our digitizing world, the individual’s capacity for thought and creation is heightened. To advance our ideas and scholarship in this changing context, we must use digital mediums. The Digital Humanities guides this adaptation.