Author Archives: Marisa Plumb

Academic Databases: Beyond Digital Literacy

Basic digital literacy for scholarly research includes knowing how to access digital archives, search them, and interpret their results.

Another component of digital literacy is familiarity with the semiotics of the interface: knowing how to “read” the instructions and symbols that give the user an idea of what invisible material lives in a database. These portals make the contents accessible, and they also convey, before a search is even conducted, a range of search possibilities. The interface suggests something about the most useful metadata that the archive contains and the way the data can be accessed.

A user, then, can glean an understanding of the mechanics of the database through the interface alone. This additional level of digital literacy is helpful, but it still represents a limited understanding of databases. Many of the archives that humanities scholars, librarians, and historians commonly use are proprietary, and even with some information and educated guesses about these archives’ metadata structures, it’s difficult or impossible to go a step deeper and discern exactly how the search algorithms work and how the database is designed.

This is an issue of emerging importance for digital scholars, and is prompting historians and others to think about what appears in search results and what doesn’t. But even if researchers knew how every database and its search algorithms worked, that wouldn’t resolve all the issues and theoretical implications of digital research and scholarship. As Ben Schmidt has pointed out, “database design constrains the ways historians can use digital sources.”

The limits of database design are an important window into the computational disciplines that enable information science in the first place. Programming machines to search a hybrid of digitized source materials is of course a broad problem, involving a myriad of methods that are constantly evolving and becoming more powerful. Therefore, it’s interesting to ask: When are the issues associated with digital research contingent on computational science, and when are they contingent on the way that proprietary archives and databases choose to implement the latest algorithms?

An interesting consideration in addressing this question might start with a distinction that William J. Turkel makes between scholars who use subscription archives and those who write code to mine massive data sets themselves. The literary scholar Ted Underwood has also discussed searching academic databases and data mining in parallel, commenting, “I suspect that many humanists who think they don’t need ‘big data’ approaches are actually using those approaches every day when they run Google searches . . . Search is already a form of data mining. It’s just not a very rigorous form: it’s guaranteed only to produce confirmation of the theses you bring to it.”
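Underwood’s point can be made concrete with a minimal sketch (the corpus, documents, and terms below are invented for illustration, not drawn from any real archive): a keyword search returns only the documents that confirm the query a researcher already had in mind, while even a crude corpus-wide term count surfaces patterns nobody thought to ask about.

    from collections import Counter

    # Invented toy corpus standing in for an archive of digitized documents.
    corpus = {
        "doc1": "the telegraph transformed nineteenth century news distribution",
        "doc2": "railway timetables standardized time across the nineteenth century",
        "doc3": "the serialized novel reached a mass readership through newspapers",
    }

    def keyword_search(corpus, query):
        """Return only the documents containing the query term,
        i.e., evidence that confirms the thesis the searcher brought."""
        return [doc_id for doc_id, text in corpus.items() if query in text]

    def term_survey(corpus, top_n=5):
        """Count every word across the whole corpus, a very crude form of
        data mining that surveys the material without a prior thesis."""
        counts = Counter()
        for text in corpus.values():
            counts.update(text.split())
        return counts.most_common(top_n)

    print(keyword_search(corpus, "telegraph"))  # only the confirming document comes back
    print(term_survey(corpus))                  # patterns the searcher did not ask for

Neither function is more sophisticated than the other; the difference lies in whether a thesis is required before the machine returns anything at all.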

Thinking about the distinction between proprietary database engineers and dataset hackers might foster the assumption that those two parties have radically different agendas or methods for searching born-digital and digitized archive material. But while independent programmers represent a new frontier of sorts—scholars willing to learn the methods needed to do their own research and retrieve information from their own source material—they aren’t necessarily confronted by any fewer database design limitations than the engineers who work at Gale. This gets at the heart of what’s at stake for researchers in a digital age, and why this is an apt time to explore the way digital archives work, on a computational level.

Many automated, machine-driven search techniques are sets of instructions that don’t always produce predictable results and can be difficult to reverse engineer even when bugs are discovered. Corporate engineers don’t have full control over the results they get, and neither do hackers or the authors of open-source software.

Why is that important? One goal of Beyond Citation is to explore and provide information on how databases work, so that scholars can better understand their research results. One could argue that scholars require so-called “neutral” technology: systems that don’t favor any one type or set of results over another. And it’s easier to understand and confirm search neutrality if algorithms and source code are publicly available. But what exactly is such neutrality, and would we know it if we saw it? Any algorithm, secret or otherwise, is a product of disciplinary constraints and intersections, and reveals the boundaries of what’s computationally possible. In short, the “correctness” of any algorithm is hard to nail down.

When we look more closely at the concept of neutrality, we see that both the user and the engineer are implicated in algorithmic design choices. James Grimmelmann, a lawyer, has made a compelling argument that “search is inherently subjective: it always involves guessing the diverse and unknown intentions of users.” Code that’s written as a service to users is written with an interaction already in mind. Evaluating the nuances of search algorithms and determining the impact they have on the integrity of one’s research involves acknowledging these kinds of imagined dialogues.
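To see how those design choices get baked in, consider a hypothetical sketch (the documents, fields, and weights are invented): two ranking functions count the same term matches but encode different guesses about what users want, one privileging citation counts and the other recency, and they order the same two documents differently.

    # Hypothetical illustration: the weights encode an engineer's guesses
    # about what users want; neither design is more "neutral" than the other.
    docs = [
        {"id": "a", "text": "archive archive archive", "year": 1990, "citations": 0},
        {"id": "b", "text": "archive metadata search", "year": 2013, "citations": 300},
    ]

    def score(doc, query_terms, weights):
        """Toy relevance score combining term matches, recency, and citations."""
        term_matches = sum(doc["text"].count(t) for t in query_terms)
        return (weights["terms"] * term_matches
                + weights["recency"] * doc["year"]
                + weights["citations"] * doc["citations"])

    citation_weighted = {"terms": 1.0, "recency": 0.0, "citations": 0.01}
    recency_weighted = {"terms": 1.0, "recency": 0.005, "citations": 0.0}

    for weights in (citation_weighted, recency_weighted):
        ranked = sorted(docs, key=lambda d: score(d, ["archive"], weights), reverse=True)
        print([d["id"] for d in ranked])  # prints ['b', 'a'], then ['a', 'b']

Neither weighting is wrong; each imagines a different user, and that is exactly the kind of silent decision a researcher rarely gets to inspect.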

These are just some exploratory thoughts, as none of these questions about database design and search can be taken in isolation. Beyond Citation, then, is a starting point for going beyond digital literacy in multiple directions. We are gathering and presenting the kinds of knowledge that might allow scholars to distinguish between computational limitations, the limits of metadata and the ways it’s structured, and the agendas of a proprietary company. As the project evolves, we ourselves hope to deepen the kinds of skills and knowledge that allow us to present such information in the most meaningful and usable ways.

Beyond Citation: Building digital tools to explain digital tools

Over the last couple of weeks, the Beyond Citation team has transformed into a web production team of sorts, focused on making key decisions about platform, site architecture, user interaction, design, and communication.

Beyond Citation—a project to build a website that aggregates accessible, structured information about scholarly databases—has the potential to enhance how scholars approach, use, and interpret resources from some of today’s most widely used digital collections. While it would be straightforward for our team to simply gather and publish information about those resources, our challenge is to build a digital tool that supports meaningful interaction with that information, one that can also scale in the future and cater to a community of contributors.
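As a rough illustration of what such structured information might look like (the fields below are illustrative guesses, not Beyond Citation’s actual schema), each database the site covers could be modeled as a record with consistent, comparable attributes:

    from dataclasses import dataclass, field

    @dataclass
    class DatabaseRecord:
        """One structured entry about a scholarly database.
        These fields are hypothetical, for illustration only."""
        name: str
        publisher: str
        coverage: str                 # e.g., date range of the digitized material
        search_fields: list = field(default_factory=list)  # metadata users can query
        full_text_searchable: bool = False
        documentation_url: str = ""

    example = DatabaseRecord(
        name="Example Newspaper Archive",  # hypothetical entry
        publisher="Example Publisher",
        coverage="1800-1900",
        search_fields=["title", "author", "date", "keyword"],
        full_text_searchable=True,
    )
    print(example.name, example.search_fields)

Records with a shared shape are what make comparison across databases possible, whatever platform ultimately stores and displays them.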

In the project’s nascent stages, the tactical concerns before us are familiar—we’re taking on the common challenge of building and launching a website or web app. Thrust into the very practical realm of software decisions and constraints, we set aside discussions of critical theory to weigh the merits of WordPress and Drupal. These powerful tools place the project in a digital ecosystem much wider than academia. The platform we have chosen—WordPress—pushes us deeper still into the wide worlds of relational databases, server-side scripting, and content management—the digital tools that will allow us to explain other digital tools.

As we construct the basic building blocks for the site, we find that the best way to focus our approach is by seeking the advice of experts, reading blogs about WordPress customization, and learning more about MySQL and WordPress taxonomies. The robust open source community behind WordPress has enabled us to confirm that the technical requirements for the Beyond Citation website can be met many times over through combinations of WordPress plugins.

Something to consider while building this tool with WordPress is that we are seeking to publish data about proprietary tools by using open source technology. Perhaps this isn’t really so unusual—we see it in a similar vein to the increasingly popular APIs that allow for easier data aggregation or configuration from multiple sources. And toolsets that are hybrids of proprietary and open source systems are extremely common.

But there’s an important depth to explore when thinking about Beyond Citation as a bridge between proprietary and open source systems. The idea of “exposed” information, built on “hidden” information, represents a direction that the project can try to push technically. For instance, if in a future iteration the team can uncover information about scholarly databases that’s not just hard to find, but not openly available (such as how search algorithms work, or the criteria behind publisher contracts), then I think the value of Beyond Citation increases in a direction most closely aligned with its original ambition. This would also allow the project to explore the similarities and differences in how scholarly databases work in more meaningful ways.

Before we can do that, everyone on the team is doing their part to fill in knowledge gaps, and discovering “how technology works” on multiple levels. Just as we are researching the types of information about scholarly databases that we want the project to highlight, we are also researching the types of data-driven web frameworks that could easily support such information. Like many Digital Humanities projects, Beyond Citation is about knowledge acquisition and aggregation for both developers and researchers. We are challenging ourselves to learn as much as we can about one set of digital tools before we can communicate new information about other sets of digital tools—both of which are moving targets, evolving in their own realms of authorship.

As we work towards a May launch date for an early version of the site, we realize that the authors of digital projects need a constant appetite for more knowledge—technical knowledge and subject-matter knowledge—in order to create and maintain an authoritative tool.

Follow us on Twitter as we get ready for May: @beyondcitation

Sustained disruption

William Turkel’s presentation and workshop last week opened with the notion that those who engage in physical computing have the opportunity to “build objects that convey a humanistic argument.” This reminded me that DH scholarship isn’t constrained to data and digitization. While access to digital information and artifacts plays a huge role in the genesis and momentum of the digital humanities, working with data can simply be seen as working with knowledge in the most popular medium of the day. The systems we work within have multiple entry points, and many possible layers to manipulate. Beyond software (and the industry and implications of big data), physical computing and fabrication offer us an alternative way to formulate questions about interfaces, manufacturing, and the politics of innovation.

Whenever we work with computers and digital tools, we confront not only the black box of processes that we don’t fully understand, but also the scholar’s entanglements with the prescriptions and rules of consumer technology. But a physical computing project works on a more fundamental level of abstraction. While, as Turkel pointed out, there is always a proprietary (non-transparent) layer involved, a physical computing project does allow its maker to experiment with and change a different set of parameters and functionalities than software allows. It’s important that we have permission and the resources to take a hands-on approach to computing, because it can disrupt and deepen our relationship to the technologies that we’re ultimately accountable for when DH practice becomes critique.

But I got the feeling that Turkel wasn’t overly concerned with that kind of broad or absolute speculation. I’m interested in the fact that Turkel’s lecture and workshop didn’t necessarily move in the direction of solving what it means to work with hardware, sensors, or fabrication. His talk sidestepped making heavy-handed theoretical claims or predictive expansions on his opening thought, and instead moved into a discussion of his students’ individual projects. I’m not sure his lecture outline was a statement in itself, but it did seem that he refrained from making explicit claims about the need for or purpose of physical computing in answering DH objectives or critiques. Turkel seemed to be saying that while physical computing—as a medium for play and exploration—can represent ideas and embody cultural critique, the future of the humanities does not depend on our mastery or reinvention of microchips. Even though we are bound to make objects that convey arguments, Turkel’s pitch for the essentialness of making didn’t seem wholly contingent on the scientificness or theoretical stakes of our approach. Perhaps “making” outside one’s comfort zone is important in itself, and represents a commitment to the interdisciplinarity that we have presumably already embraced.

I’m interested in the simplicity and purity in the invitation to play and fail, and wonder about this as a sustainable framework for understanding new tools and asking new questions. It’s interesting that Turkel made a few references to his love of kindergarten, and the values we lose after we leave that early classroom environment—a place of beginnings, an environment that’s ideal for experimentation.

In a sense, humanities scholars who decide to delve into hardware hacking, software, and interface design are also engaged in a kind of beginning or frontier. This last week, I’ve found myself asking questions about the relationship of beginnings to experimentation and play, and wonder how Turkel’s hands-on imperative will evolve as the contexts, spaces, and available expertise for making technology grow and change.

10/14/13: Matthew Kirschenbaum, “The Literary History of Word Processing”


Matthew Kirschenbaum spoke about his forthcoming book project, which was recently profiled in The New York Times.

Kirschenbaum’s research asks questions such as: When did writers begin using word processors? Who were the early adopters? How did the technology change their relationship to their craft? Was the computer just a better typewriter—faster, easier to use—or was it something more? And what will be the fate of today’s “manuscripts,” which take the form of electronic files in folders on hard drives, instead of papers in hard copy? The talk, drawn from the speaker’s forthcoming book on the subject, provided some answers and also addressed questions related to the challenges of conducting research at the intersection of literary and technological history.

Matthew G. Kirschenbaum is Associate Professor in the Department of English at the University of Maryland and Associate Director of the Maryland Institute for Technology in the Humanities (MITH, an applied think tank for the digital humanities).

Doug Eyman and Collin Brooke discuss writing studies and DH

On October 8, CUNY DHI and the Graduate Center Composition and Rhetoric Community (GCCRC) hosted a conversation about the intersection of writing studies and digital humanities with Doug Eyman and Collin Brooke. These two innovative scholars shared in an important discussion concerning the future of digital rhetoric. Doug Eyman is a professor of digital rhetoric, technical and scientific communication, and professional writing at George Mason University and the senior editor of Kairos: A Journal of Rhetoric, Technology, and Pedagogy; Collin Brooke is a professor of Rhetoric and Writing at Syracuse University and the author of Lingua Fracta: Toward a Rhetoric of New Media.

9/23/13 Cultural analytics: A guest lecture by Lev Manovich

Lev Manovich—a Computer Science professor and practitioner at the Grad Center who writes extensively on new media theory—delivered a guest lecture on visualization and its role in cultural analytics and computing on 9/23.

Basing his discussion on a range of visualization examples from the last decade or so, Lev highlighted how the rapid emergence of tools for collecting data and writing software has allowed artists, social scientists, and others to investigate and question:

  • the role of algorithms in determining how technology mediates our cultural and social experiences,
  • how to work with very large datasets to identify social and cultural patterns worth exploring,
  • the role of aesthetics and interpretation in data visualization projects,
  • and how visualization projects can put forth reusable tools and software for working with cultural artifacts.

He also discussed previous and future projects undertaken by his lab, which he developed at the University of California, San Diego, and which is now migrating to the CUNY Graduate Center.

Class discussion following the lecture highlighted the value of transparency in Lev’s work and processes—a value he affirmed has always defined his own publishing philosophy, even before he began writing software.

Another line of inquiry was based on how machines can be programmed to automatically “understand” content. A current challenge lies in developing computational methods that can make meaningful assessments of complex, contextualized objects. For instance, how do we train machines to go beyond simply recording strings of characters or groups of pixels (the kinds of data computers are fundamentally good at collecting), and instead write programs that have the potential to generate insights about types of sentences or faces? What is the role of visualization in meeting this challenge, and how is it different from other scientific methods, like applying statistics to big data?

Defining DH

Initial:
The digital humanities is an academic community, its members united by interest in (and use of) digital tools to 1) redefine their research and analytical practices and/or 2) cultivate new forms of academic collaboration and dialogue. Its sustainability, emergent goals, and politics are reactive to wider industry developments and economic forces.

Secondary:
The Digital Humanities is an emerging discipline within the field of information science.

Reflection:
I have found it exciting to follow DH issues and studies that are continually emerging in networked spaces, shaped by a dynamic community that seems to self-identify and develop new ideas at a rapid pace. Discovering the latest questions and critique within the discipline itself has piqued my interest in understanding the trajectories of those discussions. But while I think the world of DH introspection enriches the discipline and fosters its growth, it at times seems a veil over the natural growth and potential of the field. Given the rapid adoption of digital tools, computational frameworks, and data mining in most areas of contemporary scholarship (and industry, government, etc.), I’m interested in why proving the value and relevance of digital methods within the humanities is a different process than it is in other realms.

As a result, I’d like to further explore how DH, as a practice, resonates with larger trends in digital practices, and how deeply interdisciplinary projects manifest the value of DH methodologies — perhaps in ways that transcend semantic qualms and curb the agency of an Analog vs. Digital duality.

But if DH is contingent on dualism (at least for the time being), perhaps alternative definitions of DH arise when one of its basic value propositions (that digital tools will deliver new value to the humanities) is inverted. For instance, there are computer and social scientists interested in questions of language, interpretation, expression, and philosophy… so how can longstanding lines of inquiry in the humanities bring new dimensionality to the methods that those social and computational scientists use? I am hoping to come to a third definition of DH that fortifies the idea (in practice as well as theory) that the digital needs the humanities just as the humanities need the digital.