Category Archives: DH Student Projects 2014

Announcing the launch of Beyond Citation

The Beyond Citation project team is thrilled to announce the public launch of our website at BeyondCitation.org. Even though scholars use academic databases every day, it is difficult to find information about how the databases work and what is in them. Beyond Citation gathers information about academic databases in one place to enable traditional humanities scholars and digital humanists to get a better sense of the content and searching mechanisms in databases. The goal of Beyond Citation is to make academic databases more transparent to users and to encourage critical thinking about academic databases.

The audience for the site is scholars, librarians, research enthusiasts or anyone who:

Uses academic databases and wants to learn more about what is in them
Is frustrated with academic databases and wants tips about how to more effectively search them
Wants to share their knowledge or experiences of academic databases with others

We invite you to participate in Beyond Citation by:

Starting or adding to a thread in the Community Forum
Proposing an article or blog post that you would like to write
Offering to write an entire entry for a new database

Please visit BeyondCitation.org. Follow us on Twitter @beyondcitation


DH Box Takes Off

Cross-posted from the DH Box Blog: https://dhbox.commons.gc.cuny.edu/blog/2014/dh-box-takes-off

This is it: DH Box is officially launching. The Digital GC is presenting an evening of short talks from various CUNY Graduate Center digital initiatives today, May 12 — starting off with DH Box.

I wanted to take a moment to reflect on where DH Box started and how far we’ve come. We introduced our project in early February:

What is DH Box?

Not much, so far. But we intend it to be a portable, customized linux environment for Digital Humanities learners that can rely on incredibly inexpensive technology. All you really need is a computer that runs Linux (and a monitor and keyboard, of course!) — but the platform that excites us most is the Raspberry Pi, a tiny computer that sells for just $35. Imagine a collection of DH tools, pre-installed and configured, and a set of texts for users to interrogate — all on a portable and inexpensive device.

That’s a quote from our first blog post, and it illustrates the most drastic change to our project. DH Box’s founder, Stephen Zweibel, had originally envisioned DH Box as a set of scripts that, when run, installed common DH applications (think Omeka, MALLET, NLTK) onto the user’s system; additionally, DH Box could be shipped with its suite of tools pre-installed on the light and portable Raspberry Pi computer.

As DH Box developed, it shifted platforms: it moved away from dealing with the idiosyncrasies of each individual’s system and toward hosting instances of a virtual computer that any user could launch.

This was a vast and visible shift. But many other project elements, though they changed less drastically, also developed on the journey from DH Box’s inception to its official launch.


[Cross-posted] The conundrum of public creation

In “Welcome to Travelogue,” the first blog post for our Travelogue: Mapping Literary History project, our great Project Manager Sarah talked about the excitement the group felt at embarking on this project and our eagerness to learn new things and to create a great digital project. She was speaking the truth; we are all excited about working on this project.

For me, as the web site developer, the first thing I had the opportunity to learn was WordPress. The idea was that I would create a meta-blog site and the whole group would use the site to blog and post about the process we were all going through to create our project, “Travelogue – Mapping Literary History”. Creating this meta-blog site would give me a place to learn and play with WordPress so that when I had to create the official web site for our actual public project, I’d be comfortable and familiar with the CMS.

In her post Sarah also referenced a post I had written for our Fall 2013 Digital Praxis seminar, where I talked about not being afraid to fail. I wrote about not worrying about failing and how the process of learning and trying new things was itself a success, whether the project failed or not. I must admit that while that may sound good, in reality it is hard to live that philosophy. I was afraid to fail. I was afraid to create a site that fell short, and doing it in public, no less, is not easy. It is not easy working and creating “in public” (a phrase our professor Matt Gold likes to use). It is not easy to talk about your worries and concerns in public. In my work life I’ve worked in places where you don’t show the process to the public, just the results. You know, you don’t want to see sausage being made; you just want to eat the sausage. I had to keep reminding myself that part of this class and project was actually doing a good portion of our work in public and letting the public see what we were doing and the difficulties we were having, along with our successes. Stay tuned for my next post, where I will write about some of my failures and successes in creating these two sites and what I’ve learned so far working on this group project.

DH Box Development and Testing

We’ve made big strides developing the front end interface to launch a new DH Box, and the Welcome page/menu that acts as the DH Box ‘home base’. We received extremely helpful feedback from some generous volunteer user experience testers at City Tech, and valuable advice from Chris Stein, Director of User Experience for the CUNY Academic Commons.

The results of our first round of user experience testing gave our team some great insights, and a fresh perspective on the project. We learned that perhaps one of our biggest challenges is effectively conveying the concept of the project in a readily digestible way.

We discovered that users can easily get the impression that DH Box is essentially a website, when in fact it’s much more than that (it’s a computer!). It’s understandable that this virtual computer could be mistaken for a website, since DH Box’s primary navigation happens through your web browser. A distinct IP address is assigned to each DH Box instance at the time of launch. DH Box users navigate to applications (MALLET, Omeka, etc.) through specific ports designated for each tool. The “port” is just a unique numeric identifier appended to the end of your DH Box IP address. This same system of addresses is the basis of the internet; there’s an IP address behind every website.
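To make the address-plus-port idea concrete, here is a minimal Python sketch of how a browser address for each tool could be put together. The port numbers and the IP address are placeholders chosen for illustration (they are not DH Box’s actual assignments), and `tool_url` is a hypothetical helper name.

```python
# Hypothetical port assignments for illustration only; the real DH Box
# port numbers are not stated in this post.
TOOL_PORTS = {"omeka": 8080, "mallet": 8081}


def tool_url(ip_address: str, tool: str) -> str:
    """Build the browser address for one tool on one DH Box instance."""
    port = TOOL_PORTS[tool.lower()]
    # The port is appended to the IP address after a colon.
    return f"http://{ip_address}:{port}/"


# 203.0.113.10 is an address reserved for documentation examples.
print(tool_url("203.0.113.10", "Omeka"))
```

Every tool on the same DH Box shares the IP address; only the number after the colon changes.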

We as a team are now reexamining how to explain the system of navigation, along with all of the fantastic stuff a virtual computer can offer so that users will be ready to push DH Box to the limit.

[Cross-Posting] On Successful #DH Project Management

Project management is difficult. As one of my teammates said to me point-blank: “I would not want your job.”

As our team began to work on Travelogue, I assumed that my brief stint organizing the development of two separate websites in various professional settings would help me. But while a background in marketing has allowed me to think more critically about things like publicity, nothing really prepared me for managing people my own age in a setting where we do not receive salaries for our work.

And while I have been extremely lucky to work with a group of brilliant people who are invested in helping me complete the project, it has been tricky figuring out how to tell people what (and how much) to do; everyone has full lives outside of school.

In a work setting, orders would come down from my boss, who had little idea of the actual steps we needed to take in order to complete a website. The details of these orders were laid out for me by advanced IT and design departments, each of whom had their own ideas about how the website should look and behave. In this project, where I am the “boss,” things were more difficult, especially because while all of us have great ideas, the actual means to execution can be unclear. But just because you only have a basic understanding of web design, it does not mean that you can’t build something (mostly) from scratch. You just need a good plan.

Websites and website redesigns can (and do) take years to complete, but for this project, we only have about four months. In the course of this semester thus far, I’ve found that a few things are essential to completing a project successfully. Some seem obvious, but when you are trying to keep a bunch of different wheels spinning, simple things can be easy to forget.

(Of course, this is not a complete list.)

Know Your Deliverables

What are the major tasks that need to be completed in order to produce a final project? In the course of a semester, what needs to be completed from week-to-week in order to get things done? Setting some key deadlines, and being able to adjust them, will help the project move forward. I made a simple project plan in an Excel document that was arranged by week, with a new goal for each Monday. From there, I doubled back and talked to my group members about what needed to be completed for each goal. I am indebted to Micki Kaufman for major assistance here, as well as to Tom Scheinfeldt’s lecture last semester.

Use Your Support Network

There are experts at your school who can help you. As it goes with everything, being afraid to ask for help can (and will) diminish your success.

Know Your Team’s Strengths (and Weaknesses)

Project management involves a good deal of emotional intelligence. Knowing where your group members are coming from, and being aware of and sensitive to what they can and can’t accomplish in a given time frame, will provide for a better outcome. It kind of goes without saying that actively listening to your group members’ concerns and ideas will make them more invested in your goals.

Be Flexible

This goes for building extra time into your project plan, as well as being open to adjusting your vision and/or timeline. It can be hard to let go of original ideas, but if they aren’t working, it’s important to recognize that and let go. In the case of Travelogue, our project scope changed slightly from what I originally proposed when we learned more about our platform. Padding your plan with extra time also protects you if you hit roadblocks or an unexpected learning curve.

Relax (a Little Bit)

In working on a major project with a tight deadline, not only is it important to manage your expectations, but it is also important not to put too much pressure on your group. My personality defaults to surface-level relaxation that can be misinterpreted as lackadaisical, when usually (like anyone else) I’m managing a huge amount of internal stress. I try not to micromanage my team as a result of my internal freakouts, which would make anyone stressed-out and disengaged. At the same time, being too lax about deadlines says: “I don’t really care.” If you don’t care, neither will they.

We are currently buzzing around our computers to get this thing done, with constant revision of the plan to keep things in motion.

Visit: https://travelogue.commons.gc.cuny.edu

And here is a link to the project plan for anyone who’s interested: https://docs.google.com/spreadsheet/ccc?key=0As13_khVZTLXdHBMV2NlNWwtTndiRTZsUk1QQTVWYnc&usp=sharing

Academic Databases: Beyond Digital Literacy

Basic digital literacy for scholarly research includes knowing how to access digital archives, search them, and interpret their results.

Another component of digital literacy is familiarity with the semiotics of the interface: knowing how to “read” the instructions and symbols that give the user an idea of what invisible material lives in a database. These portals make the contents accessible, and also convey, before a search is even conducted, a range of search possibilities. The interface suggests something about the most useful metadata that the archive contains and the way the data can be accessed.

A user, then, can glean understanding about the mechanics of the database through the interface alone. This additional level of digital literacy is helpful, but still represents a limited understanding of databases. Many of the archives that humanities scholars, librarians, and historians commonly use are proprietary, and even with some information and educated guesses about these archives’ metadata structures, it’s difficult or impossible to go a step deeper and discern exactly how the search algorithms work and how the database is designed.

This is an issue of emerging importance for digital scholars, and is prompting historians and others to think about what appears in search results and what doesn’t. But even if researchers knew how every database and its search algorithms worked, that wouldn’t resolve all the issues and theoretical implications of digital research and scholarship. As Ben Schmidt has pointed out, “database design constrains the ways historians can use digital sources.”

The limits of database design are an important window into the computational disciplines that enable information science in the first place. Programming machines to search a hybrid of digitized source materials is of course a wide problem, involving a myriad of methods that are constantly evolving and becoming more powerful. Therefore, it’s interesting to ask: When are the issues associated with digital research contingent on computational science and when are they contingent on the way that proprietary archives and databases choose to implement the latest algorithms?

An interesting consideration in addressing this question might start with a distinction that William J. Turkel makes between scholars who use subscription archives and those who write code to mine massive data sets themselves. The literary scholar Ted Underwood has also discussed searching academic databases and data mining in parallel, commenting, “I suspect that many humanists who think they don’t need “big data” approaches are actually using those approaches every day when they run Google searches . . . Search is already a form of data mining. It’s just not a very rigorous form: it’s guaranteed only to produce confirmation of the theses you bring to it.”

Thinking about the distinction between proprietary database engineers and dataset hackers might foster the assumption that those two parties have radically different agendas or methods for searching born-digital and digitized archive material. But while independent programmers represent a new frontier of sorts (scholars willing to learn the methods needed to do their own research and retrieve information from their own source material), they aren’t necessarily confronted by any fewer database design limitations than the engineers who work at Gale. This gets at the heart of what’s at stake for researchers in a digital age, and why this is an apt time to explore the way digital archives work on a computational level.

Many automated, machine-driven search techniques are a set of instructions that don’t always produce predictable results, and can be difficult to reverse engineer even when bugs are discovered. Corporate engineers don’t have full control over the results they get, and neither do hackers or the authors of open-source software.

Why is that important? One goal of Beyond Citation is to explore and provide information on how databases work, so that scholars can better understand their research results. One could argue that scholars require so-called “neutral” technology: systems that don’t favor any one type or set of results over another. And it’s easier to understand and confirm search neutrality if algorithms and source code are publicly available. But exactly what is such neutrality, and would we know it if we saw it? Any algorithm, secret or otherwise, is a product of disciplinary constraints and intersections, and reveals the boundaries of what’s computationally possible. In short, the “correctness” of any algorithm is hard to nail down.

When we look more closely at the concept of neutrality, we see that both the user and the engineer are implicated in algorithmic design choices. James Grimmelmann, a lawyer, has made a compelling argument that “Search is inherently subjective: it always involves guessing the diverse and unknown intentions of users.” Code that’s written as a service to users is written with an interaction already in mind. Evaluating the nuances of search algorithms and determining the impact they make on the integrity of one’s research involves acknowledging these kinds of imagined dialogues.
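A toy Python sketch can make the point about design choices tangible. It is not any vendor’s algorithm; the three records and both ranking rules are invented for illustration. Each rule is a reasonable guess about what a searcher wants, and each puts a different article first.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    year: int
    text: str


def term_count(article: Article, term: str) -> int:
    """Count whole-word occurrences of the term in the article text."""
    return article.text.lower().split().count(term.lower())


def rank_by_term_count(articles, term):
    """Guess: the searcher wants the article that uses the term most."""
    hits = [a for a in articles if term_count(a, term) > 0]
    return sorted(hits, key=lambda a: term_count(a, term), reverse=True)


def rank_by_recency(articles, term):
    """Guess: the searcher wants the newest article that uses the term."""
    hits = [a for a in articles if term_count(a, term) > 0]
    return sorted(hits, key=lambda a: a.year, reverse=True)


# Invented records, purely for illustration.
ARTICLES = [
    Article("Older survey", 1995, "archive archive archive methods"),
    Article("Recent note", 2012, "archive policy"),
    Article("Mid-period study", 2004, "archive archive history"),
]

print([a.title for a in rank_by_term_count(ARTICLES, "archive")])
print([a.title for a in rank_by_recency(ARTICLES, "archive")])
```

Neither ordering is the “neutral” one; each encodes an assumption about the user, which is the kind of imagined dialogue described above.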

These are just some exploratory thoughts, as none of these questions about database design and search can be taken in isolation. Beyond Citation, then, is a starting point for going beyond digital literacy in multiple directions. We are gathering and presenting the kinds of knowledge that might allow scholars to distinguish between computational limitations, the limits of metadata and the ways it’s structured, and the agendas of a proprietary company. As the project evolves, we ourselves hope to deepen the kinds of skills and knowledge that allow us to present such information in the most meaningful and usable ways.

Communicating Technical Process

With alpha work on DH Box wrapping up, it’s a good moment to reflect on some technical lessons learned, as well as some lessons about being on the technical side of a team. Up to this point, while I have been keeping my team apprised in general of DH Box’s technical situation as it progressed, most of the details of its implementation, as well as the specific tools I’ve used and their justifications, pros/cons, and possible alternatives, I have kept to myself.

This is, in part, due to the fact that I did not begin with a particular plan. Though we had a well-defined goal for DH Box, I knew that there were myriad ways to reach it. So I experimented with different methods of cloud deployment and server provisioning, that is, different ways of creating each new instance of DH Box and automatically installing all of the necessary software on it.

I started with a Bash script designed to run on the first boot of each new DH Box instance. This worked well enough, but didn’t offer much in the way of sophisticated automation or transparency for debugging. I then tried some of the more well-known server deployment/provisioning tools, like Puppet and Salt. Puppet I found less straightforward than I’d hoped, partially because it requires modules to be written in a homespun variety of Ruby, which I’m not super comfortable with. Salt did more of what I wanted, but I was still reading its documentation when I became distracted by yet another tool, Ansible.

Ansible turned out to be just what I needed: It is written in Python, a language I have more familiarity with, and it allows me to monitor each deployment of a new DH Box in real time. Using Ansible, I’ve been able to create a whole automation workflow in one language, and, even better, I can easily see if and at exactly which point a deployment fails. This is crucial to efficient problem solving and future updates for DH Box, as its installation process necessarily involves many separate moving parts.
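Why step-by-step visibility matters can be shown with a small Python stand-in. This is not Ansible and not DH Box’s actual deployment code; the step names and the simulated failure are made up, and `run_steps` is a hypothetical helper. It only illustrates the idea of running named installation steps in order and reporting exactly where things stop.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step]) -> Optional[str]:
    """Run steps in order; return the first failing step's name, or None."""
    for name, action in steps:
        print(f"running: {name}")
        try:
            action()
        except Exception as error:
            # Stop at the first failure so the broken step is unambiguous.
            print(f"FAILED at '{name}': {error}")
            return name
        print(f"ok: {name}")
    return None


def install_ok() -> None:
    """Placeholder for an installation step that succeeds."""


def install_broken() -> None:
    """Placeholder for an installation step that fails."""
    raise RuntimeError("package not found")


failed = run_steps([
    ("install Omeka", install_ok),
    ("install MALLET", install_broken),
    ("install NLTK", install_ok),
])
```

In this made-up run the second step fails, so the third never starts and the report names the step to debug.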

With these details of DH Box’s technical framework determined, it’s possible to create a more concrete “blueprint”, and I’m now working with our project planner, Gioia, to incorporate much more specific technical milestones into our overall plan. Going forward, I hope to keep everyone up to date and communicate some of what I learn along the way, without getting us too bogged down in technical minutiae.

Collaborative Opportunities

The Travelogue team has been exploring how other sites are using maps as digital pedagogical tools. We are also connecting with possible collaborators, including other mapping projects, educational institutions and libraries.

In an effort to participate in the conversations happening on social network platforms, Travelogue has been monitoring how Twitter is being used by similar projects. We have explored hashtags that refer to maps or concern literature, teaching, English, history, social studies, high school teachers, lesson plans, etc. We have also been following the conversations/posts on the Humanities, Arts, Science and Technology Alliance and Collaboratory (HASTAC) site.

On the development front, we are playing with several WordPress child themes to see which will work best for the Travelogue site and the Esri StoryMap we will be using. Research-wise, we have completed a workable draft of the Ernest Hemingway content spreadsheet, which we will use to construct Travelogue’s Ernest Hemingway StoryMap.

The Travelogue Commons site has a categorized Research section that features helpful resources compiled over the course of the Travelogue project, such as Esri Storymaps for Education.

Thank you for following our journey. We look forward to sharing our connections with others in the GIS world.

If you want to contact us please do. Our project blog is at travelogue.commons.gc.cuny.edu. Email us at dhtravelogue [at] gmail [dot] com or follow us on Twitter @DhTravelogue

DH Box considers deployment options

Cross-posted from the DH Box Blog: https://dhbox.commons.gc.cuny.edu/blog/2014/deployment-options-dh-box

Once DH Box knew the platform it would adopt, it was simply a matter of figuring out the best way to utilize that platform. But was it so simple?

What the DH Box team has been tackling this week is striking a balance: providing a robust tool that is useful for the intended audience but whose maintenance is not unmanageable for its administrators.

To recap — the platform chosen for delivering the DH Box environment, ready with DH tools installed, is a web server image provided through Amazon’s AMI (Amazon Machine Image) appliance. This will deliver, in essence, an identical copy of a tool-laden operating system to any user’s system.

Choosing this platform offered important benefits — for example, freedom from having to address issues caused by tools being installed to users’ personal systems. However, it also introduced tension: to deploy images hosted by Amazon, one needs to use an Amazon account. Would we have users create their own Amazon Web Services (AWS) accounts that require credit card information (though launching the Image is a free service) or would we maintain an account that instances would be launched from and figure out how the DH Box team would handle potential related charges?

Many questions entered into this equation: Would our intended users be open to providing credit card information? Who might this alienate? Or, if we managed the AWS account with many instances running, would we incur charges we’re not prepared to deal with? What would be the time-period allotted to users for running the instances?
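The charge and time-limit questions come down to simple arithmetic, sketched below in Python. The hourly rate, instance count, and budget are placeholders, not actual AWS prices or DH Box figures, and both function names are hypothetical.

```python
def estimated_charge(instances: int, hours_each: float, hourly_rate: float) -> float:
    """Total charge if every instance runs for hours_each at hourly_rate."""
    return instances * hours_each * hourly_rate


def max_hours_per_instance(budget: float, instances: int, hourly_rate: float) -> float:
    """Longest equal time allotment per instance that stays within budget."""
    return budget / (instances * hourly_rate)


# Placeholder figures for illustration only.
HOURLY_RATE = 0.25   # assumed cost per instance-hour, not a real AWS price
INSTANCES = 30       # assumed number of simultaneous users

print(estimated_charge(INSTANCES, hours_each=10, hourly_rate=HOURLY_RATE))
print(max_hours_per_instance(budget=150, instances=INSTANCES, hourly_rate=HOURLY_RATE))
```

The first function answers “what would we owe?” for a given usage pattern; the second turns a fixed budget into a time limit per user.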

DH Box has had to think through how different deployment options (e.g. requiring users to have their own AWS accounts) might affect how DH Box will be adopted by intended users. And this tension, between providing a service that is maintainable and sustainable and one that is useful to the intended audience, is something any project like DH Box might face.

Thinking About Authority and Academic Databases

Beyond Citation hopes to encourage critical thinking by scholars about academic databases. But what do we mean by critical thinking? Media culture scholar Wendy Hui Kyong Chun has defined critique as “not attacking what you think is false, but thinking through the limitations and possibilities of what you think is true.”

One question that the Beyond Citation team is considering is the scholarly authority of a database. Yale University Library addresses the question of scholarly authority in a handout entitled the “Web vs. Library Databases,” a guide for undergraduates. The online PDF states that information on the web is “seldom regulated, which means the authority is often in doubt.” By contrast, “authority and trustworthiness are virtually guaranteed” to the user of library databases.

Let’s leave aside for the moment the question of whether scholars should always prefer the “regulated” information of databases to the unruly data found on the Internet. While Yale Library may simply be using shorthand to explain academic databases to undergraduates, to the extent that they are equating databases and trustworthiness, I think they may be ceding authority to databases too readily and missing some of the complexity of the current digital information landscape.

Yale Library cites Academic Search and Lexis-Nexis as examples of databases. Lexis-Nexis is a compendium of news articles, broadcast transcripts, press releases, law cases, as well as Internet miscellany. Lexis-Nexis is probably authoritative in the sense that one can be comfortable that the items accessed are the actual articles obtained directly from publishers and thus contain the complete texts of articles (with images removed). In that limited sense, items in Lexis-Nexis are certainly more reliable than results obtained from a web search. (This isn’t true, though, for media historians who want to see the entire page with pictures and advertisements included. For that, try the web or another newspaper database.) Despite its relatively long pedigree for an electronic database, careful scrutiny of results is just as crucial when doing a search in Lexis-Nexis as it is for an Internet search.

In some instances, especially when seeking information about non-mainstream topics, searching the Internet may be a better option. Composition and rhetoric scholar Janine Solberg has written about her experience of research in digital environments, in particular how full-text searches on Amazon, Google Books, the Internet Archive and HathiTrust enabled her to locate information that she was unable to find in conventional library catalogs. She says, “Web-based searching allowed me not only to thicken my rhetorical scene more quickly but also to rapidly test and refine questions and hypotheses.” In the same article, Solberg calls for “more explicit reflection and discipline-specific conversation around the uses and shaping effects of these [digital] technologies” and recommends as a method “sharing and circulating research narratives that make the processes of historical research visible to a wider audience . . . with particular attention to the mediating role of technologies.”

Adding to the challenge of thinking critically about academic databases is their dynamic nature. The terrain of library databases is changing as more libraries adopt proprietary “discovery” systems that search across the entire set of databases to which libraries subscribe. For example, the number of JSTOR users has dropped “as much as 50%” with installations of discovery systems and changes in Google’s algorithms. Shifts in discovery have led to pointed discussions between associations of librarians and database publishers about the lack of transparency of search mechanisms. In 2012, Tim Collins, the president of EBSCO, a major database and discovery system vendor, found it necessary to address the question of whether vendors of discovery systems favor their own content in searches, denying that they do. There is, however, no way for anyone outside the companies to verify his statement because the vendors will not reveal their search algorithms.

While understanding the ranking of search results in academic databases is an open question, a recent study comparing research in databases, Google Scholar and library discovery systems by Asher et al. found that “students imbued the search tools themselves with a great deal of authority,” often by relying on the brand name of the database. More than 90% of students in the study never went past the first page of search results. As the study notes, “students are de facto outsourcing much of the evaluation process to the search algorithm itself.”

In addition, lest one imagine that scholars are immune to an uncritical perspective on digital sources, in his study of the citation of newspaper databases in Canadian dissertations, historian Ian Milligan says that scholars have adopted the use of these databases without achieving a concomitant perspective on their shortcomings. Echoing the Asher et al. study of undergraduate students, Milligan says, “Researchers cite what they find online.”

If critique is, as Chun says, thinking through the limitations and possibilities of what we think is true, then perhaps by encouraging reflective conversations among scholars about how these ubiquitous digital tools shape research and the production of knowledge, Beyond Citation’s efforts will be another step toward that critique.

We are at blog.beyondcitation.org. Email us at BeyondCitation [at] gmail [dot] com or follow us on Twitter @beyondcitation as we get ready for the launch in May.

Digital Praxis Seminar Fall 2013 – Spring 2014