Author Archives: Eileen Clancy

Announcing the launch of Beyond Citation

The Beyond Citation project team is thrilled to announce the public launch of our website at BeyondCitation.org. Even though scholars use academic databases every day, it is difficult to find information about how the databases work and what is in them. Beyond Citation gathers information about academic databases in one place to enable traditional humanities scholars and digital humanists to get a better sense of the content and searching mechanisms in databases. The goal of Beyond Citation is to make academic databases more transparent to users and to encourage critical thinking about academic databases.

The audience for the site includes scholars, librarians, research enthusiasts, and anyone who:

  • Uses academic databases and wants to learn more about what is in them
  • Is frustrated with academic databases and wants tips on how to search them more effectively
  • Wants to share their knowledge or experiences of academic databases with others

We invite you to participate in Beyond Citation by:

  • Starting or adding to a thread in the Community Forum
  • Proposing an article or blog post that you would like to write
  • Offering to write an entire entry for a new database

Please visit BeyondCitation.org. Follow us on Twitter @beyondcitation.

Thinking About Authority and Academic Databases

Beyond Citation hopes to encourage critical thinking by scholars about academic databases. But what do we mean by critical thinking? Media culture scholar Wendy Hui Kyong Chun has defined critique as “not attacking what you think is false, but thinking through the limitations and possibilities of what you think is true.”

One question that the Beyond Citation team is considering is the scholarly authority of a database. Yale University Library addresses the question of scholarly authority in “Web vs. Library Databases,” a handout for undergraduates. The online PDF states that information on the web is “seldom regulated, which means the authority is often in doubt.” By contrast, “authority and trustworthiness are virtually guaranteed” to the user of library databases.

Let’s leave aside for the moment the question of whether scholars should always prefer the “regulated” information of databases to the unruly data found on the Internet. While Yale Library may simply be using shorthand to explain academic databases to undergraduates, to the extent that they are equating databases and trustworthiness, I think they may be ceding authority to databases too readily and missing some of the complexity of the current digital information landscape.

Yale Library cites Academic Search and Lexis-Nexis as examples of databases. Lexis-Nexis is a compendium of news articles, broadcast transcripts, press releases, law cases, and Internet miscellany. Lexis-Nexis is probably authoritative in the sense that one can be confident that the items accessed are the actual articles obtained directly from publishers and thus contain the complete texts of articles (with images removed). In that limited sense, items in Lexis-Nexis are certainly more reliable than results obtained from a web search. (This isn’t true for media historians who want to see the entire page with pictures and advertisements included; for that, try the web or another newspaper database.) Despite its relatively long pedigree for an electronic database, careful scrutiny of results is just as crucial when searching Lexis-Nexis as it is for an Internet search.

In some instances, especially when seeking information about non-mainstream topics, searching the Internet may be a better option. Composition and rhetoric scholar Janine Solberg has written about her experience of research in digital environments, in particular how full-text searches on Amazon, Google Books, the Internet Archive and HathiTrust enabled her to locate information that she was unable to find in conventional library catalogs. She says, “Web-based searching allowed me not only to thicken my rhetorical scene more quickly but also to rapidly test and refine questions and hypotheses.” In the same article, Solberg calls for “more explicit reflection and discipline-specific conversation around the uses and shaping effects of these [digital] technologies” and recommends as a method “sharing and circulating research narratives that make the processes of historical research visible to a wider audience . . . with particular attention to the mediating role of technologies.”

Adding to the challenge of thinking critically about academic databases is their dynamic nature. The terrain of library databases is changing as more libraries adopt proprietary “discovery” systems that search across the entire set of databases to which libraries subscribe. For example, the number of JSTOR users has dropped “as much as 50%” with installations of discovery systems and changes in Google’s algorithms. Shifts in discovery have led to pointed discussions between associations of librarians and database publishers about the lack of transparency of search mechanisms. In 2012, Tim Collins, the president of EBSCO, a major database and discovery system vendor, found it necessary to address the question of whether vendors of discovery systems favor their own content in searches, denying that they do. There is, however, no way for anyone outside the companies to verify his statement because the vendors will not reveal their search algorithms.

While the ranking of search results in academic databases remains an open question, a recent study comparing research in databases, Google Scholar, and library discovery systems by Asher et al. found that “students imbued the search tools themselves with a great deal of authority,” often by relying on the brand name of the database. More than 90% of students in the study never went past the first page of search results. As the study notes, “students are de facto outsourcing much of the evaluation process to the search algorithm itself.”

In addition, lest one imagine that scholars are immune to an uncritical perspective on digital sources, historian Ian Milligan, in his study of the citation of newspaper databases in Canadian dissertations, finds that scholars have adopted these databases without a concomitant awareness of their shortcomings. Echoing the Asher et al. study of undergraduate students, Milligan observes, “Researchers cite what they find online.”

If critique is, as Chun says, thinking through the limitations and possibilities of what we think is true, then perhaps by encouraging reflective conversations among scholars about how these ubiquitous digital tools shape research and the production of knowledge, Beyond Citation’s efforts will be another step toward that critique.

We are at blog.beyondcitation.org. Email us at BeyondCitation [at] gmail [dot] com or follow us on Twitter @beyondcitation as we get ready for the launch in May.

Beyond Citation: Critical thinking about academic databases

During the Fall 2013 semester, I started reading, thinking, and writing about the impact of academic databases such as JSTOR and Gale: Artemis Primary Sources on research and scholarship. I learned that databases shape the questions that can be asked and the arguments that can be made by scholars through search interfaces, algorithms, and the items that are contained in or absent from their collections. Although algorithms in databases have been found to have an “epistemological power” through their ranking of search results, understanding why certain search results appear is very difficult even for the team that engineered the algorithms. Yet knowledge of how databases work is extremely limited because information about database structures is scanty, unavailable, or constantly changing.

Despite the ubiquity of databases, academics are often unaware of the constraints that databases place on their research. Lack of information about the impact of database structures and content on research is an obstacle to scholarly inquiry because it means that scholars may not be aware of and cannot account for how databases affect their interpretations of search results or text analysis.

Digital humanists have examined both the benefits and perils of research in academic databases. The introduction of digital tools for text analysis, which identify patterns across large numbers of documents, has added to the complexity of scholars’ tasks. Historian Jo Guldi writes that “Keyword searching [in databases] . . . allows the historian to propose longer questions, bigger questions;” yet she also remarks on the challenges posed by search in an earlier article: “Each digital database has constraints that render historiographical interventions based upon scholars’ queries initially suspect.” Scholars such as Caleb McDaniel, Miriam Posner, James Mussell, Bob Nicholson, and Ian Milligan have written about the skewed search results of databases of historical newspapers, the impossibility of finding provenance information to contextualize what database users are seeing, and the lack of information about OCR accuracy. Beyond these issues, scholars should also have an understanding of errors in digital collections. For example, scholars using Google Books would probably want to know that thirty-six percent of Google Books have errors in author, title, publisher, or year of publication metadata.

Historian Tim Hitchcock talks about the importance of understanding the types of items in digital collections, saying, “Until we get around to including the non-canonical, the non-Western, the non-textual and the non-elite, we are unlikely to be very surprised.” Because they can contain what seems to be an almost infinite number of documents, archival databases offer an appearance of exhaustiveness that does not yield easily to a scholar’s probing. But while a gestalt understanding of a primary source database is crucial to determining the representation of items in the collection, the limited bibliographic information that is available about academic databases is scattered or unknown to most scholars.

As one step toward overcoming scholars’ lack of knowledge about the biases inherent in databases, I am working with a team of other students in the DH Praxis Seminar at the CUNY Graduate Center to create Beyond Citation, a website to aggregate bibliographic information about major humanities databases so that scholars can understand the significance of the material they have gleaned. Beyond Citation will help humanities scholars to practice critical thinking about research in databases.

The benefit of encouraging critical thinking about databases is more than merely facilitating research. Critical thinking about databases counters scholars’ “tendency to consider the archive as a hermetically-sealed space in which historical material can be preserved untouched,” and “[forces] a recognition of the constructed nature of evidence and its relation to the absent past.”

The Beyond Citation team has selected a set of humanities databases for the initial site launch and is working out the nitty-gritty of platform and server-side database functionality as well as completing research about the databases that we have chosen to cover on the site.

By providing structured information about databases and articles about research strategies, Beyond Citation will frame the common problems that scholars face when evaluating the results of their work in databases. Scholars will be able to enrich the data on the site with their own contributions, participate in reflective conversations and share highly situated stories about their experiences of working in databases. While an early version of the website to be launched in May 2014 will have a limited scope, the idea is that the site will eventually become a research workshop.

As information scientist Ryan Shaw observes, “In an era of vast digital archives and powerful search algorithms, the key challenge of organizing information is to construct systems that aid understanding, contextualizing, and orienting oneself within a mass of resources.” By making essential bibliographic information about the structures and content of academic databases accessible to scholars, Beyond Citation will take an important step to updating the scholarly apparatus to encourage critical thinking about databases and their effect on research and scholarship.

Reach us at BeyondCitation [at] gmail [dot] com or follow us on Twitter as we get ready for the launch in May: @beyondcitation

Acknowledgments

The idea for Beyond Citation originated from my encounter with a blog post by Caleb McDaniel about historians’ research practices suggesting the creation of an “online repository” of information about proprietary databases.

Easy Access to Data for Text Mining

Prospect Workflow

Will 2014 be the year that you take a huge volume of texts and run them through an algorithm to detect their themes? Because significant hurdles to humanists’ ability to analyze large volumes of text have been or are being overcome, this might very well be the year that text mining takes off in the digital humanities. The ruling in the Google Books federal lawsuit that text mining is fair use has removed many concerns about copyright that had been an almost insurmountable barrier to obtaining data. Another sticking point has been the question of where to get the data. Until recently, unless researchers digitized the documents themselves, the options for humanities scholars were mostly JSTOR’s Data for Research, Wikipedia and pre-1923 texts from Google Books and HathiTrust. If you had other ideas, you were out of luck. But within the next few months there will be a broader array of full-text data available from subscription and open access databases.

CrossRef, the organization that manages Digital Object Identifiers (DOIs) for database publishers, has a pilot text mining program, Prospect, that has been in beta since July 2013 and will launch early this year. There is no fee for researchers who already have subscription access to the databases. To use the system, researchers with ORCID identifiers log in to Prospect and receive an API token (alphanumeric string). For access to subscription databases, Prospect displays publishers’ licenses that researchers can sign with a click. After agreeing to the terms, they receive a full-text link. The publisher’s API verifies the token, license, and subscription access and returns full-text data subject to rate limiting (e.g. 1500 requests per hour).
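The token-based workflow described above can be sketched in a few lines of Python. This is an illustrative sketch only: the header name and the shape of the full-text URLs are assumptions based on the workflow as described, not documented details of the Prospect API.

```python
# Hypothetical sketch of the Prospect full-text workflow: attach the API
# token so the publisher can verify license and subscription access, then
# fetch documents while respecting the publisher's rate limit.
import time
import urllib.request

# Pause between requests to stay under an example limit of 1500/hour.
RATE_LIMIT_DELAY = 3600 / 1500  # seconds

def build_request(fulltext_url, api_token):
    """Build a request carrying the researcher's API token
    (the header name here is an assumption)."""
    return urllib.request.Request(
        fulltext_url,
        headers={"CR-Clickthrough-Client-Token": api_token},
    )

def fetch_fulltext(urls, api_token):
    """Yield (url, raw bytes) for each full-text link, one at a time,
    sleeping between requests so as not to exceed the rate limit."""
    for url in urls:
        req = build_request(url, api_token)
        with urllib.request.urlopen(req) as resp:
            yield url, resp.read()
        time.sleep(RATE_LIMIT_DELAY)
```

Sleeping between requests is the simplest way to honor a per-hour cap; a production client would instead inspect the publisher's rate-limit response headers, whatever those turn out to be.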

Herbert Van de Sompel and Martin Klein, information scientists who participated in the Prospect pilot, say “The API is really straightforward and based on common technical approaches; it can be easily integrated in a broader workflow. In our case, we have a work bench that monitors newly published papers, obtains their XML version via the API, extracts all HTTP URIs, and then crawls and archives the referenced content.”

The advantage for publishers is that providing access to an API may stop people from web scraping the same URLs that others are using to gain access to individual documents. And publishers won’t have to negotiate permissions with many individual researchers. Although a 2011 study found that publishers approached by scholars with requests for large amounts of data to mine are inclined to agree, it remains to be seen how many publishers will sign up for the optional service and what the license terms will be. Interestingly, the oft-maligned Elsevier is leading the pack, having made its API accessible to researchers during the pilot phase. Springer, Wiley, Highwire, and the American Physical Society are also involved.

Details about accessing the API are on the pilot support site and in this video. CrossRef contacts are Kirsty Meddings, product manager [kmeddings@crossref.org] and Geoffrey Bilder, Director of Strategic Initiatives [gbilder@crossref.org].

 

The News about the Humanities

When Steve Brier pointed the class to yet another piece in the news about the “crisis in the humanities,” I joked out loud to a colleague about whether the headline was from today or from thirty years ago, because the humanities have a reputation for crisis that won’t quit. In 1980, Newsweek ran a story about the “sorry state” of the humanities, based on a Rockefeller Foundation report. This was perhaps more apt at that time because, according to historian Ben Schmidt, who has written a series of blog posts on the subject of humanities enrollments, “the real collapse of humanities enrollments happened in the 1970s.” Humanities enrollments have recovered and leveled off since that time. Working with data that he hand-transcribed from paper printouts, Schmidt argues convincingly that “long term results actually show that since 1950, only women have shown a major drop in the percentage of humanities majors.” And, “Before co-education, only about a tenth of pre-professional degrees went to women: after 1985, they were half. And since the whole puzzle is how women’s behavior changed, not how men’s majors changed, this tells you most of what you need to know.” Schmidt also refers to an Atlantic article showing that humanities majors have the same employment level as computer science majors.

Instead of repeating shibboleths about the crisis in humanities enrollments, journalists should examine the data.

Addendum: After I posted this, Ben Schmidt tweeted, “History majors are up 18% the last 25 years. Math and CS are down 40%. Can we put this media narrative to rest?” The tweet includes a link to his dynamic graph from 1986 to 2011 to view majors from all disciplines (group together your definition of humanities fields), by gender, and by institution. In Schmidt’s guest post at the Chronicle of Higher Education earlier this year, there is a link to a fun Google Books Ngram of the “crisis in the humanities.”

Know Your Typewriter History

Film still from “Know Your Typewriter” courtesy of Prelinger Archives and the Internet Archive

In “Gibson’s Typewriter,” Scott Bukatman writes about the irony that William Gibson’s cyberpunk novel Neuromancer was composed on a manual typewriter. Distinguishing himself from the postmodernists who have declared the end of history, Bukatman argues that “The discourse surrounding (and containing) electronic technology is somewhat surprisingly prefigured by the earlier technodiscourse of the machine age.” To explore the “tropes that tie cyberculture to its historical forebears,” Bukatman says that he wants to reinstate the history of the typewriter “in order to type history back into Neuromancer.”

Typing history back into Neuromancer turns out to be quite a challenge because as Bukatman says, “The repression of the typewriter’s historical significance in the Neuromancer anecdote has its analogue in the annals of technological history. No serious academic investigation of the typewriter has been published, to my knowledge, and almost all curious writers seem to rely upon the same two texts: The Typewriter and the Men Who Made It (hmmm . . .) and, even better, The Wonderful Writing Machine (wow!), both highly positivist texts from the 1950s.”

I started out interested in and sympathetic to Bukatman’s aims. He is a gifted writer who skillfully pulls apart and teases out the meaning of the 1950s texts. I had the impression that I was reading the “truest” version of the history of the typewriter available at the time Bukatman was writing. I was curious, though, about other histories of the typewriter that might have been published after this piece was written.

After some research I was surprised to learn that there are quite a few histories of the typewriter, almost all of which were published well before Bukatman’s essay. See the Smithsonian’s bibliography of the typewriter and Google Books (related books links). A number of these were written by collectors or have illustrations targeted to collectors; but several are more serious, with Michael H. Adler’s The Writing Machine widely regarded as the most accurate. Despite Bukatman’s claim at the time of his writing that there weren’t academic books about the history of the typewriter, one of the two histories he cites, The Typewriter and the Men Who Made It, was written by a professor at the University of Illinois, published by the University of Illinois Press, and reviewed in a journal of the Organization of American Historians. Another example from academia is George Nichols Engler’s dissertation, The Typewriter Industry: The Impact of a Significant Technological Revolution (1969).

Am I simply being pedantic by pointing this out? I don’t think so. Bukatman, Professor of Art and Art History at Stanford, declares that his task is “reinstating history,” and the recitation of that history comprises about a third of the essay. Calling the lack of authoritative histories a “repression” and claiming an analogy to the “repression” of the anecdote about Gibson’s manual typewriter in cyberculture is central to the structure of his argument. And, through his fluent analysis of the texts he has chosen, he seems to present himself as an authority who has culled the best that is available.

The feeling that I am left with as a reader is of being misled by the writer (however inadvertently), and that, to borrow Bukatman’s phrase, the “disappearance [of history] was little more than a trope of a postmodern text.”

The Twenty-First Century Footnote, Part Two

In Part One of this blog post, I wrote about scholars’ reliance on proprietary databases for research and the importance of understanding the constraints that database structures place on the outcomes of their efforts. Unfortunately, information about the structures of proprietary databases is generally not easily accessible. To remedy this, Caleb McDaniel has talked about the need to create an online resource to collate information about the construction of proprietary databases.

As an exploration of the structure of a proprietary database, I will look at one commercial database’s search and text analysis tools and touch on their handling of content. My goal is to demonstrate some of the complexity of these systems and to parse out the types of information that scholars would want to know and should consider sharing when writing up their research findings.

Artemis – Text mining lite

I recently attended a presentation about a commercial database company’s venture into what I call “text mining lite.” The company, Gale, has just started to offer text analysis and other tools that are squarely aimed at the field (or set of methods) of digital humanities. The tools are available through Artemis, an interface that allows searches across multiple collections of primary sources from the eighteenth century (ECCO) and nineteenth century (NCCO). There is a separate Artemis platform for literary material with the same analytic features. By 2015, Gale humanities collections running the gamut from 19th Century U.S. Newspapers to the Declassified Documents Reference System, among many others, will migrate into Artemis. Artemis is available CUNY-wide.

Parameters of search

To access Artemis’s textual analysis capabilities, the user first sets the parameters for selecting materials. The options are extensive: date ranges, content type (e.g. manuscript, map, photograph), document type (e.g. manifesto, telegram, back matter), title, and source library. For example, one could search only letters from the Smith College archives, or manuscripts from the Library of Congress in particular years.

Context

Discussing the use of Google’s Ngram Viewer to find themes in large bodies of texts, Matt Jockers advises caution: “When it comes to drawing semantic meaning from a word, we require more than a count of that word’s occurrence in the corpus. A word’s meaning is derived through context” (120). In his CUNY DHI and Digital Praxis Seminar lecture, David Mimno addressed the necessity of understanding the context of words in large corpora: “We simply cannot trust that those words that we are counting mean what we think they mean. That’s the fundamental problem.”

One way that Artemis deals with this is by offering a view into the context of the documents in search results. For each result, clicking on “Keywords in Context” brings up a window showing the words surrounding the keyword in the actual (digital facsimile) document. This makes it relatively simple to identify whether a document is actually relevant to your research, as long as the number of documents being examined is not too large.
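The “Keywords in Context” view is essentially a concordance display, a long-standing technique in text analysis. A toy version for readers unfamiliar with the idea (the function and its word-window approach are illustrative, not Artemis’s implementation):

```python
# A minimal keyword-in-context (KWIC) display: each occurrence of the
# keyword is shown with a window of surrounding words, which is the kind
# of context a scholar needs to judge a document's relevance.
def kwic(text, keyword, window=4):
    """Return (left context, keyword, right context) tuples for each match."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        # Compare case-insensitively, ignoring trailing punctuation.
        if w.lower().strip(".,;:!?") == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, w, right))
    return hits
```

For example, `kwic("the quick brown fox jumps over the lazy dog", "fox", window=2)` yields one hit with “quick brown” on the left and “jumps over” on the right.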

Refining results

The categories of search that Artemis allows are quite flexible, and it is also possible to use proximity operators to find co-located words. This means that, in many situations, it will be possible to further refine results through iterative searching to locate smaller batches of relevant documents on which to run the text analysis tools.

Ngram viewer

Artemis features a visualization tool that offers some improvements over Google’s Ngram Viewer for showing the frequency of terms over time. The term frequency ngram is generated from the search results; clicking and dragging on the graph adjusts the date range, which can zoom to the one-year level, and clicking a point on the graph retrieves the underlying documents. The visualization also displays term popularity, the percentage of each year’s total documents that match the term, which normalizes raw counts against the amount of content available for that year.
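The distinction between raw term frequency and normalized term popularity can be made concrete with a small sketch (the data shapes here are assumptions for illustration; this is not Gale’s code):

```python
# Term frequency counts matching documents per year; term popularity
# divides that count by the total number of documents the collection
# holds for the year, so years with more digitized content don't
# automatically dominate the graph.
from collections import Counter

def term_frequency(matching_years):
    """Count matching documents per year from a list of years of hits."""
    return Counter(matching_years)

def term_popularity(matching_years, totals_by_year):
    """Percent of each year's total documents that match the search term."""
    freq = term_frequency(matching_years)
    return {year: 100 * freq.get(year, 0) / total
            for year, total in totals_by_year.items()}
```

Two hits out of 200 documents in 1840 and one hit out of 100 in 1841 both yield a popularity of 1.0%, even though the raw frequencies differ, which is exactly the correction normalization provides.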

Term clusters visualization

For larger sets of documents, or to look at entire collections, researchers might want to use term clusters. Term clusters use algorithms to group words and phrases that occur a statistically relevant number of times within the search results.

The visualization of term clusters is based on the first 100 words of the first 100 search results per content type. This means that the algorithm runs only within, for example, the first one hundred words of the first one hundred monographs, the first one hundred words of the first one hundred manuscripts, and the first one hundred words of the first one hundred newspaper articles. These size limits exist because the text analysis tools are bandwidth-intensive: searches of larger numbers of documents take longer to return results and also slow down the system for other users. By clicking on the clusters, it is possible to drill down into the search results to the level of individual documents and their metadata.
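The sampling limit described above can be expressed as a short sketch, which makes plain how much of a large result set the clustering never sees (the document structure is an assumption for illustration, not Artemis’s internals):

```python
# Truncate search results the way the term-cluster feature reportedly
# does: per content type, keep only the first `max_docs` documents, and
# only the first `max_words` words of each. Everything else is ignored
# by the clustering algorithm.
from collections import defaultdict

def sample_for_clustering(results, max_docs=100, max_words=100):
    """Group results by content type and truncate both the document
    count and each document's text before clustering."""
    by_type = defaultdict(list)
    for doc in results:
        bucket = by_type[doc["content_type"]]
        if len(bucket) < max_docs:
            bucket.append(doc["text"].split()[:max_words])
    return dict(by_type)
```

A scholar whose search returns thousands of newspaper articles should know that, under these limits, clusters reflect at most the opening words of the first hundred of them.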

Legibility of documents

Scholars should have an understanding of the process by which database publishers have transformed documents into digital objects because it affects the accuracy of searches and text analysis. In Gale’s collections, printed materials are OCR’d. For nonprint materials, such as manuscripts, ephemera and photograph captions, the metadata of names, places and dates are entered by hand. By providing improved metadata for nonprint materials, Gale has increased the discoverability of these types of documents. This is particularly important for those studying women and marginalized groups whose records are more likely to be found in ephemeral materials.

Collection descriptions

Understanding the types of materials contained within a proprietary database can be difficult. The Eighteenth Century Collections Online (ECCO) is based on the English Short Title Catalogue from the British Library and is familiar to many scholars of the eighteenth century. The Nineteenth Century Collections Online (NCCO) is a newer grouping of collections that is being continually updated. To see a detailed description of the collections in NCCO, go to the NCCO standalone database, not the Artemis platform, and click Explore Collections.

Data for research

Generally, scholars can download PDFs of documents from Artemis only one document at a time (up to 50 pages per download). When I asked about access to large amounts of data for use by digital humanists, the Gale representative said that while their databases are not built to be looked at on a machine level (because of the aforementioned bandwidth issues), Gale is beginning to provide data separately to scholars. They have a pilot program to provide datasets to Davidson College and the British Library, among others. Gale is also looking into setting up a new capability to share data that would be based outside their current system. The impression that I got was that they would be receptive to scholars who are interested in obtaining large amounts of data for research.

Bonus tip: direct (public) link to documents

Even though it doesn’t have anything to do with standards for presenting scholarship, I thought people might want to know about this handy feature. Artemis users have the ability to bookmark search results and save the URL for future reference. The link to the document(s) can then be shared with anyone, even those without logins to the database. To be clear, anyone who clicks on the link is taken directly to the document(s), although they won’t have the capability to extend the search. This makes it easy to share documents with students and through social media.

In this post, I have sought to shed some light on the usually opaque construction of proprietary databases. If people start “playing” with Artemis’s text mining lite capabilities, I would be interested in hearing about their perceptions of its usefulness for research.

Works cited

Jockers, Matthew L. “Theme.” Macroanalysis: Digital Methods and Literary History. Urbana: University of Illinois Press, 2013. Print.

The Twenty-First Century Footnote*

In Jefferson Bailey’s brilliant article on digital archives, he writes, “Digital objects will have an identifier, yes, but where they ‘rest’ in intellectual space is contingent, mutable. The key point is that, even at the level of representation, arrangement is dynamic . . . Arrangement, as we think of it, is no longer a process of imposing intellectualized hierarchies or physical relocation; instead, it becomes largely automated, algorithmic, and batch processed.”

Digital humanists have increasingly embraced text mining and other techniques of data manipulation both within bodies of texts that they control and in proprietary databases. When presenting their findings, they must also consider how to represent their methodology; to describe the construction of the databases and search mechanisms used in their work; and to make available the data itself.

Many people (Schmidt, Bauer, and Gibbs and Owens) have written about the responsibility of digital scholars to make their methods transparent and data publicly available as well as the need to understand how databases differ from one another (Norwood and Gregg).

Reading narratives of the research methods of David Mimno and Matt Jockers (as well as listening to Mimno’s recent lecture) has been useful for me in my ongoing thinking about the issues of how digital humanists use data and how they report on their findings. Mimno and Jockers are exemplars of transparency in their recitation of methods and in the provision of access to their datasets so that other scholars might be able to explore their work.

While every digital humanist may not use topic modeling to the extent that Mimno and Jockers do, it is fair to say that, in the future, almost all scholars will be using commercial databases to access documents, and that access will come with some version of text analysis. But what do search and text analysis mean in commercial databases? And how should they be described? In relation to keyword searching in proprietary databases, the historian Caleb McDaniel has pointed out that historians do not have codified practices for the use and citation of databases of primary materials. He says that to correctly evaluate proprietary databases scholars should know whether the databases are created by OCR, what the default search conventions are, whether the databases use fuzzy hits, when they are updated, and other issues. At this time, much of the information about how commercial databases are constructed is occluded. McDaniel recommends the creation of an “online repository” of information about commercial databases and also suggests that historians develop a stylesheet for database citation practices.

Why is this lack of information about the mechanisms of commercial databases important? Because, as Bailey says, the arrangement of digital objects in archives (and databases) is automated, algorithmic, and batch processed. Yet, as historian Ben Schmidt has noted, “database design constrains the ways historians can use digital sources,” and proprietary databases “force” a “syntax” on searches. Since database search results are contingent upon database structures, scholars who make claims about the frequency of search terms must, at a minimum, understand those structures in order to reckon with methodological arguments that might be raised against their conclusions.

I recently attended a presentation about a commercial database company’s venture into what I call “text mining lite.” What I learned has only bolstered my ideas about the importance of understanding the practices of proprietary database publishing and the necessity of scholars having access to that information. The company, Gale, one of the larger database publishers, seems to be courting the digital humanities community (or at least its idea of the digital humanities community). Gale is combining access to multiple databases of primary eighteenth- and nineteenth-century sources through an interface called Artemis, which allows the creation of “term clusters.” These are clusters of words and phrases that occur a statistically relevant number of times within the user’s search results. One of the crucial things to know about the algorithms used is that Artemis term clusters are based on the first 100 words of the first 100 search results per content type. In practice, for search results that might include monographs, manuscripts and newspapers as types, this means that the algorithm runs only within the first one hundred words of the first one hundred monographs, the first one hundred words of the first one hundred manuscripts, and the first one hundred words of the first one hundred newspaper articles. [I will describe Artemis at more length in Part Two of this blog post.] Clearly, any conclusions drawn by scholars and others using term clusters in Artemis should include information about the construction of the database and the limitations of the search mechanisms and text analysis tools.
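The sampling limitation described above can be made concrete with a minimal sketch. This is not Gale’s actual code; the function names and data layout are assumptions for illustration. It only models the stated constraint: term counts are drawn solely from the first 100 words of the first 100 results within each content type, and everything else in the result set is never examined.

```python
from collections import Counter

RESULTS_PER_TYPE = 100   # only the first 100 results of each content type
WORDS_PER_RESULT = 100   # only the first 100 words of each result

def cluster_terms(search_results, top_n=10):
    """search_results: iterable of (content_type, full_text) pairs,
    in the order the database returns them. Returns the most frequent
    terms per content type, under the truncation rules above."""
    counts_by_type = {}
    sampled = Counter()  # how many results of each type have been sampled
    for content_type, text in search_results:
        if sampled[content_type] >= RESULTS_PER_TYPE:
            continue  # results beyond the first 100 are ignored entirely
        sampled[content_type] += 1
        words = text.lower().split()[:WORDS_PER_RESULT]  # truncate each text
        counts_by_type.setdefault(content_type, Counter()).update(words)
    return {ctype: counter.most_common(top_n)
            for ctype, counter in counts_by_type.items()}
```

Under this scheme, a monograph whose key terms first appear after its opening hundred words contributes nothing to the cluster, which is why conclusions drawn from such clusters need to disclose the sampling rules.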

As a final project for the Digital Praxis Seminar, I am thinking about writing a grant proposal for the planning stages of a project that would consider possible means of gathering and making available information about the practices of commercial database publishers. I would appreciate any thoughts or comments people have about this.

* The title is taken from The Hermeneutics of Data and Historical Writing by Fred Gibbs and Trevor Owens. “As it becomes easier and easier for historians to explore and play with data it becomes essential for us to reflect on how we should incorporate this as part of our research and writing practices. Is there a better way than to simply provide the raw data and an explanation of how to witness the same phenomenon? Is this the twenty-first century footnote?”

Dialogues on Feminism and Technology

The experimental online class Dialogues on Feminism and Technology led by Anne Balsamo and Alexandra Juhasz has begun posting weekly video conversations. Sixteen colleges are participating, including CUNY’s Graduate Center and Macaulay Honors College, as well as thousands of learners outside of formal educational settings. The class is intended to be a model for open access pedagogy in a collaborative environment, in contrast to MOOC-style learning. For those without institutional logins, there are suggested readings to accompany the class.

As an example of the ideas about women and technology in popular culture that they are seeking to counter, class organizers point to a June 2012 New York Times article about Silicon Valley which opened with “Men invented the Internet.” Partly to address that type of thinking, students will participate in Storming Wikipedia, an exercise in writing and editing to include women and feminist scholarship in Wikipedia. HASTAC has a wiki page about the Wikistorming that took place earlier this year.

Digital Humanities: “It does things to things.”

At the beginning of class, Matt Gold asked us to write a definition of the Digital Humanities “as you would like it to be defined, not as it is defined.”

Definition at the beginning of class: The Digital Humanities is a set of computationally based methods for studying historical and contemporary artifacts and texts. I think Jamie Bianco’s statement about the Digital Humanities, “It does things to other things,” adds an important dimension to this.

There was much class discussion about definitions of Digital Humanities in relation to quantitative methods and the sciences. The parallel drawn between GIS and the Digital Humanities was particularly enlightening. A student said the field of geography had answered the question of whether GIS is a tool, a discipline or a field of study—yes, to all of these categories. GIS is now recognized as its own field of research, but also as a tool that can be used without knowledge of the theory behind it, similarly to the way some people say that you don’t have to be able to code to be a digital humanist. I think what happened with GIS in geography is a likely outcome of the debates around Digital Humanities.

During the discussion, Steve Brier pointed out that the structure that digital humanists work in, or aspire to work in, was created in the sciences where collaborative work using digital technology, and a quick process of review and publication is the standard. He said, jokingly, that humanists have been “slow” about catching on.

Are digital humanists laggards, or is this “slowness” partly because of differences between scholarly communication in the sciences and the humanities? Scientists have to move quickly to document their discoveries because the structure of the field requires it. Humanists work in a more drawn out time frame. Kathleen Fitzpatrick makes another crucial distinction, “The work in the sciences, on some level, is doing the science. The stuff that gets communicated afterward is the record of its having been done. Where in the humanities, the work is the thing that is communicated.” The insight that Digital Humanities projects are both the scholarly work and the communication of that work provides a context for thinking about the future of scholarly communications. Fitzpatrick says, “The sciences in their modes of communication are still very fixated on this object that is the journal article however it is distributed.” She says the strength of the humanities is that they are in a position to be “Utterly reimagining what the nature of scholarship can be. . . . That there are other forms that projects can take. They can take the form of Omeka exhibits . . . Scalar multimedia projects . . . databases . . . archives.”

After the class discussion, I would add a few words of explanation to the definition and also stress the role of communication in Digital Humanities projects.

Definition after discussion: The Digital Humanities is a set of computationally based methods for studying historical and contemporary artifacts and texts. Jamie Bianco’s statement about the Digital Humanities, “It does things to other things,” adds an important dimension to the description of Digital Humanities methodology by emphasizing that it is also about the creation of new artifacts, texts and tools that communicate interpretations and arguments.

Works cited

Bianco, Jamie Skye. Digital Humanities Interview Project. Video. Web.

Fitzpatrick, Kathleen. “The Humanities in and for the Digital Age.” Scholarly Communications Program, Columbia University Libraries, 5 Mar. 2013. Video. Web.