In Jefferson Bailey’s brilliant article on digital archives, he writes, “Digital objects will have an identifier, yes, but where they ‘rest’ in intellectual space is contingent, mutable. The key point is that, even at the level of representation, arrangement is dynamic . . . Arrangement, as we think of it, is no longer a process of imposing intellectualized hierarchies or physical relocation; instead, it becomes largely automated, algorithmic, and batch processed.”
Digital humanists have increasingly embraced text mining and other techniques of data manipulation both within bodies of texts that they control and in proprietary databases. When presenting their findings, they must also consider how to represent their methodology; to describe the construction of the databases and search mechanisms used in their work; and to make available the data itself.
Many people (Schmidt, Bauer, and Gibbs and Owens) have written about the responsibility of digital scholars to make their methods transparent and their data publicly available, as well as about the need to understand how databases differ from one another (Norwood and Gregg).
Reading narratives of the research methods of David Mimno and Matt Jockers (as well as listening to Mimno’s recent lecture) has been useful for me in my ongoing thinking about the issues of how digital humanists use data and how they report on their findings. Mimno and Jockers are exemplars of transparency in their recitation of methods and in the provision of access to their datasets so that other scholars might be able to explore their work.
While not every digital humanist may use topic modeling to the extent that Mimno and Jockers do, it is fair to say that, in the future, almost all scholars will be using commercial databases to access documents and that such access will come with some version of text analysis. But what do search and text analysis mean in commercial databases? And how should they be described? In relation to keyword searching in proprietary databases, the historian Caleb McDaniel has pointed out that historians do not have codified practices for the use and citation of databases of primary materials. He says that to evaluate proprietary databases correctly, scholars should know whether the databases are created using OCR, what the default search conventions are, whether the databases use fuzzy hits, and when they are updated, among other issues. At this time, much of the information about how commercial databases are constructed is occluded. McDaniel recommends the creation of an “online repository” of information about commercial databases and also suggests that historians develop a stylesheet for database citation practices.
Why is this lack of information about the mechanisms of commercial databases important? Because, as Bailey says, the arrangement of digital objects in archives (and databases) is automated, algorithmic, and batch processed. Yet, as the historian Ben Schmidt has noted, “database design constrains the ways historians can use digital sources,” and proprietary databases “force” a particular “syntax” on searches. Since database search results are contingent upon database structures, scholars making claims about the frequency of search terms must, at a minimum, understand those structures in order to answer the methodological objections that might be raised against their conclusions.
I recently attended a presentation about a commercial database company’s venture into what I call “text mining lite.” What I learned has only bolstered my ideas about the importance of understanding the practices of proprietary database publishing and the necessity of scholars having access to that information. The company, Gale, one of the larger database publishers, seems to be courting the digital humanities community (or at least its idea of the digital humanities community). Gale is combining access to multiple databases of primary eighteenth- and nineteenth-century sources through an interface called Artemis, which allows the creation of “term clusters.” These are clusters of words and phrases that occur a statistically relevant number of times within the user’s search results. One of the crucial things to know about the algorithms used is that Artemis term clusters are based on the first 100 words of the first 100 search results per content type. In practice, for search results that might include monographs, manuscripts, and newspapers as types, this means that the algorithm runs only within the first one hundred words of the first one hundred monographs, the first one hundred words of the first one hundred manuscripts, and the first one hundred words of the first one hundred newspaper articles. [I will describe Artemis at greater length in Part Two of this blog post.] Clearly, any conclusions drawn by scholars and others using term clusters in Artemis should include information about the construction of the database and the limitations of the search mechanisms and text analysis tools.
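Gale has not published the details of its clustering algorithm, so the following is only a rough sketch, in Python, of the sampling constraint described above. The names and data structures are my own invention, purely for illustration; the actual Artemis implementation is not public and surely differs.

from collections import Counter

RESULTS_PER_TYPE = 100   # only the first 100 hits per content type are sampled
WORDS_PER_RESULT = 100   # and only the first 100 words of each of those hits

def term_counts(results_by_type):
    """Illustrative only: results_by_type maps a content type (e.g. 'monographs',
    'manuscripts', 'newspapers') to a list of full-text hits in the order the
    database returned them."""
    counts = Counter()
    for hits in results_by_type.values():
        for text in hits[:RESULTS_PER_TYPE]:
            counts.update(text.lower().split()[:WORDS_PER_RESULT])
    return counts

# Anything past the 100th hit, or past the 100th word of a hit, never reaches
# the counts at all -- which is why the constraint matters for any
# frequency-based claim built on the resulting term clusters.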
As a final project for the Digital Praxis Seminar, I am thinking about writing a grant proposal for the planning stages of a project that would consider possible means of gathering and making available information about the practices of commercial database publishers. I would appreciate any thoughts or comments people have about this.
* The title is taken from “The Hermeneutics of Data and Historical Writing” by Fred Gibbs and Trevor Owens. “As it becomes easier and easier for historians to explore and play with data it becomes essential for us to reflect on how we should incorporate this as part of our research and writing practices. Is there a better way than to simply provide the raw data and an explanation of how to witness the same phenomenon? Is this the twenty-first century footnote?”
This is an interesting post (and project) – I agree that we tend to take for granted how commercial databases sort and present their material. You mention ‘Artemis’ – is this related at all to the JISC ‘Historic Books’ project that brings together EEBO, ECCO, and nineteenth-century databases? https://www.jisc-collections.ac.uk/jiscecollections/jischistoricbooks/ Or is it another wrinkle on the same idea?
Thanks for your comment. I will attempt to disambiguate these efforts. Artemis is an interface created by Gale to search its humanities databases. Artemis allows searching and text analysis across multiple collections. Gale publishes Eighteenth Century Collections Online (ECCO) and Nineteenth Century Collections Online (NCCO). To build those collections, it partnered with and/or licensed content from the British National Archives, the British Library, and various U.S. collections, among others. JISC Collections (a U.K. governmental entity) appears to have purchased access to ECCO and NCCO from Gale for its member institutions. Early English Books Online (EEBO) is a database published by ProQuest. EEBO is available to JISC Collections member institutions, presumably through a purchase from ProQuest. Artemis allows cross-searching of ProQuest’s EEBO for the convenience of its users, but does not actually provide access to EEBO records.
It is not clear to me if JISC HistoricBooks (the part of JISC Collections that includes ECCO, NCCO and EEBO) currently has access to or will, at some point in the future, have access to the Artemis interface.
Thanks for the great post!
One thing I remember thinking about back when I wrote the posts you linked is that it would be cool to add some kind of crowd-sourcing to any such site. That is, to allow users to give time-stamped reports on their individual experiences using such databases, perhaps on a “wiki” or “discuss” page behind the main front-facing page offering an overview of the database.
This could make the site more difficult to maintain, especially if one tried to check individual claims made by users. But given that not all owners of proprietary databases will be eager to release information, I think even a free-for-all space for users could be helpful in sharing tips or small discoveries that people have made about a particular engine’s quirks.
Thanks for the comments. My posts were inspired by your posts on the subject, and the links that I followed from there.
Your crowd-sourcing idea is interesting. I really like the idea of people sharing information about the peculiarities of particular sites especially because the database contents and structures are constantly evolving.
After I made these two posts, my thinking about these issues benefited from a conversation with Matt Gold, one of the instructors of the Digital Praxis Seminar. He helped me to clarify the problems such a site might seek to resolve and possible solutions. At this point, I see two issues:
1) When academics are writing up their research findings, how can they make clear the process and parameters they have used to search proprietary databases?
2) How can database publishers be encouraged to make known the structure and mechanisms of their databases?
One possible solution
Instead of a repository of database practices that might be difficult to create and maintain, what if there were a database citation tool for scholars that would assist them in answering the key questions about their process?
Another possible solution
Creating a set of questions that scholars should be able to answer about the proprietary databases that they are using. (This is related to the stylesheet idea that you have suggested, but might be simpler to pull off.)
Thinking (far) ahead, if societies of scholars were to say that academics should answer a particular set of questions when using proprietary databases, it would be a strong incentive for database publishers to produce the information.
I like those follow-up ideas very much, and agree that they would be easier to implement and maintain at first. I can imagine something structured like the Creative Commons Choose a License page, which dynamically changes the “license” in response to user input prompted by questions.
With some databases, it might also be possible to extract some of the desired information about the database directly from the URL query, sort of like Twitter citation generation.
Would love to hear more about what you come up with. If it’s helpful to you, here are the notes that were taken at the THATCamp session I proposed on this back in 2011.
A Creative Commons-style page could work because there would be a lot of if/then types of responses. I don’t understand the mechanisms to extract information from a query to a proprietary database. Is there more that you could say about that? Or perhaps you could point me to some resources?
Thanks so much for the THATcamp notes. I have already cribbed heavily from them. Below is a preliminary draft [https://docs.google.com/document/d/1CzzO52sJZV5ZlRyTTebkMrsY2_W2yNJDSC3EvPdgF0Y/edit?usp=sharing] of questions for researchers to engage with.
These questions are meant to be a guide for scholars to discuss the methods they have used to search proprietary databases. They would generally apply only to proprietary databases, although some might be useful for researchers who have obtained data from other sources.
Was the database used to formulate research questions or locate documents/objects or both?
Answer these questions when citing particular documents/objects or discussing the use of the database to frame questions:
When was the search performed?
What version of the database was used?
If available, what is the URL for the results?
If available, what is the DOI for the search?
Was the text from a full article or an abstract?
Other relevant information about method used.
Database description
Does database use “fuzzy hits” for searches?
Quality of the OCR of the corpus.
Quantity of OCR’d documents in the corpus.
Percentage of documents searched that are OCR’d.
Were searches performed within the entire document or metadata only?
Quantity of hand-transcribed documents in corpus.
Other relevant information.
When discussing quantitative results or findings, e.g. number of documents containing keywords or results obtained with text analysis tools, there should be a textual explanation in the methodology section of the paper. Also, these additional questions should be addressed:
Search parameters
What exact phrase was used in searches? (Provide the entire list of queries, including searches with negative results.)
Were entire documents searched or only metadata?
Was subject indexing provided by the database used in queries?
Was a Boolean phrase used?
Was “sort by relevance” checked?
If available, what is the DOI for the search?
How many documents/objects were returned in response to the query?
Were certain results excluded? Why?
Were the results downloaded? What format?
If downloaded, were results reformatted or cleaned?
Is the resulting research data accessible to others?
Other relevant information about method used.
Text analysis
What text analysis tools were used?
What visualizations were created?
Results
Display at least a portion of results to show context.
Include a portion of visualizations of analysis.
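As a very rough illustration of the citation-tool idea discussed above, here is a minimal sketch, in Python, of how answers to a handful of these questions might be assembled into a boilerplate methods statement, in the spirit of the Creative Commons license chooser. The field names and sample values are hypothetical placeholders, not drawn from any real search.

# Hypothetical sketch: field names and sample values are invented placeholders.
TEMPLATE = (
    "Searches were performed on {date} in {database}. "
    "The query {query} was run against {scope} and sorted by {sort_order}. "
    "{excluded} "
    "The resulting data {availability}."
)

def methods_statement(answers):
    return TEMPLATE.format(**answers)

print(methods_statement({
    "date": "[date of search]",
    "database": "[database name and version]",
    "query": "'[exact query string]'",
    "scope": "full documents",          # or "metadata only"
    "sort_order": "relevance",
    "excluded": "No results were excluded.",
    "availability": "are available on request",
}))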
Looks like your list is off to a great start!
Re: extracting information from the query URL, here’s an example of what I mean. If you look at a search on the Internet Archive, a typical results URL might look like this:
http://archive.org/search.php?query=collection%3A%22library_of_congress%22%20AND%20%28william%20lloyd%20garrison%29
Just parsing that URL could tell me that the search was in the Library of Congress collection only. It also tells me that while library_of_congress was wrapped in quotes (%22), the terms william lloyd garrison were not. What I was wondering in my earlier comment was whether even proprietary database results URLs could be made to cough up some information if the researcher were to copy and paste their URL into whatever form you set up. For example, here’s a result from the American Periodical Series Online by ProQuest:
http://search.proquest.com.ezproxy.rice.edu/americanperiodicals/results/14121F8FBB04011C809/1/$5bqueryType$3dadvanced:americanperiodicals$3b+sortType$3drelevance$3b+searchTerms$3d$5b$3cAND$7cti:texas$3e$5d$3b+searchParameters$3d$7bNAVIGATORS$3dsourcetypenav,pubtitlenav,objecttypenav,languagenav$28filter$3d200$2f0$2f*$29,decadenav$28filter$3d110$2f0$2f*,sort$3dname$2fascending$29,yearnav$28filter$3d1100$2f0$2f*,sort$3dname$2fascending$29,yearmonthnav$28filter$3d120$2f0$2f*,sort$3dname$2fascending$29,monthnav$28sort$3dname$2fascending$29,daynav$28sort$3dname$2fascending$29,+RS$3dOP,+jstorMuseCollections$3dJSTOR+MUSESTD,+chunkSize$3d20,+instance$3dprod.academic,+date$3dRANGE:1828,1832,+ftblock$3d740842+1+660848+670831+194104+194001+670829+194000+660843+660840+104,+removeDuplicates$3dtrue$7d$3b+metaData$3d$7bUsageSearchMode$3dAdvanced,+dbselections$3d10000019,+SEARCH_ID_TIMESTAMP$3d1381846467262,+fdbok$3dN,+siteLimiters$3dDocumentType$7d$5d?accountid=7064
That’s obviously a lot more opaque than the Internet Archive one, but if you look at it closely you can figure out how the results were sorted, what searchTerms I used, and even what the searchParameters are. It would be hard to build something that worked with all databases, especially since not all databases are going to provide such data-rich URLs, but a proof of concept for one could be useful.
For more about understanding query strings, check out this Programming Historian lesson.
Thanks for the comments and the extremely clear explanation of how it might be possible to extract information from the query URL. The Programming Historian link is also excellent in describing how to interpret and manipulate the query strings of URLs. I think it would be worth exploring the possibilities of automating the process of citation.
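To give one concrete illustration of what such automation might look like: the Internet Archive URL quoted above can be decoded with nothing more than Python’s standard library. This is only a sketch; proprietary URLs like the ProQuest example would each need their own parser, and some will not expose usable query strings at all.

from urllib.parse import urlparse, parse_qs

# The Internet Archive results URL quoted in the comment above.
url = ("http://archive.org/search.php?query=collection%3A%22library_of_congress%22"
       "%20AND%20%28william%20lloyd%20garrison%29")

# parse_qs percent-decodes the query string for us.
params = parse_qs(urlparse(url).query)
print(params["query"][0])
# -> collection:"library_of_congress" AND (william lloyd garrison)

# A citation helper could store this decoded query together with the date of
# the search and the name of the database, giving later readers at least the
# bare minimum needed to attempt the same search.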