
The Citizen

Five rules for using data in digital humanities

By William Sentence, MPP ’14, Correspondent

Harvard held an event last week entitled ‘Digital Humanities Across the Spectrum’. The event is part of an explosion of interest in the field: the number of sessions on the subject at the Modern Language Association Conference in Boston has doubled to nearly 70 over the past three years.

This growth has ramifications for the public funding of humanities research too. These were tackled in a recent Harvard Kennedy School panel discussion on the future of the arts.

The excitement makes sense. Data has the potential to open up many new avenues for qualitative research. One ongoing project is led by Claude Willan at Stanford. He is using data visualization to map John Locke’s more than 3,000 letters by date, type and the network they created. He hopes to show that Locke stood at the nexus of disparate intellectual communities across Europe, and to shed light on the claim that he brought these separate groups together to foster the creation of the public sphere.

Yet opposition to digital humanities is growing, led by Stephen Marche under the banner ‘Literature is not data’. His most consistent theme is that large datasets and algorithms are necessarily reductionist and remove humanity from the humanities.

Marche’s characterization is extreme, but it holds some lessons and underlines the need to clear up some misperceptions. In light of this, here are five rules for people using data in digital humanities:

1) Relish the messiness of the humanities. Digital humanities should be prepared to use data as a tool to aid qualitative analysis and generate new research paths. However, it should not assume that the predictive certainty of pattern recognition in data can refine all of the liberal arts. Meaning can be ambiguous, often intentionally so, and qualitative interpretation remains crucial. That interpretation can be aided by these new resources, but not replaced by them.

2) Do better data science. Humanities scholars and computer science researchers are not yet working together enough. Many more insights will probably only be spotted through high-quality data analytics, but much of the digital humanities work so far has been conducted by academics with limited experience of working with big data. This limits the impact of the insights they come up with.

3) Data is great for providing context. The time when a book was written, its place and its legacy all matter. Data can help here by digitizing the context, not just the content. That means capturing as much of the context as possible – the networks of writers, intellectuals and ideas of the time – all in digital form and ready as a resource for non-digital researchers too.

4) Keep digitizing the archives. This has been happening online for years – Early English Books Online (EEBO) is a great example. Since 2002, Google has been at the heart of this digitization effort. Two big problems remain. One, there is a huge wealth of non-print content (in museums, physical archives and libraries) that is not yet close to being archived digitally. Two, the existing databases are not linked up – a lot of insights could emerge just by integrating resources that already exist.

5) Data-driven does not necessarily mean rational. There is a tendency to slip from using quantitative methods to analyze a writer’s work to assuming that the writer (or creator in general) was rational, or even knew and understood all the information that the data analysis has now revealed to us. This is a mistake, and we should be cautious before drawing conclusions about the creator’s intention.