Text Encoding

This is a republishing of an article I wrote for the Strange Bedfellows project.  The original url is here.

I was recently working with an academic project housed at the University of Toronto called Records of Early English Drama, or REED.  REED has been around for a long time, since 1975, and has previously disseminated its work through print volumes.  It has just begun the process of exploring and expanding its work into the digital realm.  I feel like this is a situation that many long-standing research projects are in right now.  Print has become second class compared to the digital, yet printed volumes both hold the project’s existing work and are what the project is familiar with producing.  In this case, there is a simultaneous need to “re-publish” past volumes and publish new research in a digital format.

I have found that this surge towards the digital has created the need for a new type of research associate in these cases, one that is much more computer savvy.  However, given the scarcity of funding in general and the established communities for academics in the “digital humanities,” oftentimes an existing member of a research team essentially retrains or expands their focus in the project to include these digital topics.  I believe that this is backwards in the sense that a project looks at specific digital components as a means to an end rather than approaching the digital holistically.  The problem with this comes to a head in Text Encoding.

REED is in the process of creating a digital prototype for their new collection of records and they have decided to use TEI and EATS elements for the project.  TEI (Text Encoding Initiative) is probably the oldest academic text encoding project, at least to my knowledge, and a lot of projects support it.  TEI is useful for “online publication, searching, text analysis, and conversion into other formats”.[1]  EATS stands for the Entity Authority Toolset.  EATS was designed so that “authorities” and their corresponding data could be encoded into a text and then used for displaying or searching for that data.  It is both a front-end and back-end platform in terms of its functionality within a digital text.

Historically, there have always been inherent problems encoding text for the humanities.  For example, much time was spent on encoding the meta-data of Jerome McGann’s Rossetti Archive and, as a result, it has a resounding digital structure.  But the drive for text encoding consumed so much of the project’s energy (due to the volume of the data) that little was left for analytical tools to use that data.  Because text encoding is so strenuous and overbearing, it often becomes the result of a project rather than its foundation.  For instance, during the short time I was working with REED, several hours were devoted to troubleshooting problems with getting EATS to work with Oxygen.  In addition, if you take a look at the sheer number of elements that are part of TEI, you can understand why any unproductive time is even more bitter.  Text encoding presents a problem in this attempt to essentially “republish” printed works through a digital medium.  The problem is not whether it can be done but whether the time spent on such efforts is ever returned.

I first approached a very similar situation at the end of 2011 and documented it on my blog, All Is True.  In my post, I looked at trying to encode Shakespeare’s plays into a JSON format rather than in XML.  I had several reasons for doing this: speed of machine parsing, ease of human reading, and file size across internet bandwidth.  I ran into an unexpected question when I simply began making a mock-up model of a play’s structure in JSON.  It had to do with whether the definition of a ‘character’ was within a ‘line’ or whether a ‘line’ was within a ‘character’.  This was intriguing on many levels, especially in regard to genre theory, but it was eventually solved by fixing a bracket problem in the code.
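
To make the two nestings concrete, here is a minimal Python sketch.  It is not the mock-up from my original post; the key names ("n", "speaker", "text") and the helper functions group_by_speaker and flatten are hypothetical, chosen only for illustration, and the three lines are the opening of Hamlet.

```python
import json

# Hypothetical mock-up: the key names below are assumptions for
# illustration, not the schema used in the original blog post.

# Option A: a 'character' within a 'line' (each line names its speaker).
lines_first = [
    {"n": 1, "speaker": "Bernardo", "text": "Who's there?"},
    {"n": 2, "speaker": "Francisco",
     "text": "Nay, answer me: stand, and unfold yourself."},
    {"n": 3, "speaker": "Bernardo", "text": "Long live the king!"},
]


def group_by_speaker(lines):
    """Option B: a 'line' within a 'character' (each speaker owns lines)."""
    grouped = {}
    for line in lines:
        grouped.setdefault(line["speaker"], []).append(
            {"n": line["n"], "text": line["text"]}
        )
    return grouped


def flatten(grouped):
    """Rebuild option A from option B, using line numbers to restore order."""
    flat = [
        {"n": line["n"], "speaker": speaker, "text": line["text"]}
        for speaker, speaker_lines in grouped.items()
        for line in speaker_lines
    ]
    return sorted(flat, key=lambda line: line["n"])


characters_first = group_by_speaker(lines_first)
print(json.dumps(characters_first, indent=2))
```

As long as the line numbers are kept, either shape can be rebuilt from the other, so the choice mostly affects which questions are convenient to ask of the data (reading the play in order versus pulling everything one character says).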

This false positive made me think further about the results of these different encoding styles.  With the ever-increasing processor speed of computers and the quickly growing capacities of the internet, the advantages of a JSON format over an XML format become fairly negligible.  This further pushed me to question whether or not encoding was even necessary.  Cannot the “Find” command on an internet browser work faster than a query in a search box on the same webpage?  While I think that questions posed by text encoding can reveal new topics of discussion, I question whether the time spent on such endeavors will ever be returned, whether in terms of faster searches, richer meta-data, etc.  Furthermore, will the time spent “republishing” printed texts ever yield a return on investment?

In the end, I think that analytical and creative efforts should be directed toward dynamic apparatuses which can adapt and delve into the data provided, rather than spending all of our time curating the data.  Google Books is a good example of this practice; it digitizes millions of texts given only the words that are recognized.  It then builds a framework around that which allows users to browse, search, and manipulate the data.  (Take Google Ngram Viewer, for example.)  Nothing is special about this data, except that it is machine recognizable, yet interesting research can still occur.  In the case of digitizing research, like REED’s, I believe that any emphasis on a prototype should be on its form rather than on its content.

[1] http://www.tei-c.org/About/faq.xml

