Since the beginning of the year, I have had several projects going which, after completion, sat in the refrigerator of my mind. As I mentioned before, this has resulted in a kind of textual pile-up in my brain, and so I will be trying to write a couple of posts retroactively, although my own experiments have since gone further.
The first of the projects that I want to mention was finished back in February. It focused solely on separating a main character's lines from each of Shakespeare's plays. After the data preparation, my results were simultaneously confusing and captivating.
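To give a sense of the data preparation, here is a minimal sketch of how a play might be split into the main character's lines and everything else. It assumes a plain-text edition where each speech opens with the speaker's name in capitals on its own line; the function name and the heading convention are my own illustration, not the actual script I used.

```python
def partition_speeches(play_text, main_speaker):
    """Split a play's text into the main character's speeches and the rest.

    Hypothetical convention: each speech begins with the speaker's name
    in capitals on its own line (e.g. "VIOLA."), as in many plain-text
    editions; real editions vary, so this is only a sketch.
    """
    character_lines, remainder_lines = [], []
    current = None  # speaker of the speech we are currently inside
    for line in play_text.splitlines():
        heading = line.strip().rstrip('.')
        if heading and heading.isupper():
            current = heading  # a new speech begins here
        target = character_lines if current == main_speaker else remainder_lines
        target.append(line)
    return "\n".join(character_lines), "\n".join(remainder_lines)
```

Writing the two returned strings to files, alongside the untouched original, yields the three versions of each play described below.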
The first image shows, for each play, three text files: the original text of the play, the play's text with the main character's lines removed, and the main character's lines alone. The abbreviations at the end of the titles stand for the following: ‘rev.txt’ is the original text file of the play (as in Twelfth Night rev.txt), ‘w-o.txt’ is the play without the main character’s words, and ‘.txt’ is the main character’s lines only. The graph is a dendrogram produced with Ward’s clustering method, with the Cluster-level analysis on the left-hand side and the LAT-level analysis on the right. The process of choosing the main character for each play was a continuation of the method described here.
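The clustering step itself can be sketched in a few lines of SciPy. The file names, feature columns, and numbers below are invented placeholders (the real analysis used Cluster- and LAT-level frequencies for every play), so this is only a sketch of the method, not the actual data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical feature matrix: one row per text, one column per
# linguistic category (stand-ins for the LAT frequencies).
labels = ["Twelfth Night rev.txt", "Twelfth Night w-o.txt", "Viola.txt",
          "Hamlet rev.txt", "Hamlet w-o.txt", "Hamlet.txt"]
features = np.array([
    [0.21, 0.13, 0.05],
    [0.20, 0.14, 0.05],
    [0.10, 0.25, 0.12],
    [0.33, 0.08, 0.04],
    [0.32, 0.09, 0.04],
    [0.15, 0.22, 0.11],
])

# Ward's method merges, at each step, the pair of clusters whose union
# produces the smallest increase in within-cluster variance.
Z = linkage(features, method="ward")

# Cutting the tree into three flat clusters: with these invented numbers,
# each play's 'rev' and 'w-o' versions land together, while the two
# character-only files form their own group.
groups = fcluster(Z, t=3, criterion="maxclust")
```

Passing `Z` (and `labels`) to `scipy.cluster.hierarchy.dendrogram` renders the kind of tree shown in the image.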
This diagram provides an objective view of the corpus in a way that, to my knowledge, has not been tested before: a variant test. The variants in this case are the three different sets of data present in the diagram, which are, in essence, a single corpus; their true identity lies somewhere between a wholly different edition of the plays and an editor's working-over of the corpus as it stands. Separating the main character from a play makes a surgical incision into it, rather than the random variation of printing or the educated decisions of an editor. Since the nature of these changes to the original ‘rev’ plays differs from both of those, the results must also be kept distinct from previous knowledge in either of the other categories. For this reason, I do not wish to make any judgments about genre based upon this diagram. Instead, I believe the focus should be on the objective view it gives.
The color-coding scheme reflects the objectivity of the view by disregarding previous color distinctions based on genre and instead assigning a color to each set of data in this diagram. It is interesting to note that every orange ‘w-o’ text clusters with its respective black ‘rev’ counterpart in both the Cluster and the LAT diagrams. In addition, the characters only ever pair up in the bottom half of the Cluster diagram. To me, this kind of test is like examining a blood sample as it clots. The clots are formed by the ‘rev’ and ‘w-o’ plays and tend to stay in one place, while the characters are free-floating white blood cells in the multi-dimensional medium. The meandering of the characters’ lines, and their associations with each other and with the plays around them, is rather peculiar as well. None of the characters stay with the same play or plays between the Cluster and LAT diagrams, and of the characters that had grouped together in the Cluster diagram, only Henry V and Troilus, and Lear and Timon, remain close to each other. The reason behind this is perplexing, especially when we dissociate these findings from genre. These results appear to work against my previous schema of language as a series of layers that move and slide against one another, with top-level forms like genre and bottom-level forms like articles and prepositions. With the way these characters’ lines shuffle about the dendrogram, the commonality between them and the larger plays lessens in a way that suggests the size of a text is part of the structure of language, in addition to other textual features.
The importance of size to understanding the written form of language is a perplexing question, and I would argue that it all boils down to subjectivity. I do not believe that I gain the same amount of information from a perfectly crafted haiku as I do from a free-form poem with the same number of words. At the same time, I have read single sentences in novels that speak more to me than whole books do. For now, though, this kind of query is best left alone, for the simple reason that if size is a factor in a vision of language as a multitude of layers, it casts doubt on the validity of previous results obtained with these same tools.
Even though I feel certain that we can rule out free association as an influence on these results, we are still left without an explanation for this diagram. If size really does matter, do we need to go back to the drawing board and create a new tool? Or is something else entirely at work here?