My topic modeling dataset was a set of 373 abstracts from the 1996-2003 annual conferences of the Alliance of Digital Humanities Organizations (the Digital Humanities Conference). Running the topic modeling tool, which automagically extracts topics from texts, for 20 topics, 200 iterations, and 10 words per topic resulted in the list below. The text in parentheses after each topic is my name for it, or my definition of what I think the topic is about.
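To make the three settings (topics, iterations, words) concrete, here is a minimal sketch in Python. It is a toy collapsed Gibbs sampler for LDA, one common way of fitting topic models, and should not be read as the tool's actual implementation. The function name `toy_lda`, the `alpha` and `beta` smoothing values, and the four tiny sample documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def toy_lda(docs, num_topics, iterations, top_words, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns the top words of each topic."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    # Count tables: topics per document, words per topic, total words per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start by assigning every word token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for word in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(zs)
    # Each iteration revisits every token and re-draws its topic.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Weight each topic by its fit to this document and this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1
    # List each topic's words from most to least frequently assigned.
    return [
        [word for word, count in topic_word[k].most_common(top_words) if count > 0]
        for k in range(num_topics)
    ]


# Hypothetical mini-corpus, with words borrowed from the topic list below.
docs = [
    "sgml markup encoding document tei".split(),
    "markup encoding sgml dtd document".split(),
    "corpus language linguistic english speech".split(),
    "language corpus annotation english linguistic".split(),
]
topics = toy_lda(docs, num_topics=2, iterations=50, top_words=3)
for number, words in enumerate(topics, start=1):
    print(number, " ".join(words))
```

In this sketch the number of topics is fixed in advance, each iteration is one more pass over every word in the corpus, and the "words" setting only controls how many of a topic's words are printed, not how the topic is computed.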

Naming and defining each of the 20 topics was not an easy task for me. Questions abounded: Do I need to take all 10 words in a topic into account, so that all ten are reflected in the topic’s name? Do some words in a topic carry more weight, or meaning, than others? (In class Zoe Borovsky mentioned that the words at the front of the list are the ones that appear most frequently in that topic.) If so, should my topic name reflect the meaning of the first few words only? What about topics that seem to be about two or more things? Can topics have sub-topics? It became clear to me how subjective and interpretive this process is. Professor Posner helped clarify this task for me: “If we were trying to write a publishable paper, we’d do a lot of checking back and forth between the text and the topic; but what we’re doing is just a gesture toward classification, not a hard-and-fast organizational system into the naming/defining of these topics.” This helped me realize that having a clear idea from the outset of why one is doing the extracting – what is to be learned and accomplished from it – is very important.

The twenty topics and their names/definitions:

1. time process work years case forms large found important form (work process – length of time)

2. knowledge historical form representational social meaning tradition significant level cultural (historical knowledge – social, cultural significance/meaning) – This one was difficult to interpret and name. Scanning some of the actual abstracts, I found the following themes, which helped somewhat:

  • personal and cultural forces shape how people organize information
  • structural theories of reading
  • user cultures, literacies and conventions
  • historical ways time and temporality have been conceptualized
  • community in an online context
  • philosophical analysis of “representation” and “interpretation”

3. de la des les le du une en dans pour (foreign language words – of, the, a)

4. project images image history work projects design ptr materials scholars (history of image projects)

5. tools xml html target http texts text display based (tools used for display of texts)

6. order language information rules semantic based structure analysis common languages (analysis of language)

7. user system web software users interface internet material make delivery (information system)

8. information documents document research text data xml literary retrieval field (retrieval of text documents)

9. humanities computing research university resources technology teaching development support community (humanities computing research in academia)

10. gt lt sgml markup tei document encoding dtd documents elements (encoding and markup of documents)

11. data emph rend italics terms features frequency components fragments component (data components and features)

12. text textual texts literary theory hypertext reading encoding writing critical (theory – reading and writing)

13. text word texts word number order line amp context set (words and texts/context)

14. electronic edition editions manuscripts publication manuscript scholarly university project book (electronic edition academic book)

15. analysis words texts authorship study style studies authors middle statistical (analysis of writing style of authors)

16. model paper set approach specific problem problems features work level (approach for modeling a specific problem)

17. language corpus linguistic english annotation corpora data speech analysis linguistics (English language corpus)

18. digital data information database library metadata collections objects research databases (research in/of digital library databases or library collection or metadata for digital database)

19. dictionary information verb system entries entry dictionaries translation form syntactic (dictionary entries)

20. students learning student writing multimedia technology web group create courses (students learning multimedia technology)

Confusion with TopicsinDocs.csv

Next I opened the TopicsinDocs.csv list for this same run of 20 topics, 200 iterations, and 10 words.

Scanning the TopicsinDocs.csv list, I became interested in documents/abstracts that contained only one topic. For example, document/abstract 30 has a single “top topic,” which is topic 3. This topic makes a 92.1% contribution to the document/abstract.
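The same scan can be sketched in Python rather than done by eye. The column layout below (document number, file name, then alternating topic and proportion columns) is an assumption about what TopicsinDocs.csv looks like, and the file names and the row for document 31 are made up for illustration; only the values for documents 30 and 66 come from my run.

```python
import csv
import io

# Hypothetical excerpt. The layout is an assumption: document number,
# file name, then alternating (topic, proportion) columns.
SAMPLE = """docId,filename,toptopic1,proportion1,toptopic2,proportion2
30,doc_a.txt,3,0.921,,
31,doc_b.txt,9,0.48,18,0.31
66,doc_c.txt,3,0.938,,
"""


def single_topic_docs(csv_text):
    """Return (doc id, topic, proportion) for rows that list exactly one topic."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows)  # skip the header row
    found = []
    for row in rows:
        if not row:
            continue
        pairs = row[2:]
        # Collect the non-empty (topic, proportion) pairs in this row.
        topics = [
            (int(pairs[i]), float(pairs[i + 1]))
            for i in range(0, len(pairs) - 1, 2)
            if pairs[i]
        ]
        if len(topics) == 1:
            found.append((int(row[0]), topics[0][0], topics[0][1]))
    return found


for doc_id, topic, proportion in single_topic_docs(SAMPLE):
    print(f"document {doc_id}: topic {topic} at {proportion:.1%}")
```

To run it on the real file, the text of TopicsinDocs.csv would be passed in place of `SAMPLE`, with the column positions adjusted to match the actual file.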

I then went to the Topic Index – List of Topics (html) to find the word cluster for this topic (topic 3), which is “de la des les le du une en dans pour.”

Clicking on this word cluster-topic led me to a list of the top-ranked documents/abstracts in this topic (#words in doc assigned to this topic). I clicked on document/abstract #30 to see what more I could find out about “de la des les le du une en dans pour” in the context of this document/abstract.

At the actual document/abstract, I am given the file name (DOC: 2000_paper_662_loiacono.txt), the title, “Primroses and Power: a Study on Linguistic Excellence in Political Discourse,” and the text of the document/abstract itself. Scanning the text, I did not find any of the words “de la des les le du une en dans pour.”

Furthermore, this page contains a list: “Top topics in this doc (% words in doc assigned to this topic).” The topic “de la des les le du une en dans pour” is nowhere to be found in this list.

I tried this again with another document. Document/abstract 66 also contains only this one topic, “de la des les le du une en dans pour,” at 93.8%. I went through the same process and ended up at the document itself (DOC: 2002_paper_110_gabler.txt), “There is Virtue in Virtuality. Future Potentials of Electronic Humanities Scholarship.” Again, I did not see “de la des les le du une en dans pour” in the document/abstract itself, nor was it listed under “Top topics in this doc (% words in doc assigned to this topic).” I am thoroughly confused by this.
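My scan of the abstracts for the topic's words can also be done as a rough cross-check in Python. This is only a sketch: the helper names are my own, the English sample is just the title of the first abstract rather than its full text, and the French sentence is invented to show what a positive result looks like.

```python
import re
from collections import Counter

TOPIC_3 = "de la des les le du une en dans pour".split()


def tokenize(text):
    # Lowercase the text and keep runs of letters (accented letters included).
    return re.findall(r"[^\W\d_]+", text.lower())


def topic_word_counts(text, topic_words):
    """Count how often each topic word occurs in the text."""
    counts = Counter(tokenize(text))
    return {word: counts[word] for word in topic_words}


def topic_share(text, topic_words):
    """Fraction of the text's word tokens that are topic words."""
    tokens = tokenize(text)
    wanted = set(topic_words)
    hits = sum(1 for token in tokens if token in wanted)
    return hits / len(tokens) if tokens else 0.0


# Title of the abstract discussed above (not its full text).
english_title = (
    "Primroses and Power: a Study on Linguistic Excellence in Political Discourse"
)
# Invented French sentence, for comparison only.
french_sample = "Le projet porte sur l'analyse des textes dans une base de documents"

print(topic_word_counts(english_title, TOPIC_3))
print(f"{topic_share(english_title, TOPIC_3):.1%}")
print(f"{topic_share(french_sample, TOPIC_3):.1%}")
```

The share computed here is a raw word count, whereas the tool's percentage reflects how it assigned words to topics, so the two numbers need not agree. Even so, a document reported at over 90% for this topic that contains none of its ten words suggests it is worth checking whether the document number in the CSV and the page I opened refer to the same file.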

Different iterations

I ran the topic modeling tool for 5 topics, 200 iterations, and 10 words, and then again for 5 topics, 2000 iterations, and 10 words. I wanted to compare different numbers of iterations to see whether more iterations would yield more refined or nuanced results. There wasn’t much difference between the runs. The 200-iteration and 2000-iteration results were very similar in the word clusters found in the topics, although the markup topic and the French-word topic traded places in the list (3 and 5):

1. language corpus word information english linguistic based dictionary data system (200)

1. language corpus word analysis texts words text english linguistic data (2000)

2. humanities project research university digital students web electronic computing information (200)

2. humanities digital research project students university web information computing resources (2000)

3. text gt lt document sgml xml data encoding markup documents (200)

3. de la des les le authorship style de une authors (2000)

4. texts texts analysis literary work words study studies theory reading (200)

4. text electronic texts literary textual hypertext edition writing computer (2000)

5. de la des les le du une en dans pour (200)

5. gt lt text document sgml data xml encoding markup documents (2000)
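One way to put a number on "very similar" is to measure how many words each pair of topics shares. The sketch below uses Jaccard overlap (shared words divided by all distinct words in the two lists) and, for each topic in the 200-iteration run, looks up the closest topic in the 2000-iteration run regardless of its position in the list. The word lists are the ones given above; the choice of Jaccard overlap and the helper names are my own assumptions, not something the tool reports.

```python
RUN_200 = {
    1: "language corpus word information english linguistic based dictionary data system",
    2: "humanities project research university digital students web electronic computing information",
    3: "text gt lt document sgml xml data encoding markup documents",
    4: "texts texts analysis literary work words study studies theory reading",
    5: "de la des les le du une en dans pour",
}
RUN_2000 = {
    1: "language corpus word analysis texts words text english linguistic data",
    2: "humanities digital research project students university web information computing resources",
    3: "de la des les le authorship style de une authors",
    4: "text electronic texts literary textual hypertext edition writing computer",
    5: "gt lt text document sgml data xml encoding markup documents",
}


def jaccard(words_a, words_b):
    """Shared words divided by all distinct words in the two lists."""
    a, b = set(words_a.split()), set(words_b.split())
    return len(a & b) / len(a | b)


def best_matches(run_a, run_b):
    """For each topic in run_a, find the most similar topic in run_b."""
    matches = {}
    for number_a, words_a in run_a.items():
        number_b = max(run_b, key=lambda n: jaccard(words_a, run_b[n]))
        matches[number_a] = (number_b, round(jaccard(words_a, run_b[number_b]), 2))
    return matches


for number_a, (number_b, score) in best_matches(RUN_200, RUN_2000).items():
    print(f"200-iteration topic {number_a} -> 2000-iteration topic {number_b} (overlap {score})")
```

An overlap of 1.0 means the two lists contain exactly the same words, and 0.0 means they share none. Because the comparison ignores list position, it also shows when a topic has simply moved to a different number between runs.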

Conclusion

Although my experience with this topic modeling tool was often frustrating and left me with more questions than answers (perhaps that is the point), I am definitely interested in continuing to work with it to examine large bodies of text related to the arts, such as dance reviews from the New York Times or scholarly articles from PAJ: A Journal of Performance and Art or Movement Research Performance Journal. The ability to distantly read large swaths of historical or contemporary conversations in dance and theater, in order to discover something new about those conversations or to find patterns in themes, ideas, events, people, and so on, is exciting to me. I am also interested in learning how to use a modeling tool for graphic objects such as Labanotation scores in order to analyze changes in modern dance choreography over the years. Lastly, I am curious whether a tool exists to topic model (movement model?) choreography on film or video.
