Harvard Extended: Thesis update: A eureka moment

I've been working on chapter 3 of my thesis for the past week or so (see blog updates for chapter 1 and chapter 2), and tonight I just experienced a "eureka!" moment.

This relates to several types of data that I gathered in the Yoshikoder dictionary and concordance reports. The purpose of these reports is to gauge the negative/positive tone of NCNA/Xinhua news items. To this end, I created two custom "dictionaries" of negative and positive terms that are commonly used in the NCNA lexicon, which I then ran against my collection of samples of NCNA news items about Vietnam and other countries during the Deng Xiaoping era. The results of these comparisons were output as a proportion of the total number of words in each yearly sample. I'll excerpt from the draft of chapter 3 to illustrate:

The output from the dictionary reports was expressed as a percentage. For example, the 1977 type V sample contained the full text of 10 randomly chosen news items from that year, totalling 1,965 words. Most of the text from this sample came from the bodies of the NCNA news items, but there was additional text from the headlines, datelines, and some production data (article length, and the individual item numbers). Yoshikoder determined that 81 words from this sample matched terms in the NCNA negative dictionary, or 4.12% (i.e., 81/1965). Yoshikoder also found 115 words in the sample that matched the NCNA positive dictionary, or 5.85% of the total (i.e., 115/1965).

I also created an additional NCNA dictionary, which I called "NCNA insecurity", and ran it against the samples. This dictionary was a subset of the NCNA negative dictionary, and excluded those words which related directly to war, weaponry, military conflict, and military activity. The purpose of this report was to measure those negative words which the NCNA associated with Vietnam in various contexts, but were not connected to the war in Kampuchea, China's 1979 invasion of Vietnam, and other border skirmishes.
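The percentage arithmetic in the excerpt above can be checked with a few lines of Python. This is only a sketch of the calculation, not Yoshikoder's own code; the function name `dictionary_share` is made up for illustration, and the numbers are the 1977 type V figures quoted above.

```python
def dictionary_share(matches: int, total_words: int) -> float:
    """Return dictionary matches as a percentage of all words in a sample."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * matches / total_words


# Figures from the 1977 type V sample described in the excerpt.
TOTAL_WORDS = 1965
negative_pct = dictionary_share(81, TOTAL_WORDS)   # NCNA negative matches
positive_pct = dictionary_share(115, TOTAL_WORDS)  # NCNA positive matches

print(f"NCNA negative: {negative_pct:.2f}%")
print(f"NCNA positive: {positive_pct:.2f}%")
```

Because every report is divided by the same word total for its sample, the percentages from different dictionaries can be compared directly within a sample.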

At first I did not want to use the NCNA insecurity results, because they were extremely low compared to the NCNA negative results -- of the six sample types I tested, there was not one year in which the NCNA insecurity results topped 3%.

But tonight I realized that the terms excluded from the NCNA insecurity dictionary -- all military-related words, like army, assault, soldier, etc. -- are actually turning up a lot, even in sample types that have no direct connection with Vietnam's drawn-out war in Kampuchea. This is worthy of further analysis. In Excel, it was a cinch to quantify these military references for each sample type: I simply subtracted the NCNA insecurity result from the higher NCNA negative result. Because the insecurity dictionary is a subset of the negative dictionary, the difference is the share of words that matched the excluded military terms. I now have a new dataset, which I am calling NCNA military, that can provide new insights into the data I have gathered.
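The subtraction itself is simple enough to sketch in Python. The sample labels and percentages below are placeholders invented for illustration, not results from the thesis, and `military_share` is a hypothetical helper name. The sketch assumes, as described above, that the insecurity dictionary is a strict subset of the negative dictionary, so the insecurity result can never exceed the negative result.

```python
def military_share(negative_pct: float, insecurity_pct: float) -> float:
    """Percentage of words matching the military terms left out of the subset."""
    if insecurity_pct > negative_pct:
        # A subset dictionary cannot match more words than its parent.
        raise ValueError("insecurity result exceeds negative result")
    return negative_pct - insecurity_pct


# Placeholder values for illustration only, NOT figures from the thesis.
# Each entry is (NCNA negative %, NCNA insecurity %).
reports = {
    "sample A": (4.0, 1.5),
    "sample B": (5.0, 2.0),
}

ncna_military = {
    label: military_share(negative, insecurity)
    for label, (negative, insecurity) in reports.items()
}

for label, pct in ncna_military.items():
    print(f"{label}: NCNA military = {pct:.2f}%")
```

The guard against a negative difference is a cheap sanity check: if it ever fires, the subset dictionary probably contains a term that the parent dictionary lacks.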

This would not have been possible if I hadn't taken the time at the beginning of the data collection phase to create a separate dictionary that was a subset of the NCNA negative dictionary. Creating it took about an hour, but Yoshikoder was able to include this subset at the same time I ran the other dictionary reports, with no additional work on my part (thanks to Will Lowe, who created Yoshikoder, for having the foresight to add this functionality!). Therefore, I would advise anyone who uses Yoshikoder for content analysis to consider creating subsets of their dictionaries based on simple themes -- it takes just a little extra effort early on, but can pay big dividends in terms of the amount of data produced and the insights into what the data means.

Saturday, November 11, 2006