Sunday, September 03, 2006

Thesis update: Revising proposal, going granular with Yoshikoder

It's been a while since I wrote a thesis update, but I have made a lot of progress, especially in the last two weeks, during a new data-collection phase.

As I mentioned earlier this year, I did complete my thesis proposal back in February. It was approved by the Extension School, and then in June it was tentatively accepted by a Harvard Faculty of Arts and Sciences professor who not only specializes in modern Chinese history and government policy, but also has experience conducting quantitative research.

However, before letting me start on writing the thesis, he had a few questions and suggestions regarding the proposal:
  1. What was I really interested in concentrating on in my research -- Chinese policy toward Vietnam, or a quantitative methodology for studying Chinese policy? My answer: the methodology. Vietnam happened to be a convenient example. He therefore suggested that I rework my thesis proposal to stress why my methodology is useful and what advantages it holds over certain qualitative methodologies, and to use China's Vietnam policy as an example.

  2. How could my quantitative methodology be improved? In the February version of the proposal, I based my hypotheses on simple frequency counts of NCNA news items that mentioned Vietnam and other countries. "Go granular" with the NCNA content was the FAS professor's advice. He asked me to look at a bunch of different content analysis programs that can perform keyword-in-context (concordance) analysis, dictionary analysis, and other tasks beyond simple frequency counts.
I settled on a neat little program called Yoshikoder, which was developed by Will Lowe and others at Harvard as part of the Identity Project at Harvard's Center for International Affairs. Here's a description of Yoshikoder's functionality, from Will Lowe's website:

Yoshikoder allows you to load documents, construct and apply content analysis dictionaries, examine keywords-in-context, and perform basic content analyses, in any language.

In more detail: Yoshikoder works with text documents, whether in plain ASCII, Unicode (e.g. UTF-8), or national encodings (e.g. Big5 Chinese.) You can construct, view, and save keywords-in-context. Content analysis dictionaries can be constructed using PERL-style regular expressions. Yoshikoder provides summaries of documents, either as word frequency tables or according to a content analysis dictionary. You can also compare documents according to word frequency profile or with respect to a content dictionary. Yoshikoder's native file format is XML, so dictionaries and keyword-in-context files are non-proprietary and human readable.
One of the additional benefits of the program is that it works on Mac OS X, which I use at home. There are also Linux and Windows versions.
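
To make the difference between a plain word-frequency table and a dictionary-based count concrete, here is a minimal Python sketch. It is not Yoshikoder's code or file format, just an illustration of the idea. The category names, the regular-expression patterns, and the sample sentence are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical dictionary: category name -> list of regular-expression patterns.
# These entries are invented for illustration, not taken from any real dictionary.
DICTIONARY = {
    "negative": [r"aggress\w*", r"threat\w*"],
    "positive": [r"friend\w*", r"cooperat\w*"],
}

# Made-up sample sentence.
SAMPLE = "Hanoi's aggression is a threat; the threats undermine friendly cooperation."


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Plain frequency table: how often each individual word occurs."""
    return Counter(tokenize(text))


def dictionary_counts(text, dictionary):
    """Count how many tokens fall into each dictionary category.

    A token is counted once per category if it fully matches
    at least one of that category's patterns.
    """
    counts = {category: 0 for category in dictionary}
    for token in tokenize(text):
        for category, patterns in dictionary.items():
            if any(re.fullmatch(pattern, token) for pattern in patterns):
                counts[category] += 1
    return counts


if __name__ == "__main__":
    print(word_frequencies(SAMPLE).most_common(3))
    print(dictionary_counts(SAMPLE, DICTIONARY))
```

The frequency table treats "threat" and "threats" as unrelated words, while the dictionary count groups them under one category, which is roughly what moving beyond simple frequency counts means here.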

Having a tool is one thing; knowing how to use it is something else entirely. My FAS thesis director did not tell me how to use the tool beyond giving a five-minute demonstration of the interface. Nor did he tell me what methodology I should use, i.e. how I should "go granular" with the NCNA content. It was up to me to design tests of the NCNA data using Yoshikoder that would give more insights into Chinese attitudes toward Vietnam during the Deng Xiaoping era.

I've found that the best way to learn about an application or a tool, or pick up a skill, is to do something practical that lets you test out the tool or skill. That's one of the reasons I started this blog -- it helps me hone my writing skills, and forces me to stay on top of developments concerning my research interests, namely Chinese mass media and modern Chinese history. That's also how I started to use computer-assisted content analysis. Coursework and reading introduced the concepts, but I was able to conduct my first computer-assisted content analysis last year for my Modern Chinese Emigration class (see the results in my term paper from that class, China's Emerging Overseas Chinese Policy in the Late 1970s and Implications for Ethnic Chinese Communities in Vietnam and Kampuchea).

As for learning how to use Yoshikoder, I had a practical opportunity to give the tool a spin at the 2006 Summer School, for my class Film And History: Postwar Japan and Post-Mao China. I used it to compare NCNA coverage of Chinese film directors Xie Jin (谢晋) and Zhang Yimou (张艺谋). This allowed me to get a feel for Yoshikoder's strong functionality, as well as learn a few lessons about how this powerful tool can give unexpected results -- my data was skewed in unusual directions because of the small size of the samples and the tone of the articles in the samples. (Read the results in my final paper, Evaluating Official Attitudes Toward Post-Mao Chinese Film Through a Quantitative Lens. The Yoshikoder data and analysis are described on pages 16-18.)

In the past two weeks, I've used Yoshikoder in two ways to "go granular" with NCNA data relating to Vietnam. I've conducted dictionary word counts, and performed aggregated concordance analyses. This has involved several steps:

  1. Deciding how I would test the NCNA data. In other words, what indicators would help me determine which issues relating to Vietnam -- Vietnam's regional ambitions, or its relationship with the Soviet Union -- were more important to Beijing during the Deng period? I decided that taking NCNA news items that are specifically about regional issues, and comparing them with sample news items about USSR issues, was the best course, but I had to isolate Kampuchea-related items, as the war that dominated the country from the 1970s to the late 1980s would result in a higher incidence of negatively-themed articles.

  2. Building NCNA-specific dictionaries that Yoshikoder can use to analyze my sets of samples. I did this by selecting 21 NCNA articles with "Vietnam" (or variants) in the headline, dating from 1977-1993, concatenating them into a single text file, using Yoshikoder to create a list of all the words used in all 21 articles, and then selecting words that could be construed as positive or negative. From the list of 200 or so negative words (and variants, captured by adding a wildcard character) I eliminated words relating to military actions or armed aggression, which would show up in any items relating to the Kampuchean conflict. I then added all three NCNA dictionaries (positive, negative, and what I called "insecurity") to a file containing a positive and a negative dictionary from the General Inquirer content analysis program, each containing many hundreds of terms used to analyze political texts.

  3. Creating samples of NCNA items relating to five types of news articles:

    1. Vietnam and USSR terms in the headline, but no other refs to Kampuchea, other regional countries or ASEAN, or the U.S. in the full text

    2. Vietnam and USSR and Kampuchea terms in the headline, but no other refs to other regional countries or ASEAN, or the U.S. in the full text

    3. Vietnam and Kampuchea and other regional countries in the headline, but no other refs to the USSR or the U.S. in the full text

    4. Vietnam and Kampuchea in the headline, but no other refs to the USSR or the U.S. or other regional countries in the full text

    5. Vietnam and other regional countries in the headline, but no other refs to the USSR or the U.S. or Kampuchea in the full text

    The samples were gathered using the LexisNexis interface. For each year, I would determine how many of each type existed. If the number was 10 or fewer, I would take all of the news items and concatenate them for later analysis. If the number was more than 10, I would use an Internet random number generator to select 10 random numbers, and then pick the corresponding news items in LexisNexis, and stitch them together.

  4. The dictionary-based frequency counts were performed on the samples for each type and year, and entered into Excel.

  5. I also performed "concordance reports" for all sample types and certain years. This involves taking a dictionary of words -- in this case, four terms relating to Vietnam (vietnamese or vietnam or nam or hanoi) -- and having Yoshikoder show me the context in which they appear for each sample, across all sample types. Two examples, with a concordance "window" of 5 terms in each direction:

    ... threat to the leadership in hanoi the vietnamese authorities had all ...
    ... china's new ambassador to viet nam DATELINE hanoi october 11 1977 ...

    The idea is not to look at every sentence and apply my own eyeballs to the results, but rather to let Yoshikoder do all the work, by taking these contexts of "Vietnam"-related terms, and then measuring the number of positive/negative/"insecurity" terms in each sample type over time. My guideline for picking the years: There had to be at least six news items in the sample (to improve the quality of the data, and lessen the chance of bias based on a handful of articles with a certain tone) and there had to be at least one Soviet-related as well as one regional-related sample in a given year. This restricted my concordance reports to 1978, 1979, 1980, 1985, 1986, 1987, 1988, and 1991.
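
The sampling rule from step 3 and the aggregated concordance idea from step 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Yoshikoder or LexisNexis work internally. The function names are hypothetical, the one-entry "negative" dictionary is invented for the example, and Python's `random` module stands in for the Internet random number generator. The four keywords and the sample text come from the concordance example above.

```python
import random
import re

# The four Vietnam-related terms used for the concordance reports.
KEYWORDS = {"vietnamese", "vietnam", "nam", "hanoi"}

# Invented one-entry dictionary, for illustration only.
EXAMPLE_DICTIONARY = {"negative": [r"threat\w*"]}


def pick_sample(items, rng, sample_size=10):
    """Keep every item when there are `sample_size` or fewer; otherwise draw a random subset."""
    items = list(items)
    if len(items) <= sample_size:
        return items
    return rng.sample(items, sample_size)


def tokenize(text):
    """Lowercase the text and split it into word and number tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def concordance(tokens, keywords, window=5):
    """Return one context per keyword hit: up to `window` tokens on each side."""
    contexts = []
    for i, token in enumerate(tokens):
        if token in keywords:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(left + [token] + right)
    return contexts


def score_contexts(contexts, dictionary):
    """Count dictionary hits per category across all concordance contexts."""
    totals = {category: 0 for category in dictionary}
    for context in contexts:
        for token in context:
            for category, patterns in dictionary.items():
                if any(re.fullmatch(pattern, token) for pattern in patterns):
                    totals[category] += 1
    return totals


if __name__ == "__main__":
    text = "threat to the leadership in hanoi the vietnamese authorities had all"
    contexts = concordance(tokenize(text), KEYWORDS)
    for context in contexts:
        print(" ".join(context))
    print(score_contexts(contexts, EXAMPLE_DICTIONARY))
    print(pick_sample(range(1, 51), random.Random()))
```

One design caveat: when two keywords sit close together, their windows overlap, and this sketch counts a dictionary term once for each window it falls in. Whether that is desirable depends on the analysis, and the sketch makes no claim about how Yoshikoder handles it.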
As of Friday night, I have completed all of the data collection steps and entered the results into Excel (Yoshikoder has export to HTML and export to Excel functions). Right now I am in the preliminary analysis phase. I might share some charts later in the week, but my main priority is communicating the findings to my FAS thesis director, getting my proposal approved by him, and then starting to write the thesis!


Al ve oku said...

I don't know if you are still tracking the comments -- but I'm wrestling with Yoshikoder myself, now! The dictionaries packaged with the software (LIWC, RID, and Laver/Garry) did not generate the kind of insights I needed. So I need to, it seems, "roll my own." I hope things turned out well for you!

I Lamont said...

Al: If you want, I can email you a copy of my dictionaries -- the General Inquirer positive and negative, as well as some custom dictionaries I designed based on Xinhua content.

The results I got were striking, once I had entered the data for each sample's run into Excel and graphed it.

My email is ianlamont at post dot harvard dot edu