Saturday, January 14, 2006

Problems with search terms in LexisNexis

I'm close to completing the hard-core research that will form the basis of my thesis. Before I wrap up the final 1000 or so searches, I wanted to talk a little bit about the search terms I am plugging into LexisNexis Academic.

First, a quick recap of what I am doing with my research: I am attempting to use the official New China News Agency (NCNA) English wire service as a barometer of Beijing's various policies toward Vietnam during the Deng era, specifically those policies that involve or refer to two regional countries (Kampuchea and Laos), the two superpowers, the United Nations and the Association of Southeast Asian Nations. We know from reading NCNA news articles and other sources that China saw linkage between issues relating to Vietnam and these other countries and organizations, but what I am doing is measuring references to them singly and in combination, and analyzing the findings. This methodology has been used extensively in the study of international relations, political science, and the media, but only to a limited extent in historical research.

Before I even started searching NCNA for references to these countries and organizations, I had to devise lists of terms that refer to each one. My aim was a list that draws out the most possible "hits" for each country/organization without including results that have no connection at all but might show up because of language issues or problems with the software I am using (LexisNexis Academic).

I tested extensively for names of countries, names of political leaders, variations of these names, acronyms, and more, and came up with a set of search terms for each variable V, K, L, U, S, I, and A.

Thus, the list of search terms I use every time I want to search for NCNA articles referring to Vietnam (V) is:

vietnamese or vietnam or "viet nam" or Hanoi

"Viet Nam" is the standard spelling in NCNA for Vietnam, but sometimes stories will use Vietnam, or not even refer to the country itself, but will refer "Vietnamese troops" or "Hanoi's actions."

Laos (L) was a bit tougher, even though the search terms are very basic:

Laos or Laotian or Vientiane

The problem here is that LexisNexis doesn't differentiate between singular and plural versions of the same word, and automatically strips the letter "s" from the end of words that have it, regardless of whether the word is a plural or not. Laos is not a plural. That's OK because the system will still count references to Laos, but the problem is that "Lao" is also a surname in China. In fact, one of China's most famous playwrights is "Lao She" (老舍), a name that appears in a few dozen English NCNA articles during the Deng period. I couldn't screen out references to him by name, because that would also screen out references to the country. Another famous Chinese person with the same surname is the philosopher Lao Zi (老子). I ended up adding the following string to my search:

[And not] literary or literature or playwright or theat! or "lao zi" or "lao tzu"

It doesn't screen out other people surnamed Lao, such as athletes and officials who might be mentioned in an NCNA dispatch, but it really helps make my measurement of Laos-related articles more accurate.
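For anyone who would rather assemble these strings in a script than retype them, here is a minimal Python sketch. The helper names (`or_group`, `build_query`) are my own invention for illustration, and wrapping each side in parentheses and spelling the connector as `and not` are assumptions; the search form may expect the connector to be chosen separately, as the bracketed [And not] above suggests.

```python
def or_group(terms):
    """Join search terms with 'or', quoting any multi-word phrase."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return " or ".join(quoted)


def build_query(include, exclude=None):
    """Combine an inclusion list with an optional exclusion list.

    The parentheses and the literal 'and not' connector are assumptions
    about how the search box wants the string to be written.
    """
    query = or_group(include)
    if exclude:
        query = f"({query}) and not ({or_group(exclude)})"
    return query


# Term lists copied from the post.
laos_terms = ["Laos", "Laotian", "Vientiane"]
laos_exclusions = ["literary", "literature", "playwright", "theat!",
                   "lao zi", "lao tzu"]

laos_query = build_query(laos_terms, laos_exclusions)
print(laos_query)
```

Keeping the inclusion and exclusion lists as separate variables makes it harder to drop a term by accident when the same search is repeated hundreds of times.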

The United States, however, posed the greatest problem. Take a look at the following set of search terms, and try to figure out what's missing:

"United States" or USA or Washington or "White House" or ( Presiden! w/2 Carter ) or ( Presiden! w/2 Reagan ) or ( Presiden! w/3 Bush ) or ( Presiden! w/2 Clinton )

"Presiden! w/3 Bush" looks for mentions of the word "bush" within 3 words of the word "president." Of course, I am attempting to get references to President George HW Bush, and screening out refs to shrubbery, trees, other people named Bush, etc. "Presiden!" finds all similar words that start with Presiden (president, presidential, etc.).

But there are two major search terms missing: The U.S., and America.

The problem with U.S.: LexisNexis doesn't recognize periods in acronyms, or capital letters, so searching for "U.S." will also return results for the pronoun "us". As for America, the LexisNexis database equates it with North America, South America, Central America, Latin America, and all countries in the hemisphere. References to "North American oil reserves" or the "Central American peace process" and lots of other unwanted results turn up. There is no way around this problem unless I personally screen out unwanted articles, which is something I am trying to avoid, owing to the potential for human error and the extra time involved. Because I am leaving both terms out of the search, I will have to accept that I am probably undercounting articles that reference the United States. This will be acknowledged in my thesis proposal and in the thesis itself.
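The U.S. collision is easy to reproduce with a toy normalizer. The sketch below assumes, based only on the behavior described above, that periods are dropped and case is ignored before matching; the function name `normalize` is hypothetical, and the real indexing rules are surely more involved.

```python
def normalize(token):
    """Drop periods and lowercase, mimicking the behavior described above."""
    return token.replace(".", "").lower()


acronym = normalize("U.S.")
pronoun = normalize("us")

# Both forms collapse to the same string, so a search cannot tell them apart.
print(acronym == pronoun)  # True
```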

OK, back to work ....
