Thursday, December 01, 2005

The English language is a minefield when computers get involved

Cambridge, we have a problem.

I just ran into a potential stumbling block with my research. Prior to carrying out the body of my quantitative research, I am testing out LexisNexis using various search terms singly and in combination, to see if any potential word usage problems crop up.

Why? There are two reasons. Number 1: LexisNexis does not recognize periods or capital letters in searches, even when the term is enclosed in quotation marks. Therefore, searching for "u.s.s.r." and "ussr" will return the same results. But "u.s." looks like "us" to the system, so a search for articles mentioning "u.s.", as in "United States", will also return results for "us", the pronoun. This makes it difficult for me to include articles that say "U.S." without saying "United States".
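To make the collision concrete, here is a minimal Python sketch of the behaviour described above. The `normalize` function is my own hypothetical stand-in, assuming only what I observed (periods dropped, case ignored); it is not LexisNexis's actual code.

```python
def normalize(term: str) -> str:
    """Hypothetical model of the observed matching: drop periods, ignore case."""
    return term.replace(".", "").lower()


# Pairs of search terms that should be different but end up identical.
pairs = [("U.S.S.R.", "ussr"), ("U.S.", "us")]

for left, right in pairs:
    same = normalize(left) == normalize(right)
    print(f"{left!r} vs {right!r}: treated as the same term? {same}")
```

Both pairs come out as the same term. The first collision is harmless, since "ussr" means nothing else, but the second merges the country with the pronoun.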

Number 2: A computer system like LexisNexis does not understand what I want it to do unless I tell it exactly what to do. There are expected issues with multiple words applying to the same concept (for instance, Cambodia and Kampuchea). In most cases the operator "or" resolves these conflicts. For instance, firing up LexisNexis, selecting the New China News Agency catalogue, and entering the search string:

kampuchea or kampuchean or cambodia or cambodian or phnom or sihanouk or khmer

will turn up nearly every New China News Agency news item relating to the country if applied to the full text of every NCNA dispatch during the period under study. I have tested for obvious names or terms which might relate to the variable under study without explicitly mentioning that variable by its most common name in English. For instance, "Sihanouk" is included in my planned search string for Kampuchea, because there are a few dozen stories which mention King Norodom Sihanouk without mentioning the country or its people during the period from 1978 to 1992. "Khmer" covers articles which mention the Khmer Rouge or Khmer people.
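Because I will need one such string per variable, it may be easier to keep the terms in a list and generate the string than to retype it each time. A minimal Python sketch, assuming only the "or" operator shown above (the `build_or_query` helper and the list name are hypothetical):

```python
def build_or_query(terms):
    """Join search terms with 'or', lowercased and de-duplicated in order."""
    seen = []
    for term in terms:
        cleaned = term.strip().lower()
        if cleaned and cleaned not in seen:
            seen.append(cleaned)
    return " or ".join(seen)


# Terms taken from the Kampuchea search string above.
KAMPUCHEA_TERMS = [
    "kampuchea", "kampuchean", "cambodia", "cambodian",
    "phnom", "sihanouk", "khmer",
]

query = build_or_query(KAMPUCHEA_TERMS)
print(query)
```

Lowercasing loses nothing here, since the search ignores capital letters anyway.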

But I encountered a big problem when testing for alternate terms in articles that mention the United States without saying "United States". Some terms, like "Washington", work out fine. But "America" and "American" are very problematic, because many stories in the NCNA catalogue use these words yet have nothing to do with the United States: those that mention Central America, South America, Latin America, or North America. Remember, China during the Deng years (the late 1970s, the 1980s, and the early 1990s) still saw itself as a champion of the third world, a critic of the superpowers (then both heavily involved in Latin American conflicts), and an active counterweight to Taiwan's influence in the region. Thus, there are thousands of NCNA stories that discuss issues relating to the Americas but not the United States.
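One possible workaround, if the full text of the results can be saved and checked outside LexisNexis, is to screen each story locally. The Python sketch below is an illustration under that assumption, not a LexisNexis feature: it counts "America" or "American" only when the word is not directly preceded by Central, South, Latin, or North. The pattern and the `mentions_us` name are my own.

```python
import re

# "America", "American" or "Americans" NOT directly preceded by a regional
# qualifier. Each lookbehind is fixed-width, as Python's re module requires.
# "Americas" (the whole hemisphere) is deliberately not matched.
US_SENSE = re.compile(
    r"(?<!Central )(?<!South )(?<!Latin )(?<!North )\bAmerica(?:n|ns)?\b"
)


def mentions_us(text: str) -> bool:
    """True if the text uses America/American without a regional qualifier."""
    return bool(US_SENSE.search(text))


samples = [
    "American officials visited Beijing",
    "Talks on Central America continued",
    "Latin American debt and American banks",
    "Solidarity with the peoples of the Americas",
]
for sample in samples:
    print(mentions_us(sample), "-", sample)
```

This is only a rough screen. It is case-sensitive, it assumes a single space between the qualifier and the word, and it would still flag a phrase such as "Central and South America" correctly only because "South" sits directly in front; other constructions would need to be checked by hand.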

But there are also many NCNA items that relate to these countries as well as to the United States. Additionally, LexisNexis maps certain terms (most notably, the names of countries in Central America) to the word "America" or "American". I found this out by testing various word combinations and exclusions and reviewing the resulting lists of NCNA stories. There is no way to counter this; it is hard-wired into the LexisNexis database by default.

Or is there? It is an issue I will need to address with my searches, or with an admission that one of my variables cannot be accurately searched in LexisNexis.
