Monday, June 13, 2005

Using LexisNexis Academic: Pros and Cons

A crucial part of my planned content analysis is the tool that I will use to gather data from the Xinhua News Agency. LexisNexis Academic holds the electronic archives of hundreds, maybe even thousands, of text-based news sources such as magazines, newspapers, and wire service reports, stretching back to the 1970s. There must be several million individual articles on file, which makes it an incredible resource for students of history, foreign policy, anthropology, and other disciplines. It was apparently started as a resource for lawyers to search for evidence and old case material held in text files.

However, the Web-based interface to this giant database is limited. It works like a search engine, with a few twists. If you want, you can search all of the news sources in the database, or you can use drop-down menus and other fields to restrict the search to a certain region or a certain publication. You can also specify a date range to search. There are three fields to specify the terms you want to search for, as well as operators such as "and" and "not" to add or eliminate terms from the results.

For instance, in my study of Xinhua references to Vietnam and Overseas Chinese, I set up a series of searches on one-month periods starting in January 1977 and ending in December 1979. I restricted the results to the Xinhua News Agency, so I wouldn't get "hits" from the New York Times, Associated Press, the Hong Kong Standard, etc. In the first field I typed "viet" (lowercase "v", as searches are not case-sensitive) as opposed to "Vietnam" because Xinhua's style guide spells the country's name "Viet Nam." I also made sure that "viet" wouldn't return results for articles that contained "Soviet", which was a common term in Xinhua articles at the time.
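The "viet" versus "Soviet" problem is easy to illustrate outside of LexisNexis. What follows is only a rough sketch in Python, not LexisNexis syntax: it assumes the goal is to match "viet" at the start of a word, and the sample headlines and the `mentions_viet` helper are made up for illustration.

```python
import re

# Match "viet" only where a word begins, ignoring case.
# "Viet Nam" and "Vietnamese" match; "Soviet" does not, because the
# "viet" inside it is preceded by a letter rather than a word boundary.
VIET = re.compile(r"\bviet", re.IGNORECASE)


def mentions_viet(text: str) -> bool:
    """Return True if the text contains a word starting with 'viet'."""
    return VIET.search(text) is not None


# Invented sample headlines, not real Xinhua items.
samples = [
    "Viet Nam expels overseas Chinese",
    "Soviet delegation arrives in Hanoi",
    "Vietnamese authorities issue statement",
]

for headline in samples:
    print(mentions_viet(headline), headline)
```

A plain substring test for "viet" would flag the second headline as well, which is exactly the kind of false hit the search has to avoid.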

I then selected the operator "and" and in the second field, I typed "overseas Chinese." The way the LexisNexis engine works, the words typed in a single field will be interpreted as a single phrase, not separate words -- i.e., only stories that contained the exact phrase "overseas Chinese" would be returned. Also, I could use the operator "not" to find all stories about Vietnam that do not mention overseas Chinese.

One other cool feature of LexisNexis is the ability to restrict each of the three search terms to a certain part of the story -- the headline, the lead paragraph, or the full body of the story. Anyone who regularly reads newspapers can appreciate the value of restricting a search to a word that appears in a headline versus anywhere in the body of an article. How many New York Times stories feature "corn" as a focus, versus a mere background detail? Restricting a search to headline terms filters out stories that include "corn" only in passing, since background details rarely make it into a headline.

While this may sound like a great tool, it's far from perfect. My research centers on frequency counts, and LexisNexis is only of limited help in this respect. It can tell me how many stories with a certain term appear in a certain time period, but only if the number is fewer than 1,000. Otherwise it returns an error message, forcing you to refine the search criteria, which usually means reducing the time period under study from, say, one month to one week, in order to stay below the 1,000-item limit. It also won't tell you the total number of stories printed by a given news source in a certain time period, which means you have to trick the engine into telling you this detail -- I used "item no" as a search term, as each article in the Xinhua archive starts with "Item No.:" in the slug. This allowed me to determine the total number of stories in a certain period and, after manually pasting the data into an Excel spreadsheet, calculate the frequency of stories mentioning Vietnam and overseas Chinese (as well as other terms). But this method is more labor-intensive and potentially more error-prone than it would be if LexisNexis built this capability into the existing tool.
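The bookkeeping described above is simple enough to sketch in a few lines of Python. This is only an illustration of the arithmetic, not something LexisNexis provides: the function names are my own, the counts in the usage lines are invented, and the one-week window is just the fallback mentioned above for periods that hit the 1,000-item limit.

```python
from datetime import date, timedelta

RESULT_CAP = 1000  # the result limit described above


def split_into_weeks(start: date, end: date):
    """Yield (first_day, last_day) windows of at most 7 days covering start..end."""
    day = start
    while day <= end:
        last = min(day + timedelta(days=6), end)
        yield day, last
        day = last + timedelta(days=1)


def relative_frequency(matching: int, total: int) -> float:
    """Share of all stories in a period that matched the search terms."""
    if total == 0:
        return 0.0
    return matching / total


# A month that exceeds the cap can be re-run as a series of weekly searches.
windows = list(split_into_weeks(date(1978, 5, 1), date(1978, 5, 31)))
for first, last in windows:
    print(first.isoformat(), "to", last.isoformat())

# Invented counts: 45 matching stories out of 900 "item no" hits.
print(relative_frequency(45, 900))
```

The weekly totals still have to be added up by hand (or in the spreadsheet) to get back to a monthly figure.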

Additionally, LexisNexis does not allow you to determine which terms appear most often in a body of news articles over a certain period. Wouldn't it be great if I could call up all Xinhua stories that mention Vietnam in 1978, and then find out the five most frequently mentioned words within those results that are located in the lead paragraph and are longer than five letters, and their relative frequencies?
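If the lead paragraphs could be exported as plain text, the word count I have in mind would look something like the Python sketch below. It assumes the text is already in hand, which LexisNexis does not make easy; the `top_words` function and the sample leads are hypothetical, and the relative frequency here is measured against all words that pass the length filter.

```python
import re
from collections import Counter


def top_words(lead_paragraphs, n=5, min_length=6):
    """Return the n most common words of at least min_length letters.

    Each result is (word, count, share), where share is the word's count
    divided by the total count of all words that pass the length filter.
    """
    counts = Counter()
    for lead in lead_paragraphs:
        for word in re.findall(r"[a-z]+", lead.lower()):
            if len(word) >= min_length:  # "longer than five letters"
                counts[word] += 1
    total = sum(counts.values())
    return [(word, c, c / total) for word, c in counts.most_common(n)]


# Invented lead paragraphs, for illustration only.
leads = [
    "Chinese residents return",
    "Chinese nationals return home",
]

for word, count, share in top_words(leads):
    print(word, count, round(share, 2))
```

A real version would also want to drop common words such as "before" or "through" so that they don't crowd out the terms of interest.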

Such a search would be possible if users were allowed to perform full SQL queries on the database. SQL stands for Structured Query Language, and is the language most relational databases understand. I suspect, though I cannot confirm it, that the LexisNexis engine uses SQL or something like it behind the scenes to return results to the Web browser interface. However, the browser interface is limited to the kinds of searches I have described above. More advanced queries, such as automatically calculating frequency counts, ordering results by size, or performing statistical analysis on the results, are not possible.
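To give a sense of what I mean, here is a rough sketch using Python's built-in sqlite3 module. Everything about it is an assumption: I have no idea how LexisNexis actually stores its archive, so the `stories` table, its columns, and the sample rows are invented purely to show the kind of query I wish I could run.

```python
import sqlite3

# Hypothetical schema and made-up rows; not the real LexisNexis database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stories ("
    "id INTEGER PRIMARY KEY, source TEXT, pub_date TEXT, "
    "headline TEXT, body TEXT)"
)
rows = [
    ("Xinhua", "1978-05-02", "Statement issued", "Report on overseas Chinese in Viet Nam"),
    ("Xinhua", "1978-05-10", "Harvest news", "Grain harvest figures released"),
    ("Xinhua", "1978-06-01", "Ships dispatched", "Overseas Chinese return by sea"),
    ("AP", "1978-05-03", "Wire report", "Overseas Chinese leave Viet Nam"),
]
conn.executemany(
    "INSERT INTO stories (source, pub_date, headline, body) VALUES (?, ?, ?, ?)",
    rows,
)

# One query returns, per month, the total number of Xinhua stories and
# the number whose body contains the phrase "overseas chinese".
query = """
SELECT substr(pub_date, 1, 7) AS month,
       COUNT(*) AS total_stories,
       SUM(CASE WHEN lower(body) LIKE '%overseas chinese%' THEN 1 ELSE 0 END)
           AS matching_stories
FROM stories
WHERE source = 'Xinhua'
GROUP BY month
ORDER BY month
"""
results = conn.execute(query).fetchall()
for month, total, matching in results:
    print(month, total, matching, matching / total)
```

A single query like this would replace the separate "item no" search, the term search, and the spreadsheet step for every month in the study.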

But what can I do? I can wait for LexisNexis to improve its browser-based search tool, or I can look elsewhere -- Factiva offers a similar service, but its Xinhua archive only goes back to 1989. Or I can attempt more tricks or workarounds to get the results that will help me with my research.

In any case, I should be thankful that I have access to the existing tool, via the Harvard Libraries' agreement with LexisNexis, which allows students to access the engine via a dial-up Web connection. Ten years ago, such a tool was probably not available unless you were using an on-campus connection.