Thursday, December 29, 2005

Scaling back my research: Reality sets in

Research update, Thursday night. Here are the highlights:

1) I have once again changed the scope of my research: The old plan was to conduct a computer-assisted text analysis over the Deng (鄧小平) and Jiang (江澤民) eras, but for reasons explained below, I have opted to cover just the Deng period, plus a little overlap on either side of his reign. I have also cut the number of planned searches of specific variables and combinations of variables from more than 30 to 23.

2) I have completed a census of all NCNA English news items (stories, briefs, and summaries, but not exchange rates or technical notes related to the wire service) for each month from January 1977 to December 1993.

3) I have created a basic spreadsheet model that includes searches for variables, as well as derivative variables, comparisons of variables, and other data.

Here are the details, for anyone who may be interested:

First, a key for the content variables I am working with:

V = Vietnam and related terms
K = Kampuchea and related terms
L = Laos and related terms

S = Soviet Union/Russia and related terms
U = United States and related terms

I = United Nations and related terms
A = ASEAN and related terms

Right now I have an Excel spreadsheet with a worksheet tab for each year from January 1977 to December 1993, broken up by month. Column A contains the following:

V
% of NCNA total

V+K

V-K
% V items with K
% V items without K

Ratio V:K items

V+L

V-L
% of V items with L
% of V items without L

Ratio V:L items

Ratio V+K:V+L items

(the list goes on to row 150, including blank rows)

Bolded items are searches I have to perform; the non-bolded items are derived results computed with Excel formulae (for instance, V-K can be obtained by subtracting V+K from V).

When I refer to V, it means the number of news items that mention Vietnam anywhere in the text. V+L corresponds to all items with both Vietnam and Laos in the full text; in other words, an article that mentions both Vietnam and Laos in the text of the story will be counted. If there are four articles that meet these criteria in a given month, my Excel spreadsheet will record "4". If I refer to V+L-S, it means a search for all NCNA news items that mention both Vietnam and Laos, but not the Soviet Union. "% of V items with L" displays the percentage of Vietnam items that also mention Laos.
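To make the spreadsheet logic concrete, here's a quick sketch in Python of how the derived cells follow from the searched ones. The counts below are made up for illustration; in the real spreadsheet, only the bolded items (here, V, V+K, and the NCNA monthly total) come from actual LexisNexis searches:

```python
# Hypothetical monthly counts from three LexisNexis searches
v_total = 120        # V: items mentioning Vietnam
v_and_k = 45         # V+K: items mentioning both Vietnam and Kampuchea
ncna_total = 3200    # all NCNA English items that month

# Derived cells, mirroring the Excel formulae
v_without_k = v_total - v_and_k                 # V-K
pct_v_with_k = 100 * v_and_k / v_total          # % V items with K
pct_v_without_k = 100 * v_without_k / v_total   # % V items without K
pct_of_ncna = 100 * v_total / ncna_total        # % of NCNA total

print(v_without_k)      # 75
print(pct_v_with_k)     # 37.5
print(pct_of_ncna)      # 3.75
```

Every non-bolded row in the column works the same way: once the handful of raw counts is entered, Excel fills in the rest.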

Relating to point 1, here's the reason for the change of heart. In a nutshell, reality set in: I realized that getting a baseline of NCNA items for the Jiang era would require at least 15 hours of additional work in LexisNexis. Fifteen hours is a big deal for someone with a full-time job and a full-time family. It would mean pushing back the starting point for writing my thesis proposal by at least two weeks. Just doing the Deng-era NCNA census totalled well over 1,000 searches, or about 12 hours of work in front of the computer.

But here's the part that's most frustrating: It doesn't have to take so much time to do these searches. The problem lies in the tool I am using. The more I use LexisNexis, the more aware I become of the limitations of its interface and results display. When I try to gather monthly totals of NCNA English items, any query that returns more than 1,000 hits produces an error, which causes lots of problems for me: by the early 1990s, a typical month had more than 5,000 NCNA news items. Practically speaking, that meant more than 100 searches per year, compared with fewer than 40 per year in the early 1980s. If I could perform SQL queries on the LexisNexis database, instead of using the crappy Web form, I could get the same results in less than an hour.
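To put rough numbers on that bottleneck, here's a toy calculation. The monthly totals are illustrative (loosely based on the figures above), and the cap reflects the 1,000-hit display limit:

```python
import math

HIT_CAP = 1000  # LexisNexis errors out on result sets larger than this

def min_queries(monthly_total):
    # Minimum number of date sub-ranges needed so that each
    # query stays at or under the display cap
    return math.ceil(monthly_total / HIT_CAP)

print(min_queries(800))    # 1  -- a hypothetical early-1980s month
print(min_queries(5400))   # 6  -- a hypothetical early-1990s month
```

In other words, as NCNA's output grew, every monthly count had to be chopped into more and more date sub-ranges just to see all the hits, which is pure busywork a direct database query would eliminate.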

Another issue relating to point 1: Every time I add a new set of search variables to the vertical column of my spreadsheet, I add at least two hours to the research, because each additional variable has to be tested for each month over a 16-year period, or N x 12 x 16. That's about two hours per variable. With seven variables under study (V, K, L, S, U, I, A), the possible combinations number in the hundreds. Where do I stop?
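The marginal cost works out like this. The per-search time is my assumption, chosen to match the roughly-two-hours figure:

```python
# Rough cost model for adding one search variable:
# one search per month over the 16-year span used in the estimate
MONTHS = 12 * 16
SECONDS_PER_SEARCH = 37  # assumed average time per LexisNexis query

searches = MONTHS
hours = searches * SECONDS_PER_SEARCH / 3600

print(searches)          # 192 searches per new variable
print(round(hours, 1))   # 2.0 hours
```

So each variable or combination I decide to keep is not a free data point; it's roughly 200 more trips through the Web form.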

At first, I identified about 35 crucial variables and combinations, but after considering the time it would take to complete all 35 x 16 x 12 searches, I whittled the list down to 23. Still, I have to consider what the dropped searches might have shown: the data from those extra searches could reveal significant patterns or other trends that shift the tone of my thesis.

OK, it's 11 pm, time to hit the sack. I'll release another progress report this weekend.
