Tonight I have embarked upon some review of this year's White House press briefings.
Yes, I was very bored!
With the advent of the Karl Rove controversy, White House press secretary Scott McClellan has come under fire. He and the White House have taken the position that they are not going to comment on an ongoing investigation. This does not keep the press corps from pressing him on the matter.
The current theory is that when the press corps gains momentum, McClellan has a "go to" reporter by the name of Raghubir Goyal. McClellan knows that once called upon, Goyal will ask a question that is totally unrelated to the Valerie Plame leak inquiry.
But, does this hold up to analysis? Has Goyal been recognized in the press briefing room now more than in the past? There's only one way to find out -- analyze the transcripts.
Press briefing transcripts are readily available at the WhiteHouse.gov website. So, I downloaded all of the briefings or gaggles that McClellan has given since the beginning of 2005. At first I figured this would be easy.
If McClellan answered questions by calling on the reporter by name, I could do a simple text search for "Goyal" and easily determine the frequency. In these briefings, however, McClellan usually points or nods to give a particular reporter the floor.
I had to attack this problem differently.
Consider this: an incoming email is either spam or ham (the "technical" term for a legitimate email). There is great software available to read an incoming message and assign it a spam rating. This rating translates into the likelihood of that particular message being spam. The only problem is that these software filters must be appropriately trained with a set of messages that are human-identified as ham and a set that are spam.
Why couldn't this software be used to decide if a question was asked by Goyal? All I had to do was find a set of questions that I definitely knew were posed by Goyal and train the filter accordingly. Occasionally, McClellan will call on a reporter by name. I did a search of the transcripts for "Goyal" and used the context clues to create a set of "Goyal questions".
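The search for named call-outs can be sketched roughly as follows. This is only an illustration: the function name `find_mentions` and the `context` parameter are hypothetical, and deciding which of the surrounding lines is actually Goyal's question still has to be done by hand.

```python
def find_mentions(transcript, name="Goyal", context=2):
    """Return each line that mentions `name`, with `context` lines on either side."""
    lines = transcript.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if name in line:
            start = max(0, i - context)
            # Keep neighbouring lines so a human can read the context clues.
            hits.append("\n".join(lines[start:i + context + 1]))
    return hits
```

Each returned snippet is a candidate for the hand-labelled training set, not a confirmed "Goyal question".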
The human-identified "Goyal questions" became the training set for the spam filter. I could easily identify about 18 questions from Goyal. I took an equivalent number of questions posed by other reporters and used those as the ham training set.
In order for the questions to be processed by the spam filter, they had to be turned into "legitimate" emails -- this means adding header information. Since I didn't want the spam filter to associate particular header information with the "Goyal questions", I chose to append identical headers to each message. It is my hope that this will force the spam filter to only identify the message based upon its content -- the question, in this case.
To do the actual filtering, I chose to use spambayes -- a Bayesian spam filter written in Python.
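A minimal sketch of that wrapping step, using Python's standard `email` package. The header values here are made up for illustration (the headers actually used are not recorded in this post); the only point is that every message gets exactly the same ones.

```python
from email.message import EmailMessage

# Hypothetical header values. What matters is that they are identical for
# every message, so the filter has nothing but the body to learn from.
FIXED_HEADERS = {
    "From": "question@example.com",
    "To": "filter@example.com",
    "Subject": "press briefing question",
}

def question_to_email(question_text):
    """Wrap one question in an email with the fixed headers."""
    msg = EmailMessage()
    for name, value in FIXED_HEADERS.items():
        msg[name] = value
    msg.set_content(question_text)
    return msg
```

Calling `question_to_email(text).as_string()` gives the raw message text that can be written to a file for the filter to read.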
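To make the idea concrete, here is a toy Bayesian text classifier. It is not spambayes and does not use its API or its scoring method; it is plain naive Bayes with add-one smoothing, standing in for the real filter. The class name, the labels, and the sample sentences in the usage note are all invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class TinyBayes:
    """Toy two-class naive Bayes: 'goyal' plays spam, 'other' plays ham."""

    def __init__(self):
        self.counts = {"goyal": Counter(), "other": Counter()}
        self.docs = {"goyal": 0, "other": 0}

    def train(self, text, label):
        self.counts[label].update(tokenize(text))
        self.docs[label] += 1

    def score(self, text):
        """Return an estimate of P(goyal | text) between 0 and 1.

        Needs at least one training example for each label.
        """
        vocab_size = len(set(self.counts["goyal"]) | set(self.counts["other"]))
        totals = {label: sum(c.values()) for label, c in self.counts.items()}
        # Start from the prior log-odds, then add each token's evidence.
        log_odds = math.log(self.docs["goyal"]) - math.log(self.docs["other"])
        for token in tokenize(text):
            p_goyal = (self.counts["goyal"][token] + 1) / (totals["goyal"] + vocab_size)
            p_other = (self.counts["other"][token] + 1) / (totals["other"] + vocab_size)
            log_odds += math.log(p_goyal) - math.log(p_other)
        # Clamp so math.exp cannot overflow on very long inputs.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Usage with invented training text: after `clf = TinyBayes()`, `clf.train("india pakistan relations prime minister", "goyal")` and `clf.train("karl rove leak investigation", "other")`, a call like `clf.score("question on india and pakistan")` comes out above 0.5, while `clf.score("rove leak")` comes out below it.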
The next step is to ask the spam filter about each question. I stripped out all the questions from the transcripts and formatted them as individual emails. The formatting was done with the identical header information that was appended to the training set questions -- with one exception. I added a special X- field into the header so I could link the questions to the briefings where they were asked.
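The extraction step might look something like the sketch below. It assumes a transcript layout in which each question starts with `Q ` and each answer starts with a speaker label such as `MR. McCLELLAN:`; real transcripts may need more careful handling. The header name `X-Briefing-Id` is hypothetical, since the post does not name the field that was used.

```python
from email.message import EmailMessage

def split_questions(transcript):
    """Collect the text of each turn that begins with 'Q ' (assumed layout)."""
    questions, current = [], None
    for line in transcript.splitlines():
        line = line.strip()
        if line.startswith("Q "):
            if current:
                questions.append(" ".join(current))
            current = [line[2:].strip()]
        elif line.startswith("MR. "):
            # A speaker label ends the question in progress.
            if current:
                questions.append(" ".join(current))
            current = None
        elif current is not None and line:
            # Continuation line of a multi-line question.
            current.append(line)
    if current:
        questions.append(" ".join(current))
    return questions

def tag_question(question, briefing_id):
    """Wrap a question as an email that records which briefing it came from."""
    msg = EmailMessage()
    msg["Subject"] = "press briefing question"  # same for every message
    msg["X-Briefing-Id"] = str(briefing_id)     # hypothetical field name
    msg.set_content(question)
    return msg
```

Because the briefing id varies between messages, it is worth checking that the filter's tokenizer does not treat that header as evidence.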
Each message will receive a score that indicates the likelihood that it is spam. Again, spam correlates to a question posed by Goyal.
Of approximately 4500 questions submitted to the spam filter, about 470 were rated as spam. This implies that so far this year, about 10 percent of the questions were asked by Goyal. I am not a White House press corps expert, but this does seem a little high.
Now, what about the most important question: Has the frequency of "Goyal questions" increased recently as the Karl Rove controversy heated up?
Here is the graph created from my collected data:
The x axis represents each of the briefings. The y axis represents the number of questions asked at each briefing by Goyal, as identified by the Bayesian spam filter. For the sake of clarity, the x axis lists each briefing using a unique id. These ids were assigned simply: the first briefing of the year was given 1, the second 2, etc.
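The per-briefing tally behind a graph like this can be sketched as follows. The input format (pairs of briefing id and score) and the 0.9 cutoff are assumptions for illustration; the post does not state what cutoff separated "Goyal" from "not Goyal".

```python
from collections import Counter

def goyal_counts_per_briefing(scored, cutoff=0.9):
    """Count questions at or above `cutoff` for each briefing id.

    `scored` is an iterable of (briefing_id, score) pairs. Briefings with no
    flagged questions still appear, with a count of 0.
    """
    counts = Counter()
    for briefing_id, score in scored:
        counts[briefing_id] += int(score >= cutoff)
    return dict(sorted(counts.items()))
```

The resulting dictionary maps briefing id (x axis) to the number of flagged questions (y axis) and can be handed to any plotting tool.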
I will let you, the dedicated reader, draw your own conclusions.
Comments?
Monday, July 25, 2005
1 comment:
you sir, are a wingnut... but i raise my turban to you... that analysis was a work of art.
hmm.. i prob need to put this down on paper, but i think you're assuming that the spin needs on each press briefing are of equal probability... this could be modified tho' based on plame, rove, brownie etc.
any chance you could overlay a timeline on your chart highlighting major spin events... :-)