Tag Archives: data mining

Employment Progression

In my new dataset, each row is a series of jobs that one person has had.

Most of them are quotidian:
Junior Tax Analyst –> Senior Tax Analyst
Investment Banking –> Investment Banking –> Investment Banking

Some of them are funny:
Corn Detassler –> Flight Delivery Center Technician
Quabbity Assuance –> Electronics Sales Associate

Some baffling:
Gymnast –> Air Traffic Controller
bust boy –> bust boy –> bust boy?

And some inspiring:
Dishwasher –> Dishwasher –> Model

Does Barack Obama follow Queen Noor?

Twitter maintains a few lists of verified accounts. One of these lists includes 38 world leaders. Using Twitter’s fantastic API, I did some detective work to see which world leaders “follow” which others.

Follow network of verified world leaders’ Twitter accounts.

The graph is messy, but it displays some order. Barack Obama (@BarackObama) and David Cameron (@Number10gov) tie for the most followers within this group, at 17 each, and appear toward the center of the network. The Prime Minister is more reciprocal in his attention, with 13 outgoing follows to the President’s mere 4.
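For anyone who wants to reproduce the follower and outgoing-follow tallies, here is a minimal sketch in Python. It assumes the follow relationships have already been pulled into a list of (follower, followed) pairs. Only the @QueenNoor pair comes from this post; "LeaderA" and "LeaderB" are made-up placeholders, and the function names are my own.

```python
from collections import Counter

# Hypothetical edge list of (follower, followed) pairs.
# Only the first pair is taken from the post; the rest are placeholders.
follows = [
    ("QueenNoor", "BarackObama"),
    ("LeaderA", "BarackObama"),
    ("LeaderA", "LeaderB"),
    ("LeaderB", "LeaderA"),
]

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_degree = Counter(followed for _, followed in edges)
    out_degree = Counter(follower for follower, _ in edges)
    return in_degree, out_degree

def is_reciprocated(edges, a, b):
    """True only if a follows b and b follows a."""
    edge_set = set(edges)
    return (a, b) in edge_set and (b, a) in edge_set

in_degree, out_degree = degree_counts(follows)
```

Here in_degree is the "followers" number quoted above and out_degree is the count of outgoing follows, both restricted to accounts inside the list.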

What does it mean for one world leader to follow another on Twitter? Probably not much. Perhaps there will come a day when it is a diplomatic faux pas to meet with a head of state and then neglect to follow his Twitter account.

As for whether Barack Obama follows Queen Noor? He does not. @QueenNoor’s follow of @BarackObama is unrequited.

Politics or Sports

When tweeting, what words do people use when they are talking about politics? I did a quick analysis of the last 1000 tweets from the 16 most popular political bloggers and the last 1000 tweets from the 16 most popular sports bloggers.

Here are the overall word counts, the counts for political and sports tweets separately, and a measure of the politics/sports diagnosticity.

Word Total Political Sports Chi-square
obama 1137 1132 5 558.54
gop 659 659 0 329.50
game 758 28 730 325.07
house 551 540 11 253.94
yankees 500 2 498 246.02
senate 452 452 0 226.00
party 432 413 19 179.67
vs 598 69 529 176.92
democrats 319 319 0 159.50
health 346 336 10 153.58
tea 324 319 5 152.15
president 370 349 21 145.38
ufc 293 1 292 144.51
angels 282 0 282 141.00
election 271 271 0 135.50
political 265 262 3 126.57
vote 345 319 26 124.42
rangers 256 3 253 122.07
reform 242 242 0 121.00
update 568 100 468 119.21
#yankees 232 0 232 116.00
giants 248 5 243 114.20
#ufc 225 0 225 112.50
lakers 225 2 223 108.54
#mma 216 0 216 108.00
on 4265 2605 1660 104.69
watch 446 374 72 102.25
government 204 204 0 102.00
campaign 217 213 4 100.65
law 230 222 8 99.56
obama’s 199 199 0 99.50
us 406 343 63 96.55
dodgers 197 1 196 96.51
bowl 200 2 198 96.04
(video) 211 206 5 95.74
rally 258 240 18 95.51
football 205 6 199 90.85
republicans 181 181 0 90.50
tax 194 189 5 87.26
today 516 407 109 86.05
polls 185 181 4 84.67
obamacare 169 169 0 84.50
republican 169 169 0 84.50
palin 168 168 0 84.00
coach 175 2 173 83.55
bush 184 179 5 82.27
dems 163 163 0 81.50
nfl 189 8 181 79.18
voters 180 174 6 78.40
basketball 160 1 159 78.01
o’donnell 153 153 0 76.50
fans 178 7 171 75.55
players 162 3 159 75.11
news 389 315 74 74.65
race 214 196 18 74.03
[delicious] 147 0 147 73.50
obamas 143 143 0 71.50
kings 158 4 154 71.20
team 307 49 258 71.14
#rangers 139 0 139 69.50
care 228 202 26 67.93
play 230 27 203 67.34
congress 139 137 2 65.56
season 199 19 180 65.13
kobe 130 0 130 65.00
bill 253 217 36 64.75
america 170 159 11 64.42
politics 124 124 0 62.00
in 5766 3304 2462 61.48
et 149 142 7 61.16
sox 122 0 122 61.00
jobs 137 133 4 60.73
economy 125 124 1 60.52
notes 178 16 162 59.88
debate 163 151 12 59.27
cam 116 0 116 58.00
democratic 116 116 0 58.00
125 115 0 115 57.50
brandon 115 0 115 57.50
cbs 125 122 3 56.64
christine 112 112 0 56.00
player 123 3 120 55.65
preview 169 16 153 55.53
elections 115 114 1 55.52
democrat 111 111 0 55.50
sen 110 110 0 55.00
americans 121 118 3 54.65
supreme 113 112 1 54.52
dem 111 110 1 53.52
rep 122 118 4 53.26
speech 110 109 1 53.02
pelosi 106 106 0 53.00
fox 143 133 10 52.90
its 300 239 61 52.81
#playoffs 105 0 105 52.50
security 116 113 3 52.16
espn 113 3 110 50.66
usc 108 2 106 50.07
deal 199 29 170 49.95
mosque 99 99 0 49.50
american 194 166 28 49.08

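Chi-square (the last column) was calculated with the chi-square formula: (Observed Frequency – Expected Frequency)^2 / Expected Frequency. The “Expected Frequency” in this case was half the total number of times the word appeared. In other words, we assume each word has an equal chance of appearing in a sports tweet or a political tweet, and then measure how much that assumption was violated. Note that the column reports this term for a single category; summing it over both categories, as the full chi-square statistic does, would exactly double every value and leave the ranking unchanged.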
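The calculation can be sketched in a few lines of Python. This is a reconstruction under assumptions, not the script I actually ran: it guesses that tweets are lowercased and split on whitespace (which would explain tokens like "(video)" and "#yankees" in the table), and the function names are invented for illustration.

```python
from collections import Counter

def word_counts(tweets):
    """Count lowercase, whitespace-separated tokens across a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts

def diagnosticity(political_count, sports_count):
    """(Observed - Expected)^2 / Expected for one category, where the
    expected count is half of the word's total occurrences."""
    total = political_count + sports_count
    if total == 0:
        return 0.0
    expected = total / 2
    return (political_count - expected) ** 2 / expected

def rank_words(political_tweets, sports_tweets):
    """Return rows of (word, total, political, sports, score), highest score first."""
    political = word_counts(political_tweets)
    sports = word_counts(sports_tweets)
    rows = []
    for word in set(political) | set(sports):
        p, s = political[word], sports[word]
        rows.append((word, p + s, p, s, diagnosticity(p, s)))
    rows.sort(key=lambda row: row[4], reverse=True)
    return rows
```

Plugging in the first row of the table, diagnosticity(1132, 5) rounds to 558.54, which matches the value listed for "obama".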

This table lists only the most diagnostic words. Of course there were tens of thousands more. However, if you ever need to build a quick-and-dirty classifier or settle a bet on which words separate the pols from the jocks, here’s your answer. 🙂

UPDATE: The Algorithm for Facemash in The Social Network

Original post: The Algorithm for Facemash in The Social Network

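After some Googling, it would appear that the Facemash algorithm corresponds to the Elo rating system. Thus, the equations involved are:
Ea = 1/(1 + 10^((Rb - Ra)/400))
Eb = 1/(1 + 10^((Ra - Rb)/400))
where Ra and Rb are the current ratings of A and B, and Ea and Eb are their expected scores against each other.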

The Mathematical Details section of the Wikipedia article on the Elo rating system explains the implementation of the algorithm.
Thanks to this Quora article for the most relevant links.
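To make the equations concrete, here is a minimal Python sketch of the expected-score formula plus the standard Elo update step, in which a rating moves by K times the difference between the actual and expected score. The update step and the value K = 32 are not shown in this post; they are assumptions borrowed from common Elo practice, and nothing here is claimed to be what the movie's Facemash actually ran.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B: 1 / (1 + 10^((Rb - Ra) / 400))."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(rating_a, rating_b, score_a, k=32):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    k is the update step size; 32 is an assumed, commonly used value.
    """
    expected_a = expected_score(rating_a, rating_b)
    expected_b = expected_score(rating_b, rating_a)
    score_b = 1.0 - score_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * (score_b - expected_b)
    return new_a, new_b
```

With two equally rated entrants at 1500, each expects a score of 0.5, so a win moves the winner to 1516 and the loser to 1484. Beating a much higher-rated opponent earns more points than beating an equal, because the expected score was lower going in.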

I’m surprised to see this algorithm treated with such reverence in the movie. While it is sophisticated and useful, and has been used in official chess rankings for decades, it has recently taken a beating at the hands of modern data miners. As of 10/8/2010, ninety teams have developed more predictive methods than Elo ratings for handicapping chess matches: see the Kaggle contest leaderboard.