Study Computational Sociology at Stony Brook University

Come study computational sociology with me, Arnout van de Rijt and Jennifer Heerwig!

In collaboration with Stony Brook University’s Institute for Advanced Computational Science (IACS) and the Big Data for the Social Sciences (BDSS) cluster, the Department of Sociology is excited to inform graduating seniors and Master’s students about a new funding opportunity for those interested in pursuing a doctoral degree in sociology. For students with a background that combines programming skills and an interest in social science, the IACS offers a New Recruit Fellowship, which supplements the regular Sociology stipend, bringing total compensation to $32,000 annually (with extension for subsequent years based on performance). It also provides a $4,000 budget for travel and equipment expenses. As a result of the BDSS cluster initiative, the Department of Sociology has hired leading scholars in this new field and established interdisciplinary connections with top research faculty from computer science and political science.

Summary:

For whom:
* Graduating seniors and Master’s students whose background combines programming and computing skills with a deep interest in social science

Where:
* Sociology Department and Institute for Advanced Computational Science at Stony Brook University

What:
* Graduate studies in computational sociology
* Generous financial support and close supervision for incoming Ph.D. students:
– IACS New Recruit Fellowship increasing regular stipend to $32k + $4k travel funds 1st year
– Big Data for the Social Sciences cluster, with leading scholars in sociology, political science and computer science

More:
Requirements for the doctoral program in the Department of Sociology
Start an Application

Comments on Lorge & Kruglov, 1950

Summary and commentary for “The Relationship Between the Readability of Pupils’ Compositions and Their Measured Intelligence”

Irving Lorge and Lorraine Kruglov

The Journal of Educational Research, Vol. 43, No. 6 (Feb., 1950), pp. 467-474

I have to admit it was a little bit dispiriting to read this article. First, it describes a project very similar to the one I am about to undertake. Second, this project beat me to the punch by more than fifty years. Third, the findings were negative, while I’m expecting my findings to be positive. And finally, in the 62 years this article has existed, it has garnered exactly 7 citations, so I have to wonder how interested the academy will be in the project I am just starting. Anyway, back to the article at hand.

In this paper, Lorge and Kruglov use the high-school entrance exam scores of 50 eighth- and ninth-graders to correlate the “readability” of the students’ writing with the same students’ scores on the intelligence-testing portion of the same exam. They find positive correlations, but the values are low (~.10) and not significantly different from zero. They conclude that for people matched on education and age level, the complexity of their writing is not a good predictor/substitute/correlate of general intelligence.

The main reason they do not find a significant correlation is likely to be the restricted range of the data. In the article, the authors mention two successful demonstrations of correlation between readability measures and education levels. It seems Lorge and Kruglov were too ambitious in thinking that readability would be successful in predicting intelligence in a small sample of relatively similar students: all were eighth- and ninth-graders in New York schools applying for a selective science high school.

One could rightly argue that the data are nearly useless in answering the question of whether there exists a relationship between writing complexity and intelligence in general. The lack of a significant correlation in this narrow range of measured data points does not disprove an overall relationship that may still exist.
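The restricted-range argument can be illustrated with a small simulation. This is only a sketch under made-up assumptions: a hypothetical population in which "complexity" is "ability" plus independent noise, and a hypothetical narrow band of ability standing in for a pool of similar applicants. None of these numbers come from Lorge and Kruglov's data.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n=20000, band=0.25, seed=1950):
    """Compare the correlation in a full population with a narrow slice of it."""
    rng = random.Random(seed)
    # Hypothetical population: complexity = ability + independent noise.
    ability = [rng.gauss(0, 1) for _ in range(n)]
    complexity = [a + rng.gauss(0, 1) for a in ability]
    r_full = pearson_r(ability, complexity)
    # Keep only people whose ability falls inside a narrow band (assumed width).
    narrow = [(a, c) for a, c in zip(ability, complexity) if abs(a) < band]
    r_narrow = pearson_r([a for a, _ in narrow], [c for _, c in narrow])
    return r_full, r_narrow, len(narrow)


r_full, r_narrow, n_narrow = simulate()
print(f"full population:  r = {r_full:.2f}")
print(f"narrow slice:     r = {r_narrow:.2f} (n = {n_narrow})")
```

The underlying relationship is identical in both groups. Only the spread of ability differs, and that alone is enough to pull the correlation in the narrow slice well below the population value.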

The paper is important in practical terms. Suppose the test evaluators had intended to use Lorge Readability as the sole measure of subjects’ ability. The fact that it does not correlate with intelligence in this sample shows this would be a grave mistake.

I still hypothesize that – in general – writing complexity and intelligence will be correlated, but this article gave me some pause. If evaluation in a narrow range is the goal, I will need to be extremely careful as to whether my methods are rigorous and precise enough to meet that goal. And I will need to be clear in explaining that they do not, if that is the case.

Quick hits:

It sounds like the authors had thousands of exam results to choose from and chose 50 at random for this study. Times change, I guess. Although I might have done the same if I were computing all the scores and correlations by hand.
On average, students write two grade levels below their current level. The authors claim this is because students’ comprehension runs ahead of their ability to compose.
The intelligence measure was the total score on 30 arithmetic problems, 60 multiple-choice vocabulary questions and 15 “proverb-matching” items. Compositions were of ~100 words. I wonder how much longer compositions or multiple compositions per student would have increased the precision of the readability measure.

Learning How Things Go Together

[This is my attempt at converting my dissertation abstract to “Up-Goer Five speak” (i.e. using only the 1000 most-frequently used English words). For context, here’s the xkcd comic that started the trend. Search the #upgoer5 hashtag on Twitter for more. Try it yourself on the Up-Goer Five text editor.]

Big things are just many small things put together. It would be good to know which small things go together. You could learn how a brain works by thinking this way. Or you could learn which people like which other people. Thinking about how small things are put together to make big things is a good idea. It would be good to know how we learn, and how we should learn which things go together.

To this end, I did five studies in which people learned which things in a set were joined together. To show you what I mean, some people learned “who is friends with who” in a friend group. But other people learned about other things that were joined together – like which cities have roads that go between them. By doing these studies, I found out a few things. One thing I learned was that it matters how the things are joined up. To show you what I mean, think about the friend group again. It is easier to learn who is friends with who in a group where few people have many friends and many people have few friends. If things are more even, and all people have about the same number of friends, it is hard to learn exactly who is friends with who.

It doesn’t matter if the joined things are people or cities or computers. It is all the same. Also, it doesn’t seem to matter much why it is you are learning what things go together.

I also show that people learn better by seeing a picture of joined-together things rather than reading about joined-together things. This is the case even more when the things that are joined are made to be close together in the picture.

Finally, I talk about an all-around idea for how people learn about groups of joined together things. I say people start out by quickly sorting things into much-joined and few-joined types. Then they more slowly learn which one thing is joined to which one other thing a little at a time.

Dollar Value of Personal Data

Personal Data – Median Fair Price

How much is your personal data worth? Worth, as in – how much should you sell it for? In dollar terms.

I went looking for attempts to answer this question and didn’t find much. So I took a shortcut and asked a bunch of people. Here’s how I set up the survey:

Imagine your friend has just told you about his new job. He now works for a company that pays people for their personal data. For example, you would tell the company your name and the name of your favorite TV show, and you would receive a certain amount of money in return.

Your friend is in charge of setting fair prices for each piece of personal data. He needs advice from you about how much to pay. For each of the items below, please provide the price you believe would be fair to pay someone to provide that information.

More methodology details are below, but let’s get straight to the results. In this table is each facet of personal data requested, and the survey respondents’ median¹ dollar price value.

Personal Data	Median Fair Price
Home or Cell Phone Number	$6.25
Home Address (Street Address)	$5.00
Name of Employer	$5.00
Previous Employers	$5.00
Brand Name of Bank Used Most Often	$3.00
Brand Name of Credit Card Used Most Often	$2.00
Link to (Public) Facebook Profile	$2.00
Twitter Username	$2.00
Age(s) of Children Living at Home	$1.75
Yearly Income	$1.50
Make and Model of Car Driven Most Often	$1.00
Home Address (City)	$1.00
Brand Name of Computer Used Most Often	$1.00
Date of Birth (Month, Day and Year)	$1.00
Highest Level of Education	$1.00
College Major	$1.00
Marital Status	$1.00
Political Party Affiliation	$1.00
Religious Affiliation	$1.00
Home Address (State)	$1.00
Home Address (Zip Code)	$1.00
Gender	$0.50
Favorite Book	$0.50
Favorite Movie	$0.50
Favorite Restaurant	$0.50
Favorite Song	$0.50
Favorite TV Show	$0.50

Results

People just don’t want to be bothered. Phone number and street address are the pieces of personal information held most dear.
Employer and previous employer data are curiously highly valued.
Twitter and Facebook identities are valued more highly than a number of demographic variables, including annual income.

What Personal Data is it Inappropriate Even to Ask About?

I wanted to make an option available to users to indicate that no price could ever persuade them to part with some data. The option I settled upon was a checkbox labeled “None / Inappropriate / Should Not Ask” that could be selected instead of entering a dollar value.

In the table below, I list the personal data labels and the number of respondents who chose “None / Inappropriate / Should Not Ask” instead of entering a dollar value for the item.

Personal Data	Marked Inappropriate
Previous Employers	37
Home or Cell Phone Number	36
Home Address (Street Address)	34
Name of Employer	34
Brand Name of Bank Used Most Often	34
Brand Name of Credit Card Used Most Often	30
Link to (Public) Facebook Profile	29
Twitter Username	28
Age(s) of Children Living at Home	22
Date of Birth (Month, Day and Year)	22
Make and Model of Car Driven Most Often	20
Home Address (State)	17
Religious Affiliation	15
Home Address (City)	15
Political Party Affiliation	14
Home Address (Zip Code)	14
Brand Name of Computer Used Most Often	13
Highest Level of Education	7
College Major	7
Marital Status	7
Gender	7
Favorite Book	6
Favorite Movie	6
Favorite Song	6
Favorite TV Show	6
Favorite Restaurant	5
Yearly Income	1

The ordering should look rather familiar. The median price ordering and the inappropriate-to-ask ordering are almost identical. The (Spearman) r-value for the correlation is 0.96. This suggests that the personal data components given high dollar values (by respondents willing to affix a dollar value) were the same components that other respondents thought should be unavailable for sale at any price.
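For readers who want to try a rank correlation themselves, here is a minimal pure-Python sketch. It ranks both columns (tied values share the average of their ranks) and takes the Pearson correlation of the ranks. The `items` list is transcribed from the two tables above; the function names are illustrative, and because the post does not say exactly which inputs or tie handling produced 0.96, the value printed here need not match it.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# (item, median fair price in dollars, times marked inappropriate)
items = [
    ("Home or Cell Phone Number", 6.25, 36),
    ("Home Address (Street Address)", 5.00, 34),
    ("Name of Employer", 5.00, 34),
    ("Previous Employers", 5.00, 37),
    ("Brand Name of Bank Used Most Often", 3.00, 34),
    ("Brand Name of Credit Card Used Most Often", 2.00, 30),
    ("Link to (Public) Facebook Profile", 2.00, 29),
    ("Twitter Username", 2.00, 28),
    ("Age(s) of Children Living at Home", 1.75, 22),
    ("Yearly Income", 1.50, 1),
    ("Make and Model of Car Driven Most Often", 1.00, 20),
    ("Home Address (City)", 1.00, 15),
    ("Brand Name of Computer Used Most Often", 1.00, 13),
    ("Date of Birth (Month, Day and Year)", 1.00, 22),
    ("Highest Level of Education", 1.00, 7),
    ("College Major", 1.00, 7),
    ("Marital Status", 1.00, 7),
    ("Political Party Affiliation", 1.00, 14),
    ("Religious Affiliation", 1.00, 15),
    ("Home Address (State)", 1.00, 17),
    ("Home Address (Zip Code)", 1.00, 14),
    ("Gender", 0.50, 7),
    ("Favorite Book", 0.50, 6),
    ("Favorite Movie", 0.50, 6),
    ("Favorite Restaurant", 0.50, 5),
    ("Favorite Song", 0.50, 6),
    ("Favorite TV Show", 0.50, 6),
]

prices = [price for _, price, _ in items]
flags = [count for _, _, count in items]
rho = spearman(prices, flags)
print(f"Spearman correlation from the published tables: {rho:.2f}")
```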

Survey Methodology

I used Mechanical Turk to field a survey to 104 people. I limited the survey to users in the United States. Below is a partial screenshot that is exactly what the survey-takers saw when beginning the survey.

Survey Instructions and Example Questions

The average time each survey-taker spent answering questions was just over four minutes. They gave a dollar value for the fair price or marked Inappropriate for 27 items.

If you are interested in working with the raw fair price response data, please contact me and I will provide it.

Caveats

This survey should not be considered “scientific.” I did not attempt to obtain a random sample of the human population nor even the United States population. The sample is representative of those people using Amazon Mechanical Turk and willing to take a survey about the dollar value of personal data. How much of a limitation that is is up to you.

I specifically asked users to provide “the price you believe would be fair to pay someone” for each item. I did not ask them what the price would have to be for them to sell their own data. I purposefully did this to reduce noise due to uniquely personal preferences in the data, but I recognize some might feel it better to ask the price question more directly.

Footnotes

¹ Because of the inevitable positive skew for these questions, the median is both nicer to look at and more representative of the actual distribution than the mean. One way to think about the median value in this context is that 50% of the people surveyed would accept this value as a fair price: it meets or exceeds their asking price.
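A tiny example shows why the median holds up better than the mean under positive skew. The prices below are invented for illustration and are not survey responses.

```python
import statistics

# Hypothetical fair-price answers for one item: most are small, one is huge.
prices = [0.5, 1, 1, 1, 2, 2, 5, 10, 100]

median_price = statistics.median(prices)
mean_price = statistics.mean(prices)

# Share of respondents whose asking price is met by paying the median.
share_accepting = sum(p <= median_price for p in prices) / len(prices)

print(f"median = ${median_price:.2f}, mean = ${mean_price:.2f}")
print(f"share whose price is met at the median: {share_accepting:.0%}")
```

The single $100 answer drags the mean far above what most of these respondents asked for, while the median stays near the middle of the pack and still meets the asking price of at least half of them.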

Twitterjjj – 180 Million Tweets at a Glance

Twitterjjj is a set of scripts I use to take a first look at what people say when they tweet about a particular topic, brand or person.

I wrote it to a small set of (self-imposed) specifications.

Create a 1-page report describing how people discuss a keyword or phrase in their tweets.
Ensure the report is easy to read on the web, and that results can be easily downloaded for further analysis.
Respond immediately with preliminary results and continue to update as the processor churns through the terabytes.

My corpus of tweets starts in April 2012 and continues to the present. To be non-specific about the numbers, there are about 180 million tweets in this corpus, with about 3 million more added every day.

Many of the reports I’ve run I make publicly available. For instance, you might discover that people use more negative sentiment than positive when they discuss republicans on twitter. Don’t worry, the same is true for democrats. You might have guessed that, but did you know that tweeters are 3 times more likely to mention the “GOP” than “republicans”? Brevity is the soul of twitter.

Beyond politicians, who occupies a lot of twitter mindshare? Let’s look at the first three celebrity names that popped into my head: Justin Bieber, Lady Gaga and Kim Kardassian. We’ll measure mindshare in tweets-per-million. That is, in every one million tweets, how many times is the keyword (celebrity name in this case) present?

Celeb	Tweets per Million
Bieber	908
Gaga	541
Kardassian	116

It looks like Bieber fever is more contagious than the Kardassian cough.
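The tweets-per-million measure is simple enough to sketch in a few lines. This is an illustration, not the Twitterjjj code: the case-insensitive substring match and the four-tweet toy corpus are assumptions made for the example.

```python
def tweets_per_million(tweets, keyword):
    """Number of tweets containing the keyword, scaled to one million tweets."""
    if not tweets:
        return 0.0
    keyword = keyword.lower()
    # Assumption: a tweet counts once if the keyword appears anywhere in it.
    hits = sum(1 for tweet in tweets if keyword in tweet.lower())
    return 1_000_000 * hits / len(tweets)


# Hypothetical toy corpus.
corpus = [
    "Bieber tickets sold out in minutes",
    "Traffic is terrible today",
    "New coffee place downtown",
    "Watching the game tonight",
]

rate = tweets_per_million(corpus, "bieber")
print(rate)
```

Scaling to a fixed base keeps the numbers comparable as the corpus grows, since a raw count would rise every day even if the share of tweets mentioning the keyword stayed flat.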

Find me on Twitter @jasonjones_jjj. I’m happy to hear feedback and suggestions for new keywords to explore.

London 2012 Twitter Olympics

The London 2012 Olympics are upon us. Lots of athletes will be judged, timed and measured for the athletic things they do. But aren’t the non-athletic things they do much more interesting? Like tweeting?

No, of course not. But that’s not going to stop me from holding my very own Twitter Olympics and handing out (virtual) medals for exceptional Twitter performances.

Below is the list of events and the results so far. I’ll be rolling out more results as the real Olympics go on.

Preview Event: Games-Dropping

Who’s most pumped up for the games? In this event, Olympians score one point for every time they have used the words Olympics, Games, or London in their recent tweets.

	Olympian	Games, Olympics or London in Tweets
	@SwissDom	41
	@lolojones	37
	@NickSymmonds	28

Full Results: london2012_gamesd (.xlsx)

Sexy At-Mention

In this event, a sampling of thousands of tweets mentioning Olympic athletes was scored. The winning Olympian was the one with the most co-occurrences of their Twitter handle and the words “sexy,” “hottest,” “beautiful,” “cute,” “handsome,” “pretty,” or “babe.”

	Olympian	Sexy At-Mentions
	@matthew_mitcham	25
	@Joeingles7	8
	@hopesolo	7

Full Results: london2012_sexy (.xlsx)

The Sesquipedaliathon

In this event, scores are awarded based on the average number of syllables per word in the Olympian’s tweets.

Olympian	Syllables Per Word	Longest Word
@juanmata10	1.76	visitaremos
@Njr92	1.66	spideranderson
@TipsarevicJanko	1.62	pantomime

Full Results: london2012_sesq

Most Followed

In this event, one point is scored for each Twitter follower.

	Olympian	Followers
	@Njr92	4,953,514
	@juanmata10	1,164,329
	@DjokerNole	1,111,326

Full Results: london2012_user_info

Most Followed (by other Olympians)

In this event, one point is scored for each Olympian follower.

	Olympian	Fellow Olympian Followers
	@MichaelPhelps	12
	@usainbolt	11
	@lolojones	9

Full Results: london2012_degrees

London 2012 Olympians Twitter Follow Network. Arrows point from follower to followee.

Most Follows

In this event, one point is scored for each Twitter user the Olympian follows.

	Olympian	Followees
	@officialasafa	3,794
	@Njr92	630
	@TomDaley1994	542

Full Results: london2012_user_info

Most Follows (of other Olympians)

In this event, one point is scored for each fellow Olympian the Olympian follows.

	Olympian	Follows X Fellow Olympians
	@ItsStephRice	8
	@OscarPistorius	7
	@RickyBerens	7
	@drewsullivan8	6
	@matthew_mitcham	6
	@MichaelPhelps	6
	@PopsMBonsu	6

Full Results: london2012_degrees

Special Event: Non-Olympian Most Followed by Olympians

The only event (so far) in which non-Olympians compete. Medals go to those non-Olympians who are followed by the most Olympians.

Non-Olympian	Name	Olympian Followers
@OMGFacts	OMG Facts	10
@NBA	NBA	9
@SportsCenter	SportsCenter	9
@espn	ESPN	8
@Sports_Greats	Sports Quotes	8

Full Results: london2012_nonlist_followees

Olympic Followback

Most athletes have many more followERS than followEES. In this event, Olympians are scored according to the proportion of their followers that they follow back.

	Olympian	Followback Percentage
	@drewsullivan8	16.7%
	@SmoothKJ88	14.2%
	@EricBoateng	10.1%

Full Results: london2012_followback
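One way to read the followback score is as the overlap between two sets of accounts. The sketch below assumes that reading; the handles are hypothetical, and the published percentages may have been computed differently.

```python
def followback_share(followers, followees):
    """Fraction of an account's followers that the account follows back."""
    followers = set(followers)
    if not followers:
        return 0.0
    return len(followers & set(followees)) / len(followers)


# Hypothetical accounts: four followers, two of whom are followed back.
followers = {"@fan_a", "@fan_b", "@fan_c", "@fan_d"}
followees = {"@fan_a", "@fan_b", "@coach"}

share = followback_share(followers, followees)
print(f"{share:.1%}")
```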

Resources

Download list of London 2012 Olympic athletes on Twitter. (.xlsx)
Twitter API Documentation
You can tweet me @jasonjones_jjj. I especially welcome tweets with more London 2012 athlete Twitter handles and suggestions for new events.
Check out the Twitter Olympics practice run in my last blog post. In that version I used Twitter’s list of “verified” Olympians – consisting mostly of Winter Games athletes.

Olympians on Twitter Olympics

The Olympics bring together the world’s most talented and dedicated athletes. And so does Twitter. As a part of my continuing effort to try to do interesting things with the Twitter API, I decided to create my own Olympics for Olympians on Twitter. Er, yeah I think that’s right.

To begin with I created the sociomatrix of Olympian Tweeters. A sociomatrix is a table where every person in a group gets a row and a column. Each cell in the table indicates whether a relationship exists between two people (the row person and the column person). To indicate this, one just places a zero in the cell if the relationship does not exist and a one if it does.

	Jack	Rose	Cal
Jack		1	0
Rose	1		0
Cal	0	1

Example Sociomatrix. The relationship is row in love with column as per James Cameron’s Titanic.
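For anyone who wants to tinker, here is a minimal sketch of building the example table from a list of (row, column) pairs and totalling its rows and columns. The function name is illustrative, and the diagonal is stored as 0 rather than left blank.

```python
def build_sociomatrix(people, edges):
    """Return a square 0/1 table: matrix[i][j] == 1 if person i relates to person j."""
    index = {name: i for i, name in enumerate(people)}
    matrix = [[0] * len(people) for _ in people]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


people = ["Jack", "Rose", "Cal"]
# Each pair reads "row person is in love with column person".
edges = [("Jack", "Rose"), ("Rose", "Jack"), ("Cal", "Rose")]

matrix = build_sociomatrix(people, edges)

# Row sum: how many people this person is in love with.
row_sums = {p: sum(matrix[i]) for i, p in enumerate(people)}
# Column sum: how many people are in love with this person.
col_sums = {p: sum(row[j] for row in matrix) for j, p in enumerate(people)}

print(row_sums)
print(col_sums)
```

An edge list like `edges` is usually the more compact way to store such data, because most cells in a large sociomatrix are zero.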

I created a sociomatrix of Olympians on Twitter where the relationship was follows. Given a sociomatrix, row sums and column sums are usually interesting, quick summaries of the data. In our case, a row sum is the number of Olympians one particular account follows. A column sum is the number of Olympians following a given account. So, without further ado, let’s get to our first event: Olympian most followed by other Olympians.

Most Followed (by other Olympians)

Medal	Olympian	Followed By
Gold	@BillyDemong	30
Gold	@Shaun_White	30
Gold	@ApoloOhno	30
Silver	@lindseyvonn	28
Silver	@emilycook	28
Bronze	@GretchenBleiler	25

Do they allow ties in the real Olympics? Probably not, but since these are virtual gold medals I’m handing out, why not?

You can probably guess the next event. And this would probably be the easiest event to win if you knew it was coming. We know who has the most followers, but who does the most following?

Most Follows (of Olympians)

Medal	Olympian	Follows
Gold	@emilycook	73
Silver	@StevenHolcomb	34
Bronze	@TFletchernordic	32

The Sesquipedaliathon

Medal	Olympian	Syllables per word	Longest word
Gold	@LMCHOLEWINSKI	1.88	obesity
Silver	@AngelaRuggiero	1.58	sustainability
Bronze	@Pchiddy	1.57	anniversary

In the sesquipedaliathon, Olympians compete on their vocabularies. Tweeters are ranked by the mean number of syllables in the words in their tweets. Polysyllabic expressions win out over short words.

Sesquipedalian tweets may be the mark of a skilled wordsmith discussing a complex topic, or they may be the result of needless pretentiousness. Syllables per word is one component of the Flesch-Kincaid readability scale. According to the Flesch-Kincaid scale, the more syllables-per-word one uses, the more sophisticated the writing (or the less readable the text, depending on how you want to look at it).
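Syllables per word can be approximated without a pronunciation dictionary. The sketch below uses a common vowel-group heuristic, which is an assumption and not necessarily how the rankings above were scored; it miscounts some words (for example, it gives "table" one syllable), so treat its output as a rough estimate.

```python
import re


def count_syllables(word):
    """Estimate syllables as runs of vowels, dropping a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def syllable_stats(tweets):
    """Return (mean syllables per word, longest word) across a list of tweets."""
    words = []
    for tweet in tweets:
        words.extend(re.findall(r"[a-z]+(?:'[a-z]+)?", tweet.lower()))
    if not words:
        return 0.0, ""
    mean = sum(count_syllables(w) for w in words) / len(words)
    return mean, max(words, key=len)


# Hypothetical tweets for illustration.
mean, longest = syllable_stats(["the cat", "obesity"])
print(mean, longest)
```

A real pipeline would probably also want to strip URLs, hashtags and @-handles before counting, since those are not ordinary words.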

The gold winner @LMCHOLEWINSKI is tweeting at about a 10th grade level. @LMCHOLEWINSKI’s tweets clock in at about the same level as the discourse in the United States Congress, according to recent analyses.

(For fun, I checked the syllables per word my dissertation tweetbot outputs. At 1.61, my doctoral dissertation would take home a silver.)

Games-Dropping

Medal	Olympian	Tweets about “Games” or “Olympics”
Gold	@ShaniDavis	19
Silver	@AngelaRuggiero	17
Bronze	@GretchenBleiler	16

For this event, Olympians score every time they use the word “games” or “Olympics.” So the medal winners are (presumably) those who are talking about the Olympics most often.

Sexy At-Mentions

Medal	Olympian	Sexy At-Mentions
Gold	@vitya_zvesda	16
Silver	@lindseyvonn	15
Bronze	@louievito	11

Yes, it has come to this. I needed to find something to do with at-mentions, right? So why not count for each Olympian how many times someone calls them sexy in a tweet? And why stop with sexy?

One point for each tweet that mentions the athlete by their twitter handle and also contains one of the following words: hot, sexy, babe, handsome, pretty, beautiful or cute.
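The scoring rule above can be sketched as follows. The word list is the one given in this post; the tokenization and the sample tweets are assumptions made for illustration, and a plain word match like this will also count innocent uses such as "hot chocolate".

```python
import re

SEXY_WORDS = {"hot", "sexy", "babe", "handsome", "pretty", "beautiful", "cute"}


def count_sexy_mentions(tweets, handle):
    """Count tweets that contain the handle and at least one word from the list."""
    handle = handle.lower()
    total = 0
    for tweet in tweets:
        # Split into lowercase tokens, keeping a leading @ on handles.
        tokens = set(re.findall(r"@?\w+", tweet.lower()))
        if handle in tokens and tokens & SEXY_WORDS:
            total += 1
    return total


# Hypothetical tweets.
sample = [
    "@lindseyvonn is so cute!",
    "Great run by @lindseyvonn today",
    "That was pretty cool",
]

score = count_sexy_mentions(sample, "@lindseyvonn")
print(score)
```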

Non-Olympian Most Followed by Olympians

Medal	Tweeter	Olympians Following
Gold	@lancearmstrong	30
Silver	@ConanOBrien	24
Bronze	@BarackObama	20
Bronze	@TheEllenShow	20
Bronze	@StephenAtHome	20
Bronze	@universalsports	20
Bronze	@shitmydadsays	20

This event was the toughest – as far as programming time goes. First, I grabbed every account that the Olympians on my list follow. Then I aggregated to find out exactly how many Olympians followed each account. Then I filtered out Olympians to get this list of non-Olympians most followed by Olympians.
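The three steps (grab, aggregate, filter) can be sketched with a `Counter`. The `follows` mapping and its handles are hypothetical stand-ins for data that would come from the Twitter API, and the function name is illustrative.

```python
from collections import Counter


def most_followed_outsiders(follows, top=3):
    """Rank accounts outside the group by how many group members follow them.

    `follows` maps each group member to the list of accounts they follow.
    """
    members = set(follows)
    # Aggregate: count each followed account once per member who follows it.
    counts = Counter(
        account for followees in follows.values() for account in set(followees)
    )
    # Filter: drop accounts that belong to the group itself.
    for member in members:
        counts.pop(member, None)
    return counts.most_common(top)


# Hypothetical data: three Olympians and the accounts each one follows.
follows = {
    "@olympian_a": ["@olympian_b", "@news", "@comic"],
    "@olympian_b": ["@olympian_a", "@news"],
    "@olympian_c": ["@news", "@comic", "@olympian_a"],
}

ranking = most_followed_outsiders(follows)
print(ranking)
```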

That’s the last of the events for now. Please check below for updates, and leave ideas for new Twitter Olympics events in the comments!

Resources

Download the sociomatrix of Olympians on Twitter. (The data is actually in edge-list form, but all the information you need to build the sociomatrix is there.) Data gathered 7/22/2012.
My source for Olympians on Twitter is Twitter’s list of verified Olympians.
Twitter API Documentation

UPDATE: The list of Olympians used here came straight from Twitter’s verified accounts page. However, it’s rather wonky. I have a new, better list of London 2012 Olympians on Twitter and I’ll be re-running all of these analyses on this list. Check for a link to the London 2012 version of these events on Friday the 27th.

UPDATE: London 2012 Twitter Olympics now available.

Employment Progression

In my new dataset, each row is a series of jobs that one person has had.

Most of them are quotidian:
Junior Tax Analyst –> Senior Tax Analyst
Investment Banking –> Investment Banking –> Investment Banking

Some of them are funny:
Corn Detassler –> Flight Delivery Center Technician
Quabbity Assuance –> Electronics Sales Associate

Some baffling:
Gymnast –> Air Traffic Controller
bust boy –> bust boy –> bust boy?

And some inspiring:
Dishwasher –> Dishwasher –> Model

Account	inDegree	outDegree	Eigen. Centrality
JaciKettler	8	14	0.36
smotus	23	12	0.32
kwcollins	14	12	0.31
jlove1982	6	9	0.28
JohnCluverius	7	9	0.28
JeffGulati	6	9	0.25
RebeccaHannagan	2	8	0.24
therriaultphd	12	8	0.23
BrendanNyhan	21	9	0.21
davekarpf	7	7	0.21
richardmskinner	6	8	0.21
ianpcook	3	7	0.2
hsquared47	3	7	0.19
jon_m_rob	0	6	0.16
sissenberg	7	6	0.16
First_Street	1	6	0.15
FHQ	12	4	0.1
heathbrown	2	5	0.1
DocPolitics	1	4	0.09
ajungherr	0	3	0.07
archimedino	0	3	0.07
GeoffLorenz	0	3	0.07
JoeLenski	2	4	0.07
slimbock	0	2	0.05
James_H_Fowler	2	3	0.04
PolNetworks	34	1	0.04
jasonjones_jjj	1	3	0.03
krmckelv	1	2	0.03
dogaker	0	1	0.01
DominikBatorski	0	1	0.01
janschulz	0	1	0.01
jboxstef	0	1	0.01
matthewhitt	0	1	0.01
ophastings	0	1	0.01
stefanjwojcik	0	1	0.01

The Shotgun Approach

Double-barreled, smoothbore research and distraction.
