Comments on Lorge & Kruglov, 1950

Summary and commentary for “The Relationship Between the Readability of Pupils’ Compositions and Their Measured Intelligence”

Irving Lorge and Lorraine Kruglov

The Journal of Educational Research, Vol. 43, No. 6 (Feb., 1950), pp. 467-474

I have to admit it was a little bit dispiriting to read this article.  First, it describes a project very similar to the one I am about to undertake.  Second, this project beat me to the punch by more than fifty years.  Third, the findings were negative, while I’m expecting my findings to be positive.  And finally, in the 62 years this article has existed, it has garnered exactly 7 citations, so I have to wonder how interested the academy will be in the project I am just starting.  Anyway, back to the article at hand.

In this paper, Lorge and Kruglov use the high-school entrance exam scores of 50 eighth- and ninth-graders to correlate the “readability” of the students’ writing to the same students’ scores on the intelligence-testing portion of the same exam.  They find positive correlations, but the values are low (~.10) and not significantly different from zero.  They conclude that for people matched on education and age level, the complexity of their writing is not a good predictor/substitute/correlate of general intelligence.

The main reason they do not find a significant correlation is likely to be the restricted range of the data.  In the article, the authors mention two successful demonstrations of correlation between readability measures and education levels.  It seems Lorge and Kruglov were too ambitious in thinking that readability would be successful in predicting intelligence in a small sample of relatively similar students: all were eighth- and ninth-graders in New York schools applying for a selective science high school.

One could rightly argue that the data are nearly useless in answering the question of whether there exists a relationship between writing complexity and intelligence in general.  The lack of a significant correlation in this narrow range of measured data points does not disprove an overall relationship that may still exist.
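A quick simulation makes the restricted-range argument concrete.  This is only a sketch with made-up numbers (a hypothetical population correlation of 0.6 and a hypothetical "top 10%" selection rule), not a reanalysis of Lorge and Kruglov's data.  It shows how a correlation that is clearly present in a full population shrinks when only a narrow slice of that population is sampled.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)   # fixed seed so the sketch is repeatable
TRUE_R = 0.6             # assumed population correlation (hypothetical)

# Simulate (intelligence, writing complexity) pairs for a broad population.
population = []
for _ in range(20000):
    intelligence = rng.gauss(0, 1)
    noise = rng.gauss(0, 1)
    writing = TRUE_R * intelligence + (1 - TRUE_R ** 2) ** 0.5 * noise
    population.append((intelligence, writing))

full_r = pearson(*zip(*population))

# Keep only the top 10% on intelligence, a stand-in for a selective applicant pool.
cutoff = sorted(p[0] for p in population)[int(0.9 * len(population))]
selected = [p for p in population if p[0] >= cutoff]
restricted_r = pearson(*zip(*selected))

print(f"full population r = {full_r:.2f}")
print(f"restricted sample r = {restricted_r:.2f}")
```

The restricted sample still comes from a population where the relationship exists, yet its correlation comes out much weaker.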

The paper is important in practical terms.  Suppose the test evaluators had intended to use Lorge Readability as the sole measure of subjects’ ability.  The fact that it does not correlate with intelligence in this sample shows this would be a grave mistake.

I still hypothesize that – in general – writing complexity and intelligence will be correlated, but this article gave me some pause.  If evaluation in a narrow range is the goal, I will need to be extremely careful as to whether my methods are rigorous and precise enough to meet that goal.  And I will need to be clear in explaining that they do not, if that is the case.

Quick hits:

  • It sounds like the authors had thousands of exam results to choose from and chose 50 at random for this study.  Times change, I guess.  Although I might have done the same if I were computing all the scores and correlations by hand.
  • On average, students write two grade levels below their current level.  The authors claim this is because students’ comprehension runs ahead of their ability to compose.
  • The intelligence measure was the total score on 30 arithmetic problems, 60 multiple-choice vocabulary questions and 15 “proverb-matching” items.  Compositions were of ~100 words.  I wonder how much longer compositions or multiple compositions per student would have increased the precision of the readability measure.

Learning How Things Go Together

[This is my attempt at converting my dissertation abstract to “Up-Goer Five speak” (i.e. using only the 1000 most-frequently used English words).  For context, here’s the xkcd comic that started the trend.  Search the #upgoer5 hashtag on Twitter for more.  Try it yourself on the Up-Goer Five text editor.]

Big things are just many small things put together. It would be good to know which small things go together. You could learn how a brain works by thinking this way. Or you could learn which people like which other people. Thinking about how small things are put together to make big things is a good idea. It would be good to know how we learn, and how we should learn which things go together.

To this end, I did five studies in which people learned which things in a set were joined together. To show you what I mean, some people learned “who is friends with who” in a friend group. But other people learned about other things that were joined together – like which cities have roads that go between them. By doing these studies, I found out a few things. One thing I learned was that it matters how the things are joined up. To show you what I mean, think about the friend group again. It is easier to learn who is friends with who in a group where few people have many friends and many people have few friends. If things are more even, and all people have about the same number of friends, it is hard to learn exactly who is friends with who.

It doesn’t matter if the joined things are people or cities or computers. It is all the same. Also, it doesn’t seem to matter much why it is you are learning what things go together.

I also show that people learn better by seeing a picture of joined-together things rather than reading about joined-together things. This is the case even more when the things that are joined are made to be close together in the picture.

Finally, I talk about an all-around idea for how people learn about groups of joined together things. I say people start out by quickly sorting things into much-joined and few-joined types. Then they more slowly learn which one thing is joined to which one other thing a little at a time.

Dollar Value of Personal Data


Personal Data – Median Fair Price

How much is your personal data worth?  Worth, as in – how much should you sell it for?  In dollar terms.

I went looking for attempts to answer this question and didn’t find much.  So I took a shortcut and asked a bunch of people.  Here’s how I set up the survey:

Imagine your friend has just told you about his new job.  He now works for a company that pays people for their personal data.  For example, you would tell the company your name and the name of your favorite TV show, and you would receive a certain amount of money in return.

Your friend is in charge of setting fair prices for each piece of personal data.  He needs advice from you about how much to pay.  For each of the items below, please provide the price you believe would be fair to pay someone to provide that information.

More methodology details are below, but let’s get straight to the results.  The table below lists each facet of personal data requested and the survey respondents’ median dollar price (see footnote 1).

Personal Data Median Fair Price
Home or Cell Phone Number $6.25
Home Address (Street Address) $5.00
Name of Employer $5.00
Previous Employers $5.00
Brand Name of Bank Used Most Often $3.00
Brand Name of Credit Card Used Most Often $2.00
Link to (Public) Facebook Profile $2.00
Twitter Username $2.00
Age(s) of Children Living at Home $1.75
Yearly Income $1.50
Make and Model of Car Driven Most Often $1.00
Home Address (City) $1.00
Brand Name of Computer Used Most Often $1.00
Date of Birth (Month, Day and Year) $1.00
Highest Level of Education $1.00
College Major $1.00
Marital Status $1.00
Political Party Affiliation $1.00
Religious Affiliation $1.00
Home Address (State) $1.00
Home Address (Zip Code) $1.00
Gender $0.50
Favorite Book $0.50
Favorite Movie $0.50
Favorite Restaurant $0.50
Favorite Song $0.50
Favorite TV Show $0.50

Results

  • People just don’t want to be bothered.  Phone number and street address are the pieces of personal information held most dear.
  • Employer and previous employer data are curiously highly valued.
  • Twitter and Facebook identities are valued more highly than a number of demographic variables, including annual income.

What Personal Data is it Inappropriate Even to Ask About?

I wanted to make an option available to users to indicate that no price could ever persuade them to part with some data.  The option I settled upon was a checkbox labeled “None / Inappropriate / Should Not Ask” that could be selected instead of entering a dollar value.

In the table below, I list the personal data labels and the number of respondents who chose “None / Inappropriate / Should Not Ask” instead of entering a dollar value for the item.

Personal Data Marked Inappropriate
Previous Employers 37
Home or Cell Phone Number 36
Home Address (Street Address) 34
Name of Employer 34
Brand Name of Bank Used Most Often 34
Brand Name of Credit Card Used Most Often 30
Link to (Public) Facebook Profile 29
Twitter Username 28
Age(s) of Children Living at Home 22
Date of Birth (Month, Day and Year) 22
Make and Model of Car Driven Most Often 20
Home Address (State) 17
Religious Affiliation 15
Home Address (City) 15
Political Party Affiliation 14
Home Address (Zip Code) 14
Brand Name of Computer Used Most Often 13
Highest Level of Education 7
College Major 7
Marital Status 7
Gender 7
Favorite Book 6
Favorite Movie 6
Favorite Song 6
Favorite TV Show 6
Favorite Restaurant 5
Yearly Income 1

The ordering should look rather familiar.  The median-price ordering and the inappropriate-to-ask ordering are almost identical. The Spearman rank correlation between the two is 0.96. This suggests that the personal data components that were given high dollar values (by respondents willing to affix a dollar value) were the same components that other respondents thought should be unavailable for sale at any price.
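For anyone who wants to check a rank correlation like this by hand, here is a minimal sketch.  It ranks each list (giving tied values their average rank) and then takes the Pearson correlation of the ranks.  The rows are a handful copied from the two tables above, chosen only for illustration, so the result for this subset should not be expected to match the figure for the full set of 27 items.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# (item, median fair price, times marked inappropriate): a subset of the tables
subset = [
    ("Home or Cell Phone Number", 6.25, 36),
    ("Home Address (Street Address)", 5.00, 34),
    ("Brand Name of Bank Used Most Often", 3.00, 34),
    ("Twitter Username", 2.00, 28),
    ("Date of Birth", 1.00, 22),
    ("Marital Status", 1.00, 7),
    ("Gender", 0.50, 7),
    ("Favorite Restaurant", 0.50, 5),
]
prices = [row[1] for row in subset]
inappropriate = [row[2] for row in subset]
print(f"Spearman correlation for this subset: {spearman(prices, inappropriate):.2f}")
```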

Survey Methodology

I used Mechanical Turk to field a survey to 104 people.  I limited the survey to users in the United States.  Below is a partial screenshot showing exactly what the survey-takers saw when beginning the survey.


Survey Instructions and Example Questions

The average time each survey-taker spent answering questions was just over four minutes.  For each of 27 items, they either gave a dollar value for the fair price or marked the item Inappropriate.

If you are interested in working with the raw fair price response data, please contact me and I will provide it.

Caveats

This survey should not be considered “scientific.”  I did not attempt to obtain a random sample of the human population nor even the United States population.  The sample is representative of those people using Amazon Mechanical Turk and willing to take a survey about the dollar value of personal data.  How much of a limitation that is is up to you.

I specifically asked users to provide “the price you believe would be fair to pay someone” for each item.  I did not ask them what the price would have to be for them to sell their own data.  I purposefully did this to reduce noise due to uniquely personal preferences in the data, but I recognize some might feel it better to ask the price question more directly.

Footnotes

  1. Because of the inevitable positive skew for these questions, the median is both nicer to look at and more representative of the actual distribution than the mean.  One way to think about the median value in this context is that 50% of the people surveyed would accept this value as a fair price, because it meets or exceeds their asking price.
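To see why the median is the better summary under positive skew, here is a tiny sketch.  The five prices are invented for illustration and are not survey responses.

```python
from statistics import mean, median

# Hypothetical fair-price answers for one item; one respondent names a very high price.
prices = [0.50, 1.00, 1.00, 2.00, 100.00]

print("median:", median(prices))  # the middle answer, unaffected by the outlier
print("mean:", mean(prices))      # pulled far upward by the single $100 answer
```

A single extreme answer moves the mean well above what most respondents said, while the median stays at the middle response.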

Twitterjjj – 180 Million Tweets at a Glance

Twitterjjj is a set of scripts I use to take a first look at what people say when they tweet about a particular topic, brand or person.

I wrote it to a small set of (self-imposed) specifications.

  • Create a 1-page report describing how people discuss a keyword or phrase in their tweets.
  • Ensure the report is easy to read on the web, and that results can be easily downloaded for further analysis.
  • Respond immediately with preliminary results and continue to update as the processor churns through the terabytes.

My corpus of tweets starts in April 2012 and continues to the present. To be non-specific about the numbers, there are about 180 million tweets in this corpus, with about 3 million more added every day.

Many of the reports I’ve run I make publicly available.  For instance, you might discover that people use more negative sentiment than positive when they discuss republicans on twitter.  Don’t worry, the same is true for democrats.  You might have guessed that, but did you know that tweeters are 3 times more likely to mention the “GOP” than “republicans”?  Brevity is the soul of twitter.

Beyond politicians, who occupies a lot of twitter mindshare?  Let’s look at the first three celebrity names that popped into my head: Justin Bieber, Lady Gaga and Kim Kardashian.  We’ll measure mindshare in tweets-per-million.  That is, in every one million tweets, how many contain the keyword (a celebrity name in this case)?

Celeb Tweets per Million
Bieber 908
Gaga 541
Kardashian 116

It looks like Bieber fever is more contagious than the Kardashian cough.
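The tweets-per-million measure is simple enough to sketch in a few lines.  This is an illustration of the idea rather than the actual Twitterjjj code: the function name and the sample tweets are made up, and it assumes a plain case-insensitive substring match.

```python
def tweets_per_million(tweets, keyword):
    """Share of tweets containing the keyword, scaled to a per-million rate."""
    if not tweets:
        return 0.0
    needle = keyword.lower()
    hits = sum(1 for tweet in tweets if needle in tweet.lower())
    return hits / len(tweets) * 1_000_000

# Hypothetical four-tweet corpus, just to exercise the function.
sample = [
    "Bieber tickets on sale now",
    "what a beautiful morning",
    "stuck in traffic again",
    "coffee first, then email",
]
print(tweets_per_million(sample, "bieber"))
```

Scaling to a per-million rate keeps the numbers comparable as the corpus grows by millions of tweets a day.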

Find me on Twitter @jasonjones_jjj.  I’m happy to hear feedback and suggestions for new keywords to explore.

London 2012 Twitter Olympics

The London 2012 Olympics are upon us.  Lots of athletes will be judged, timed and measured for the athletic things they do.  But aren’t the non-athletic things they do much more interesting?  Like tweeting?

No, of course not.  But that’s not going to stop me from holding my very own Twitter Olympics and handing out (virtual) medals for exceptional Twitter performances.

Below is the list of events and the results so far.  I’ll be rolling out more results as the real Olympics go on.

Preview Event: Games-Dropping

  • Who’s most pumped up for the games? In this event, Olympians score one point for every time they have used the words Olympics, Games, or London in their recent tweets.
Olympian Games, Olympics or London in Tweets
@SwissDom 41
@lolojones 37
@NickSymmonds 28

Full Results:  london2012_gamesd (.xlsx)

Sexy At-Mention

  • In this event, a sampling of thousands of tweets mentioning Olympic athletes was scored. The winning Olympian was the one with the most co-occurrences of their Twitter handle and the words “sexy,” “hottest,” “beautiful,” “cute,” “handsome,” “pretty,” or “babe.”
Olympian Sexy At-Mentions
@matthew_mitcham 25
@Joeingles7 8
@hopesolo 7

Full Results:  london2012_sexy (.xlsx)

The Sesquipedaliathon

  • In this event, scores are awarded based on the average number of syllables per word in the Olympian’s tweets.
Olympian Syllables Per Word Longest Word
@juanmata10 1.76 visitaremos
@Njr92 1.66 spideranderson
@TipsarevicJanko 1.62 pantomime

Full Results:  london2012_sesq

Most Followed

  • In this event, one point is scored for each Twitter follower.
Olympian Followers
@Njr92 4,953,514
@juanmata10 1,164,329
@DjokerNole 1,111,326

Full Results:  london2012_user_info

Most Followed (by other Olympians)

  • In this event, one point is scored for each Olympian follower.
Olympian Fellow Olympian Followers
@MichaelPhelps 12
@usainbolt 11
@lolojones 9

Full Results:  london2012_degrees

London 2012 Olympians Twitter Follow Network

London 2012 Olympians Twitter Follow Network.  Arrows point from follower to followee.  Click the picture to view a larger version.

Most Follows

  • In this event, one point is scored for each Twitter user the Olympian follows.
Olympian Followees
@officialasafa 3,794
@Njr92 630
@TomDaley1994 542

Full Results: london2012_user_info

Most Follows (of other Olympians)

  • In this event, one point is scored for each fellow Olympian the Olympian follows.
Olympian Follows X Fellow Olympians
@ItsStephRice 8
@OscarPistorius 7
@RickyBerens 7
@drewsullivan8 6
@matthew_mitcham 6
@MichaelPhelps 6
@PopsMBonsu 6

Full Results: london2012_degrees

Special Event: Non-Olympian Most Followed by Olympians

  • The only event (so far) in which non-Olympians compete. Medals to those non-Olympians who are followed by the most Olympians.
Non-Olympian Name Olympian Followers
@OMGFacts OMG Facts 10
@NBA NBA 9
@SportsCenter SportsCenter 9
@espn ESPN 8
@Sports_Greats Sports Quotes 8

Full Results:  london2012_nonlist_followees

Olympic Followback

  • Most athletes have many more followERS than followEES. In this event, Olympians are scored according to the proportion of their followers that they follow back.
Olympian Followback Percentage
@drewsullivan8 16.7%
@SmoothKJ88 14.2%
@EricBoateng 10.1%

Full Results:  london2012_followback
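For the curious, the followback score can be sketched as a set intersection.  The function name and the account names below are hypothetical, and this assumes the follower and followee lists for an account are already in hand.

```python
def followback_percentage(followers, followees):
    """Percentage of an account's followers that the account follows back."""
    followers = set(followers)
    if not followers:
        return 0.0
    followed_back = followers & set(followees)
    return 100 * len(followed_back) / len(followers)

# Hypothetical account: four followers, one of whom is followed back.
followers = ["fan_a", "fan_b", "fan_c", "teammate_d"]
followees = ["teammate_d", "some_news_account"]
print(followback_percentage(followers, followees))
```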


Olympians on Twitter Olympics

The Olympics bring together the world’s most talented and dedicated athletes.  And so does Twitter.  As a part of my continuing effort to try to do interesting things with the Twitter API, I decided to create my own Olympics for Olympians on Twitter. Er, yeah I think that’s right.

To begin with, I created the sociomatrix of Olympian Tweeters.  A sociomatrix is a table where every person in a group gets a row and a column.  Each cell in the table indicates whether a relationship exists between two people (the row person and the column person).  To indicate this, one just places a zero in the cell if the relationship does not exist and a one if it does.

        Jack   Rose   Cal
Jack     -      1      0
Rose     1      -      0
Cal      0      1      -

Example Sociomatrix.  The relationship is row in love with column as per James Cameron’s Titanic.

I created a sociomatrix of Olympians on Twitter where the relationship was “follows.”  Given a sociomatrix, row sums and column sums are usually interesting, quick summaries of the data.  In our case, a row sum is the number of Olympians one particular account follows.  A column sum is the number of Olympians following a given account.  So, without further ado, let’s get to our first event:  Olympian most followed by other Olympians.
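Here is a small sketch of row and column sums, using the Titanic example rather than the real Olympian data.  One assumption: the diagonal cells (a person's relationship with themselves) are filled with zeros so they do not affect the sums.

```python
names = ["Jack", "Rose", "Cal"]

# matrix[i][j] == 1 means "row person i is in love with column person j".
matrix = [
    [0, 1, 0],  # Jack loves Rose
    [1, 0, 0],  # Rose loves Jack
    [0, 1, 0],  # Cal loves Rose
]

# Row sum: how many people the row person loves (for Twitter: how many they follow).
row_sums = {name: sum(row) for name, row in zip(names, matrix)}

# Column sum: how many people love the column person (for Twitter: how many follow them).
col_sums = {name: sum(row[j] for row in matrix) for j, name in enumerate(names)}

print("row sums:", row_sums)
print("column sums:", col_sums)
```

In this toy matrix Rose has the largest column sum, which is exactly the quantity the "most followed" event ranks on.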

Most Followed (by other Olympians)

Medal Olympian Followed By
Gold @BillyDemong 30
Gold @Shaun_White 30
Gold @ApoloOhno 30
Silver @lindseyvonn 28
Silver @emilycook 28
Bronze @GretchenBleiler 25

Do they allow ties in the real Olympics?  Probably not, but since these are virtual gold medals I’m handing out, why not?

You can probably guess the next event.  And this would probably be the easiest event to win if you knew it was coming.  We know who has the most followers, but who does the most following?

Most Follows (of Olympians)

Medal Olympian Follows
Gold @emilycook 73
Silver @StevenHolcomb 34
Bronze @TFletchernordic 32

The Sesquipedaliathon

Medal Olympian Syllables per word Longest word
Gold @LMCHOLEWINSKI 1.88   obesity
Silver @AngelaRuggiero 1.58   sustainability
Bronze @Pchiddy 1.57   anniversary

In the sesquipedaliathon, Olympians compete on their vocabularies.  Tweeters are ranked by the mean number of syllables in the words in their tweets.  Polysyllabic expressions win out over short words.

Sesquipedalian tweets may be the mark of a skilled wordsmith discussing a complex topic, or they may be the result of needless pretentiousness.  Syllables per word is one component of the Flesch-Kincaid readability scale.  According to the Flesch-Kincaid scale, the more syllables-per-word one uses, the more sophisticated the writing (or the less readable the text, depending on how you want to look at it).
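As a rough sketch of how such a score can be computed: the syllable counter below is a crude heuristic I am assuming for illustration (it counts runs of vowels, so it overcounts words with a silent "e"), not necessarily the method used for the medal table.  The grade-level function is the standard Flesch-Kincaid grade formula, in which syllables per word is one of the two inputs.

```python
import re

def count_syllables(word):
    """Crude estimate: one syllable per run of vowels, with a minimum of one."""
    vowel_runs = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_runs))

def syllables_per_word(text):
    """Mean estimated syllables per word in a piece of text."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not words:
        return 0.0
    return sum(count_syllables(w) for w in words) / len(words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from total word, sentence and syllable counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

print(count_syllables("sustainability"))
print(syllables_per_word("the cat sat"))
print(round(flesch_kincaid_grade(words=100, sentences=5, syllables=150), 2))
```

Because the formula also depends on words per sentence, syllables per word alone does not pin down a grade level.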

The gold winner @LMCHOLEWINSKI is tweeting at about a 10th grade level.  @LMCHOLEWINSKI’s tweets clock in at about the same level as the discourse in the United States Congress,  according to recent analyses.

(For fun, I checked the syllables per word my dissertation tweetbot outputs.  At 1.61, my doctoral dissertation would take home a silver.)

Games-Dropping

Medal Olympian Tweets about “Games” or “Olympics”
Gold @ShaniDavis 19
Silver @AngelaRuggiero 17
Bronze @GretchenBleiler 16

For this event, Olympians score every time they use the word “games” or “Olympics.”  So the medal winners are (presumably) those who are talking about the Olympics most often.
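A minimal sketch of the scoring, assuming whole-word, case-insensitive matching and one point per occurrence.  The function name and the sample tweets are invented for illustration.

```python
import re

def games_dropping_score(tweets, words=("games", "olympics")):
    """One point for every whole-word occurrence of any target word."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    return sum(len(pattern.findall(tweet)) for tweet in tweets)

# Hypothetical tweets from one athlete.
tweets = [
    "Off to the Games!",
    "Olympics, here I come. Best games ever",
    "A little gamesmanship never hurt",  # no whole-word match, so no points
]
print(games_dropping_score(tweets))
```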

Sexy At-Mentions

Medal Olympian Sexy At-Mentions
Gold @vitya_zvesda 16
Silver @lindseyvonn 15
Bronze @louievito 11

Yes, it has come to this.  I needed to find something to do with at-mentions, right?  So why not count for each Olympian how many times someone calls them sexy in a tweet?  And why stop with sexy?

One point for each tweet that mentions the athlete by their twitter handle and also contains one of the following words: hot, sexy, babe, handsome, pretty, beautiful or cute.
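That scoring rule can be sketched as follows.  The handle and tweets are made up, and whole-word matching is an assumption on my part (so "hotel" does not count as "hot").

```python
import re

COMPLIMENTS = {"hot", "sexy", "babe", "handsome", "pretty", "beautiful", "cute"}

def sexy_at_mentions(tweets, handle):
    """One point per tweet that mentions the handle and uses a compliment word."""
    handle = handle.lower()
    score = 0
    for tweet in tweets:
        # Tokens are words, optionally prefixed with "@" so handles stay intact.
        tokens = set(re.findall(r"@?\w+", tweet.lower()))
        if handle in tokens and tokens & COMPLIMENTS:
            score += 1
    return score

# Hypothetical tweets mentioning a hypothetical athlete.
tweets = [
    "@skier_x is so cute!",
    "Great run by @skier_x today",
    "what a pretty hotel",
    "@SKIER_X looking handsome and hot",  # two compliment words, still one point
]
print(sexy_at_mentions(tweets, "@skier_x"))
```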

Non-Olympian Most Followed by Olympians

 

Medal Tweeter Olympians Following
Gold @lancearmstrong 30
Silver @ConanOBrien 24
Bronze @BarackObama 20
Bronze @TheEllenShow 20
Bronze @StephenAtHome 20
Bronze @universalsports 20
Bronze @shitmydadsays 20

This event was the toughest – as far as programming time goes.  First, I grabbed every account that the Olympians on my list follow.  Then I aggregated to find out exactly how many Olympians followed each account.  Then I filtered out Olympians to get this list of non-Olympians most followed by Olympians.
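The three steps look roughly like this in code.  This is a sketch of the logic only: it assumes the followee lists have already been fetched into a dictionary, and the handles are placeholders rather than real accounts.

```python
from collections import Counter

def most_followed_non_olympians(follows, top_n=3):
    """follows maps each Olympian's handle to the handles that Olympian follows."""
    olympians = set(follows)

    # Step 1 and 2: tally how many Olympians follow each account.
    counts = Counter()
    for followees in follows.values():
        counts.update(set(followees))  # set() so an account counts once per Olympian

    # Step 3: filter out accounts that are themselves Olympians.
    for handle in olympians:
        counts.pop(handle, None)

    return counts.most_common(top_n)

# Hypothetical follow lists for three Olympians.
follows = {
    "@oly_a": ["@oly_b", "@news", "@comic"],
    "@oly_b": ["@oly_a", "@news"],
    "@oly_c": ["@news", "@comic", "@oly_a"],
}
print(most_followed_non_olympians(follows, top_n=2))
```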

That’s the last of the events for now.  Please check below for updates, and leave ideas for new Twitter Olympics events in the comments!



UPDATE: The list of Olympians used here came straight from Twitter’s verified accounts page.  However, it’s rather wonky.  I have a new, better list of London 2012 Olympians on Twitter and I’ll be re-running all of these analyses on this list.  Check for a link to the London 2012 version of these events on Friday the 27th.

UPDATE: London 2012 Twitter Olympics now available.

Twitter Follow Network for Political Networks Conference

I am currently attending the 5th Annual Political Networks Conference in beautiful Boulder, CO.  On twitter, the conference is served by the account @PolNetworks and the hash tag #PolNet2012.  Just for fun, below is a depiction of the follow network for the @PolNetworks account and all the twitter users who follow @PolNetworks.


Figure: Best described as the first-degree egocentric follow network of @PolNetworks.  Click the picture for a larger version.

This is a directed graph.  Arrows point from follower to followee.  Obviously, PolNetworks is in the center of this graph, because every user follows PolNetworks.

Graph Density:  0.15

Graph Transitivity:  0.56

Graph Connectedness:  1.00

Graph Efficiency:  0.87
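As a sanity check on the density figure, here is a short sketch.  It assumes the usual definition of density for a directed graph without self-loops (edges divided by n × (n − 1)), which may or may not be exactly what the software that produced these numbers uses.  The in-degree values are copied from the inDegree column of the node-level table in this post; every follow edge adds one to exactly one account's in-degree, so their sum is the edge count.

```python
# inDegree column from the node-level table, in table order (35 accounts).
in_degrees = [
    8, 23, 14, 6, 7, 6, 2, 12, 21, 7,
    6, 3, 3, 0, 7, 1, 12, 2, 1, 0,
    0, 0, 2, 0, 2, 34, 1, 1, 0, 0,
    0, 0, 0, 0, 0,
]

n_nodes = len(in_degrees)
n_edges = sum(in_degrees)

# Directed graph, no self-loops: n * (n - 1) possible edges.
density = n_edges / (n_nodes * (n_nodes - 1))

print(f"{n_edges} edges among {n_nodes} accounts, density = {density:.2f}")
```

Under that definition the table's in-degrees reproduce the reported density.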

Some Node-Level Measures:

Account inDegree outDegree Eigen. Centrality
JaciKettler 8 14 0.36
smotus 23 12 0.32
kwcollins 14 12 0.31
jlove1982 6 9 0.28
JohnCluverius 7 9 0.28
JeffGulati 6 9 0.25
RebeccaHannagan 2 8 0.24
therriaultphd 12 8 0.23
BrendanNyhan 21 9 0.21
davekarpf 7 7 0.21
richardmskinner 6 8 0.21
ianpcook 3 7 0.2
hsquared47 3 7 0.19
jon_m_rob 0 6 0.16
sissenberg 7 6 0.16
First_Street 1 6 0.15
FHQ 12 4 0.1
heathbrown 2 5 0.1
DocPolitics 1 4 0.09
ajungherr 0 3 0.07
archimedino 0 3 0.07
GeoffLorenz 0 3 0.07
JoeLenski 2 4 0.07
slimbock 0 2 0.05
James_H_Fowler 2 3 0.04
PolNetworks 34 1 0.04
jasonjones_jjj 1 3 0.03
krmckelv 1 2 0.03
dogaker 0 1 0.01
DominikBatorski 0 1 0.01
janschulz 0 1 0.01
jboxstef 0 1 0.01
matthewhitt 0 1 0.01
ophastings 0 1 0.01
stefanjwojcik 0 1 0.01

Edge list in xlsx format:  polnetworks_edge_list

Data collected 6/12/2012

Employment Progression

In my new dataset, each row is a series of jobs that one person has had.

Most of them are quotidian:
Junior Tax Analyst –> Senior Tax Analyst
Investment Banking –> Investment Banking –> Investment Banking

Some of them are funny:
Corn Detassler –> Flight Delivery Center Technician
Quabbity Assuance –> Electronics Sales Associate

Some baffling:
Gymnast –> Air Traffic Controller
bust boy –> bust boy –> bust boy?

And some inspiring:
Dishwasher –> Dishwasher –> Model