Computing Reading Speed

Ashley and I have been going over the District Data Labs Blog, trying to figure out how to make it more accessible to readers (who are at various levels) and how to encourage writers to contribute. To that end, she’s been exploring other blogs to see if we can put up multiple forms of content: long-form tutorials (the bulk of what’s there) and shorter idea articles, possibly even as short as the posts I put on my dev journal. One interesting suggestion she had was to mark the reading time of each post, something that the Longreads Blog does. This may give readers a better sense of the time commitment and help them engage more easily.

So computing the reading time is simple, right? Take the number of words in the post, divide by the average words-per-minute reading rate, and bam: the number of minutes per post. Also, we’re not going to simply split on spaces, we know better, so we can use NLTK’s word_tokenize function. Seems like we’re good to go, but what’s the words-per-minute reading rate of the average DDL reader?
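As a minimal sketch of that arithmetic (not the code we actually ran), the function below divides a word count by a reading rate. The names count_words and reading_minutes are made up for illustration, the default regex tokenizer is a stand-in so the snippet runs without NLTK's tokenizer data, and dropping punctuation-only tokens is an assumption on my part.

```python
import re


def count_words(text, tokenize=None):
    """Count word tokens in text.

    tokenize may be any callable that returns a list of tokens, for
    example nltk.word_tokenize. The default regex is only a stand-in.
    """
    if tokenize is None:
        # Words, optionally joined by an apostrophe or hyphen ("it's", "well-known").
        tokens = re.findall(r"\w+(?:['’-]\w+)*", text)
    else:
        # Assumption: punctuation-only tokens should not count as words.
        tokens = [t for t in tokenize(text) if any(c.isalnum() for c in t)]
    return len(tokens)


def reading_minutes(text, wpm, tokenize=None):
    """Estimated reading time in minutes: word count divided by words per minute."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    return count_words(text, tokenize) / wpm
```

To use NLTK instead, something like reading_minutes(post_text, 250, tokenize=nltk.word_tokenize) should work once the tokenizer data is downloaded; the 250 here is just a placeholder rate.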

After a bit of a search, we first found a study published by Reading Plus that charted the normal reading rate in words per minute against high school grade level. Unfortunately, this led to the question: what level is our content at? Further searching found an LSAT reading speed calculation formula by Graeme Blake, moderator of the Reddit LSAT forum. We figured our content is probably as complex as the LSAT, and moreover, he gave speeds for slow, average, high average, fast, and rare LSAT students.

We ran each of these WPM speeds against published articles in the DDL corpus and came up with the following estimated reading times for each title:

Post	LSAT	Slow	Average	Fast
Announcing the District Data Labs Blog	26 seconds	23 seconds	18 seconds	15 seconds
How to Transition from Excel to R	12 minutes	11 minutes	9 minutes	7 minutes
What Are the Odds?	12 minutes	10 minutes	8 minutes	7 minutes
How to Develop Quality Python Code	28 minutes	25 minutes	20 minutes	17 minutes
Markup for Fast Data Science Publication	16 minutes	14 minutes	11 minutes	9 minutes
The Age of the Data Product	27 minutes	24 minutes	19 minutes	16 minutes
A Practical Guide to Anonymizing Datasets with Python & Faker	19 minutes	17 minutes	14 minutes	11 minutes
Computing a Bayesian Estimate of Star Rating Means	19 minutes	17 minutes	14 minutes	11 minutes
Conditional Probability with R	12 minutes	11 minutes	9 minutes	7 minutes
Creating a Hadoop Pseudo-Distributed Environment	13 minutes	12 minutes	10 minutes	8 minutes
Getting Started with Spark (in Python)	32 minutes	29 minutes	23 minutes	19 minutes
Graph Analytics Over Relational Datasets with Python	11 minutes	10 minutes	8 minutes	7 minutes
An Introduction to Machine Learning with Python	18 minutes	16 minutes	13 minutes	11 minutes
Modern Methods for Sentiment Analysis	12 minutes	11 minutes	9 minutes	7 minutes
Parameter Tuning with Hyperopt	12 minutes	11 minutes	9 minutes	7 minutes
Simple CSV Data Wrangling with Python	18 minutes	16 minutes	13 minutes	11 minutes
Time Maps: Visualizing Discrete Events Across Many Timescales	10 minutes	9 minutes	7 minutes	6 minutes

We’d be happy to have any feedback on whether these times look correct. The code to produce the table follows:
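A minimal sketch of how a table like this could be produced (not the original listing), assuming the word count of each post is already known. The rates in RATES are hypothetical placeholders, not the figures from the LSAT formula, and format_duration and reading_table are illustrative names.

```python
# Hypothetical reading rates in words per minute. These are placeholders,
# NOT the speeds from the LSAT formula mentioned in the post.
RATES = {"Slow": 150, "Average": 250, "Fast": 350}


def format_duration(minutes):
    """Render a duration as 'N seconds' under one minute, else 'N minutes'."""
    seconds = round(minutes * 60)
    if seconds < 60:
        return f"{seconds} second{'' if seconds == 1 else 's'}"
    whole = round(seconds / 60)
    return f"{whole} minute{'' if whole == 1 else 's'}"


def reading_table(word_counts, rates=RATES):
    """Build tab-separated rows: post title, then one reading time per rate."""
    lines = ["\t".join(["Post", *rates])]
    for title, words in word_counts.items():
        times = [format_duration(words / wpm) for wpm in rates.values()]
        lines.append("\t".join([title, *times]))
    return "\n".join(lines)
```

Calling print(reading_table({"Some Post": 3000})) would print a header row followed by one row of times for that post; the title and word count here are invented for the example.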

Of course, this is a straight count of words and does not take into account the number of sections or whether there are any code blocks. In the future, I hope to do an HTML version of this that takes into account the number of paragraphs, the density of each paragraph, and the length of sentences, as well as the frequency of vocabulary words, etc. I’ll need to gather feedback, though, to train a supervised learning algorithm that predicts actual WPM from these features!
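To make those features a little more concrete, here is a rough sketch that pulls paragraph and sentence statistics out of plain text. It is only an illustration under simplifying assumptions: paragraphs are separated by blank lines, sentences end in ".", "!" or "?", and text_features is a hypothetical name. It does not parse HTML or look at vocabulary frequency.

```python
import re


def text_features(text):
    """Extract simple structural features that might correlate with reading speed."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"\w+", text)
    return {
        "paragraphs": len(paragraphs),
        "words_per_paragraph": len(words) / len(paragraphs) if paragraphs else 0.0,
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }
```

Features like these, paired with measured reading times from real readers, would be the inputs to the supervised model described above.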