Ashley and I have been going over the District Data Labs Blog trying to figure out a method to make it more accessible both to readers (who are at various levels) and to encourage writers to contribute. To that end, she’s been exploring other blogs to see if we can put up multiple forms of content: long-form tutorials (the bulk of what’s there) and shorter idea articles, possibly even as short as the posts I put on my dev journal. One interesting suggestion she had was to mark the reading time of each post, something that the Longreads Blog does. This may help give readers a better sense of the time commitment and let them engage more easily.

So computing the reading time is simple, right? Take the number of words in the post divided by the average words per minute reading rate and bam - the number of minutes per post. Also, we’re not going to simply split on whitespace, we know better - so we can use NLTK’s word_tokenize function. Seems like we’re good to go, but what’s the average words per minute reading rate of the average DDL reader?
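In sketch form, that calculation is just a few lines (the 250 WPM default below is a placeholder, not a measured rate for our readers):

```python
# Minimal sketch: tokenize the post with NLTK and divide the word count by an
# assumed words-per-minute rate. The 250 WPM default is only a placeholder.
# Requires the 'punkt' tokenizer data: nltk.download('punkt')
import nltk


def reading_time(text, wpm=250):
    """Estimate the reading time of text in minutes at the given WPM rate."""
    words = nltk.word_tokenize(text)  # smarter than splitting on whitespace
    return len(words) / float(wpm)
```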

After a bit of a search, we first found a study published by Reading Plus that charted the normal reading rate in words per minute against high school grade level. Unfortunately, this led to the question: what level is our content at? Further searching found an LSAT reading speed calculation formula by Graeme Blake, moderator of the Reddit LSAT forum. We figured our content is probably as complex as the LSAT, and moreover, he gave speeds for slow, average, high average, fast, and rare LSAT students.

We ran each of these WPM speeds against published articles in the DDL corpus and came up with the following reading times for each title:

| Post | LSAT | Slow | Average | Fast |
|------|------|------|---------|------|
| Announcing the District Data Labs Blog | 26 seconds | 23 seconds | 18 seconds | 15 seconds |
| How to Transition from Excel to R | 12 minutes | 11 minutes | 9 minutes | 7 minutes |
| What Are the Odds? | 12 minutes | 10 minutes | 8 minutes | 7 minutes |
| How to Develop Quality Python Code | 28 minutes | 25 minutes | 20 minutes | 17 minutes |
| Markup for Fast Data Science Publication | 16 minutes | 14 minutes | 11 minutes | 9 minutes |
| The Age of the Data Product | 27 minutes | 24 minutes | 19 minutes | 16 minutes |
| A Practical Guide to Anonymizing Datasets with Python & Faker | 19 minutes | 17 minutes | 14 minutes | 11 minutes |
| Computing a Bayesian Estimate of Star Rating Means | 19 minutes | 17 minutes | 14 minutes | 11 minutes |
| Conditional Probability with R | 12 minutes | 11 minutes | 9 minutes | 7 minutes |
| Creating a Hadoop Pseudo-Distributed Environment | 13 minutes | 12 minutes | 10 minutes | 8 minutes |
| Getting Started with Spark (in Python) | 32 minutes | 29 minutes | 23 minutes | 19 minutes |
| Graph Analytics Over Relational Datasets with Python | 11 minutes | 10 minutes | 8 minutes | 7 minutes |
| An Introduction to Machine Learning with Python | 18 minutes | 16 minutes | 13 minutes | 11 minutes |
| Modern Methods for Sentiment Analysis | 12 minutes | 11 minutes | 9 minutes | 7 minutes |
| Parameter Tuning with Hyperopt | 12 minutes | 11 minutes | 9 minutes | 7 minutes |
| Simple CSV Data Wrangling with Python | 18 minutes | 16 minutes | 13 minutes | 11 minutes |
| Time Maps: Visualizing Discrete Events Across Many Timescales | 10 minutes | 9 minutes | 7 minutes | 6 minutes |

We’d be happy to have any feedback on whether these times look correct or not. The code to produce the table follows:
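Roughly, it boils down to a loop like the sketch below; the WPM rates here are placeholders rather than the published figures, and the posts are assumed to live as plain-text files in a `corpus/` directory:

```python
# Sketch of the table computation. The rates are placeholders standing in for
# the slow/average/fast figures from the sources above, and the corpus is
# assumed to be a directory of plain-text post files.
import os
import nltk

# Placeholder reading rates in words per minute (not the published figures).
RATES = {
    "LSAT": 150,
    "Slow": 170,
    "Average": 210,
    "Fast": 250,
}

CORPUS = "corpus"  # assumed directory of plain-text posts


def word_count(path):
    """Count the tokens in a post using NLTK's tokenizer."""
    with open(path, "r") as f:
        return len(nltk.word_tokenize(f.read()))


def humanize(minutes):
    """Render a reading time as whole minutes, or seconds for short posts."""
    if minutes < 1:
        return "{} seconds".format(int(round(minutes * 60)))
    return "{} minutes".format(int(round(minutes)))


if __name__ == "__main__":
    for name in sorted(os.listdir(CORPUS)):
        count = word_count(os.path.join(CORPUS, name))
        times = [humanize(count / float(wpm)) for wpm in RATES.values()]
        print("{}: {}".format(name, ", ".join(times)))
```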

Of course, this is a straight count of words and does not take into account the number of sections or whether there are any code blocks. In the future, I hope to do an HTML version of this that takes into account the number of paragraphs, the density of each paragraph, and the length of sentences, as well as the frequency of vocabulary words, etc. I’ll need to gather reader feedback to train a supervised learning algorithm on actual WPM for these features, though!
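As a rough sketch of the kind of features I have in mind (the BeautifulSoup parsing and the exact feature names here are assumptions, not a settled design):

```python
# Rough sketch of extracting structural features from a post's HTML that
# could feed a supervised model predicting actual WPM. The parsing approach
# and feature set are illustrative assumptions only.
import nltk
from bs4 import BeautifulSoup


def extract_features(html):
    """Compute document features that might influence reading speed."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p")
    code_blocks = soup.find_all("pre")
    sections = soup.find_all(["h1", "h2", "h3"])

    text = " ".join(p.get_text() for p in paragraphs)
    sentences = nltk.sent_tokenize(text)
    words = nltk.word_tokenize(text)

    return {
        "num_paragraphs": len(paragraphs),
        "num_code_blocks": len(code_blocks),
        "num_sections": len(sections),
        "avg_words_per_paragraph": len(words) / float(len(paragraphs) or 1),
        "avg_sentence_length": len(words) / float(len(sentences) or 1),
        "vocabulary_ratio": len(set(w.lower() for w in words)) / float(len(words) or 1),
    }
```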