Ashley and I have been going over the District Data Labs Blog trying to figure out a method to make it more accessible both to readers (who are at various levels) and to encourage writers to contribute. To that end, she’s been exploring other blogs to see if we can put multiple forms of content up; long form tutorials (the bulk of what’s there) and shorter idea articles, possibly even as short as the posts I put on my dev journal. One interesting suggestion she had was to mark the reading time of each post, something that the Longreads Blog does. This may help give readers a better sense of the time committment and be able to engage more easily.
So computing the reading time is simple right? Take the number of words in the post divided by the average words per minute reading rate and bam - the number of minutes per post. Also, we’re not going to simply split on space, we know better - so we can use NLTK’s
word_tokenize function. Seems like we’re good to go, but what’s the average words per minute reading rate of the average DDL reader?
After a bit of a search, we first found a study published by Reading Plus that charted the normal reading read in words per minute against high school grade level. Unfortunately, this led to the question, what level is our content at? Further searching found an LSAT reading speed calculation formula by Graeme Blake, moderator of the Reddit LSAT forum. We figured our content is probably as complex as the LSAT, and moreover, he gave speeds for slow, average, high average, fast, and rare LSAT students.
We ran each of these WPM speeds against published articles in the DDL corpus and came up with the following words per minute for each title:
|Announcing the District Data Labs Blog||26 seconds||23 seconds||18 seconds||15 seconds|
|How to Transition from Excel to R||12 minutes||11 minutes||9 minutes||7 minutes|
|What Are the Odds?||12 minutes||10 minutes||8 minutes||7 minutes|
|How to Develop Quality Python Code||28 minutes||25 minutes||20 minutes||17 minutes|
|Markup for Fast Data Science Publication||16 minutes||14 minutes||11 minutes||9 minutes|
|The Age of the Data Product||27 minutes||24 minutes||19 minutes||16 minutes|
|A Practical Guide to Anonymizing Datasets with Python & Faker||19 minutes||17 minutes||14 minutes||11 minutes|
|Computing a Bayesian Estimate of Star Rating Means||19 minutes||17 minutes||14 minutes||11 minutes|
|Conditional Probability with R||12 minutes||11 minutes||9 minutes||7 minutes|
|Creating a Hadoop Pseudo-Distributed Environment||13 minutes||12 minutes||10 minutes||8 minutes|
|Getting Started with Spark (in Python)||32 minutes||29 minutes||23 minutes||19 minutes|
|Graph Analytics Over Relational Datasets with Python||11 minutes||10 minutes||8 minutes||7 minutes|
|An Introduction to Machine Learning with Python||18 minutes||16 minutes||13 minutes||11 minutes|
|Modern Methods for Sentiment Analysis||12 minutes||11 minutes||9 minutes||7 minutes|
|Parameter Tuning with Hyperopt||12 minutes||11 minutes||9 minutes||7 minutes|
|Simple CSV Data Wrangling with Python||18 minutes||16 minutes||13 minutes||11 minutes|
|Time Maps: Visualizing Discrete Events Across Many Timescales||10 minutes||9 minutes||7 minutes||6 minutes|
We’d be happy to have any feedback on if these times look correct or not. The code to produce the table follows:
Of course this is a straight count of words and does not take into account the number of sections or whether or not there are any code blocks. In the future, I hope to do an HTML version of this that takes into account the number of paragraphs, the density of each paragraph and the length of sentences, as well as the frequency of vocabulary words etc. I’ll need to gather feedback for a supervised learning algorithm though to train actual WPM on these features!