NLTK Corpus Reader for Extracted Corpus

Yesterday I wrote a blog about [extracting a corpus]({% post_url 2016-04-10-extract-ddl-corpus %}) from a directory containing Markdown, such as for a blog that is deployed with Silvrback or Jekyll. In this post, I’ll briefly show how to use the built in CorpusReader objects in nltk for streaming the data to the segmentation and tokenization preprocessing functions that are built into NLTK for performing analytics. The dataset that I’ll be working with is the District Data Labs Blog, in particular the state of the blog as of today. The dataset can be downloaded from the ddl corpus, which also has the code in this post for you to use to perform other analytics. ...

April 11, 2016 · 6 min · 1081 words · Benjamin Bengfort

Extracting the DDL Blog Corpus

We have some simple text analyses coming up and as an example, I thought it might be nice to use the DDL blog corpus as a data set. There are relatively few DDL blogs, but they all are long with a lot of significant text and discourse. It might be interesting to try to do some lightweight analysis on them. So, how to extract the corpus? The DDL blog is currently hosted on Silvrback which is designed for text-forward, distraction-free blogging. As a result, there isn’t a lot of cruft on the page. I considered doing a scraper that pulled the web pages down or using the RSS feed to do the data ingestion. After all, I wouldn’t have to do a lot of HTML cleaning. ...

April 10, 2016 · 2 min · 347 words · Benjamin Bengfort

Dispatching Types to Handler Methods

A while I ago, I discussed the [observer pattern]({% post_url 2016-02-16-observer-pattern %}) for dispatching events based on a series of registered callbacks. In this post, I take a look at a similar, but very different methodology for dispatching based on type with pre-assigned handlers. For me, this is actually the more common pattern because the observer pattern is usually implemented as an API to outsider code. On the other hand, this type of dispatcher is usually a programmer’s pattern, used for development and decoupling. ...

April 5, 2016 · 2 min · 426 words · Benjamin Bengfort

Class Variables

These snippets are just a short reminder of how class variables work in Python. I understand this topic a bit too well, I think; I always remember the gotchas and can’t remember which gotcha belongs to which important detail. I generally come up with the right answer then convince myself I’m wrong until I write a bit of code and experiment. Hopefully this snippet will shortcut that process. Consider the following class hierarchy: ...

April 4, 2016 · 3 min · 446 words · Benjamin Bengfort

Simple Password Generation

I was talking with @looselycoupled the other day about how we generate passwords for use on websites. We both agree that every single domain should have its own password (to prevent one crack ruling all your Internets). However, we’ve both evolved on the method over time, and I’ve written a simple script that allows me to generate passwords using methodologies discussed in this post. In particular I use the generator to create passwords for pwSafe, the tool I currently use for password management (due to its use of the open source database format created by Bruce Schneier). It is my hope that this script can be embedded directly into pwSafe, or at least allow me to write directly to the database; but for now I just copy and paste with the pbcopy utility. ...

March 30, 2016 · 5 min · 973 words · Benjamin Bengfort

Visualizing Pi with matplotlib

Happy Pi day! As is the tradition at the University of Maryland (and to a certain extent, in my family) we are celebrating March 14 with pie and Pi. A shoutout to @konstantinosx who, during last year’s Pi day, requested blueberry pie, which was the strangest pie request I’ve received for Pi day. Not that blueberry pie is strange, just that someone would want one so badly for Pi day (he got a mixed berry pie). ...

March 14, 2016 · 2 min · 409 words · Benjamin Bengfort

Adding a Git Commit to Header Comments

You may have seen the following type of header at the top of my source code: # main # short description # # Author: Benjamin Bengfort <benjamin@bengfort.com> # Created: Tue Mar 08 14:07:24 2016 -0500 # # Copyright (C) 2016 Bengfort.com # For license information, see LICENSE.txt # # ID: main.py [] benjamin@bengfort.com $ All of this is pretty self explanatory with the exception of the final line. This final line is a throw back to Subversion actually, when you could add a $Id$ tag to your code, and Subversion would automatically populate it with something that looks like: ...

March 8, 2016 · 3 min · 577 words · Benjamin Bengfort