Anonymizing User Profile Data with Faker

This post is an early draft of expanded work that will eventually appear on the District Data Labs Blog. Your feedback is welcome, and you can submit your comments on the draft GitHub issue. In order to learn (or teach) data science you need data (surprise!). The best libraries often come with a toy dataset to show examples and how the code works. However, nothing can replace an actual, non-trivial dataset for a tutorial or lesson because it provides for deep and meaningful further exploration. Non-trivial datasets can provide surprise and intuition in a way that toy datasets just cannot. Unfortunately, non-trivial datasets can be hard to find for a few reasons, but one common reason is that the dataset contains personally identifying information (PII). ...

February 25, 2016 · 21 min · 4466 words · Benjamin Bengfort

Implementing the Observer Pattern with an Event System

I was looking back through some old code (hoping to find a quick post before I got back to work) when I ran across a project I worked on called Mortar. Mortar was a simple daemon that ran in the background and watched a particular directory. When a file was added or removed from that directory, Mortar would notify other services or perform some other task (e.g. if it was integrated into a library). At the time, we used Mortar to keep an eye on FTP directories, and when a file was uploaded Mortar would move it to a staging directory based on who uploaded it, then do some work on the file. ...
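The kind of notification described above can be sketched as a tiny event dispatcher. This is only an illustration, not Mortar's actual code: the `EventDispatcher` class, its method names, and the `"file_added"` / `"file_removed"` event names are all assumptions made for the example.

```python
from collections import defaultdict


class EventDispatcher:
    """Hypothetical observer registry: maps event names to handler callables."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event, handler):
        # Register a callable to be invoked whenever `event` is dispatched.
        self._handlers[event].append(handler)

    def unsubscribe(self, event, handler):
        # Raises ValueError if the handler was never registered for this event.
        self._handlers[event].remove(handler)

    def notify(self, event, *args, **kwargs):
        # Iterate over a copy so handlers may unsubscribe while being notified.
        for handler in list(self._handlers[event]):
            handler(*args, **kwargs)


if __name__ == "__main__":
    dispatcher = EventDispatcher()
    dispatcher.subscribe("file_added", lambda path: print("staging", path))
    dispatcher.notify("file_added", "upload.csv")
```

A directory watcher would call `notify` whenever it detects a change, so the code that reacts to a file never needs to know how the change was detected.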

February 16, 2016 · 3 min · 621 words · Benjamin Bengfort

Running on Schedule

Automation with Python is a lovely thing, particularly for very repetitive or long running tasks; but unfortunately someone still has to press the button to make it go. It feels like there should be an easy way to set up a program such that it runs routinely, in the background, without much human intervention. Daemonized services are the route to go in server land; but how do you routinely schedule a process to run on your local computer, which may or may not be turned off? Moreover, long running daemon processes seem expensive when you just want a quick job to execute routinely. ...

February 10, 2016 · 10 min · 2079 words · Benjamin Bengfort

Iterators and Generators

This post is an attempt to explain what iterators and generators are in Python, defend the yield statement, and reveal why a library like SimPy is possible. But first some terminology (that specifically targets my friends who Java). Iteration is a syntactic construct that implements a loop over an iterable object. The for statement provides iteration; the while statement may provide iteration. An iterable object is something that implements the iteration protocol (Java folks, read interface). A generator is a function that produces a sequence of results instead of a single value and is designed to make writing iterable objects easier. ...
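As a minimal sketch of the terminology above, here is the same countdown written twice: once as a class that implements the iteration protocol by hand, and once as a generator. The names `Countdown` and `countdown` are made up for this illustration.

```python
class Countdown:
    """Iterable that implements the iteration protocol explicitly."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        # Signal exhaustion to the for statement with StopIteration.
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


def countdown(start):
    """Generator version: yield hands back one value at a time."""
    while start > 0:
        yield start
        start -= 1


if __name__ == "__main__":
    for number in Countdown(3):
        print(number)
    for number in countdown(3):
        print(number)
```

Both forms work with the for statement; the generator simply lets Python write `__iter__` and `__next__` for you.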

February 5, 2016 · 6 min · 1104 words · Benjamin Bengfort

On Interval Calls with Threading

Event driven programming can be a wonderful thing, particularly when the execution of your code is dependent on user input. It is for this reason that JavaScript and other user facing languages implement very strong event based semantics. Many times event driven semantics depends on elapsed time (e.g. wait then execute). Python, however, does not provide a native setTimeout or setInterval that will allow you to call a function after a specific amount of time, or to call a function again and again at a specific interval. ...
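One way to approximate those two JavaScript functions uses only the standard `threading` module. This is a rough sketch under that assumption, not necessarily the approach the full post takes; the names `set_timeout` and `set_interval` are chosen here to mirror their JavaScript counterparts.

```python
import threading


def set_timeout(func, delay, *args, **kwargs):
    """Call func once after `delay` seconds; returns the Timer so it can be cancelled."""
    timer = threading.Timer(delay, func, args=args, kwargs=kwargs)
    timer.daemon = True  # do not keep the interpreter alive just for this timer
    timer.start()
    return timer


def set_interval(func, interval, *args, **kwargs):
    """Call func every `interval` seconds until the returned Event is set."""
    stopped = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stopped is set,
        # so the loop exits promptly when the caller asks it to stop.
        while not stopped.wait(interval):
            func(*args, **kwargs)

    threading.Thread(target=loop, daemon=True).start()
    return stopped


if __name__ == "__main__":
    import time

    stop = set_interval(print, 0.5, "tick")
    set_timeout(stop.set, 2.0)
    time.sleep(2.5)
```

Calling `.set()` on the returned event plays the role of `clearInterval`, and `.cancel()` on the returned timer plays the role of `clearTimeout`. Note that the interval drifts by however long `func` takes to run.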

February 2, 2016 · 3 min · 607 words · Benjamin Bengfort

Timeline Visualization with Matplotlib

Several times it’s come up that I’ve needed to visualize a time sequence for a collection of events across multiple sources. Unlike the points in a normal time series (e.g. a stock market series, which pairs each time with a price), events don’t necessarily have a magnitude. Events simply have times, and possibly types. A one dimensional number line is still interesting in this case, because the frequency or density of events reveals patterns that might not easily be analyzed with non-visual methods. Moreover, if you have multiple sources, overlaying a timeline on each can show which is busier and when, and possibly also demonstrate some effect or causality. ...

January 28, 2016 · 2 min · 245 words · Benjamin Bengfort

Building a Console Utility with Commis

Applications like Git or Django’s management utility provide a rich interaction between a software library and its users by exposing many subcommands from a single root command. This style of what is essentially better argument parsing simplifies the user experience by forcing users to remember only one primary command, and allows them to explore the utility hierarchy using --help and other visibility mechanisms. Moreover, it allows the utility writer to decouple different commands or actions from each other. ...
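The subcommand style described above can be illustrated with the standard library's `argparse`, used here as a stand-in because this excerpt does not show the Commis API. The program name `tool` and the `greet` and `add` subcommands are hypothetical.

```python
import argparse


def build_parser():
    """Build a root command with two independent, hypothetical subcommands."""
    parser = argparse.ArgumentParser(prog="tool", description="example utility")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Each subcommand declares its own arguments and its own handler,
    # which keeps the actions decoupled from one another.
    greet = subparsers.add_parser("greet", help="print a greeting")
    greet.add_argument("name")
    greet.set_defaults(func=lambda args: f"Hello, {args.name}!")

    add = subparsers.add_parser("add", help="sum a list of integers")
    add.add_argument("numbers", type=int, nargs="+")
    add.set_defaults(func=lambda args: sum(args.numbers))

    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and dispatch to the chosen handler."""
    args = build_parser().parse_args(argv)
    return args.func(args)


if __name__ == "__main__":
    print(main())
```

Running `tool --help` lists the subcommands, and `tool greet --help` shows the options for just that command.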

January 23, 2016 · 9 min · 1753 words · Benjamin Bengfort