Text Classification with NLTK and Scikit-Learn

This post is an early draft of expanded work that will eventually appear on the District Data Labs Blog. Your feedback is welcome, and you can submit your comments on the draft GitHub issue. I’ve often been asked which is better for text processing, NLTK or Scikit-Learn (and sometimes Gensim). The answer is that I use all three tools on a regular basis, but I often have trouble mixing and matching them or combining them in meaningful ways. In this post, I want to show how I use NLTK for preprocessing and tokenization, and then apply machine learning techniques (e.g. building a linear SVM using stochastic gradient descent) with Scikit-Learn. In a follow-on post, I’ll talk about vectorizing text with word2vec for machine learning in Scikit-Learn. ...
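As a rough sketch of that division of labor (not the post’s actual code), the example below uses NLTK for tokenization and lemmatization inside a Scikit-Learn pipeline that vectorizes the text and trains a linear SVM with SGD; the tokenize function and toy documents are illustrative assumptions.

```python
# A minimal sketch, not the post's code: NLTK tokenizes/lemmatizes, Scikit-Learn
# vectorizes and trains a linear SVM with stochastic gradient descent.
# Requires nltk.download('punkt') and nltk.download('wordnet').
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

lemmatizer = WordNetLemmatizer()

def tokenize(text):
    # NLTK handles the linguistic preprocessing: tokenization and lemmatization.
    return [lemmatizer.lemmatize(token.lower()) for token in word_tokenize(text)]

model = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=tokenize)),
    ("clf", SGDClassifier(loss="hinge")),   # hinge loss gives a linear SVM trained by SGD
])

# Toy, made-up documents and labels purely for illustration.
docs = ["the plot was gripping", "the plot was dull",
        "gripping characters", "dull dialogue"]
labels = ["pos", "neg", "pos", "neg"]

model.fit(docs, labels)
print(model.predict(["a gripping story"]))
```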

May 19, 2016 · 11 min · 2208 words · Benjamin Bengfort

Creating a Microservice in Go

Yesterday I built my first microservice (a RESTful API) using Go, and I wanted to collect a few of my thoughts on the experience here before I forgot them. The project, Scribo, is intended to aid in my research by collecting data about a specific network that I’m looking to build distributed systems for. I do have something running; it will need to evolve a lot, but it could be helpful to know where it started. ...

May 11, 2016 · 6 min · 1139 words · Benjamin Bengfort

Extracting Diffs from Git with Python

One of the first steps in analyzing Git repositories is extracting the changes over time, e.g. the Git log. This seems like it should be a very simple thing to do, since visualizations on GitHub and elsewhere show file change analyses through history on a commit-by-commit basis. Moreover, the GitPython library gives you direct, scriptable access to Git repositories. Unfortunately, things aren’t quite that simple, so I present a snippet for extracting change information from a Repository. ...
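An illustrative sketch of what such a snippet might look like (not the code from the post) uses GitPython’s Repo.iter_commits and per-commit stats to yield one change record per file; the repository path is a placeholder.

```python
# Illustrative only: walk every commit and yield per-file change statistics.
from datetime import datetime
from git import Repo   # pip install GitPython

def extract_changes(path):
    """Yield one record per file changed in each commit of the repository."""
    repo = Repo(path)
    for commit in repo.iter_commits():
        for fname, stats in commit.stats.files.items():
            yield {
                "commit": commit.hexsha,
                "author": commit.author.name,
                "date": datetime.fromtimestamp(commit.committed_date),
                "file": fname,
                "insertions": stats["insertions"],
                "deletions": stats["deletions"],
            }

if __name__ == "__main__":
    for change in extract_changes("path/to/repo"):   # placeholder path
        print(change)
```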

May 6, 2016 · 2 min · 287 words · Benjamin Bengfort

Visualizing Distributed Systems

As I’ve dug into my distributed systems research, one question keeps coming up: “How do you visualize distributed systems?” Distributed systems are hard, so it feels like being able to visualize the data flow would go a long way to understanding them in detail and avoiding bugs. Unfortunately, the same things that make architecting distributed systems difficult also make them hard to visualize. I don’t have an answer to this question, unfortunately. However, in this post I’d like to state my requirements and highlight some visualizations that I think are important. Hopefully this will be the start of a more complete investigation or at least allow others to comment on what they’re doing and whether or not visualization is important. ...

April 26, 2016 · 5 min · 1020 words · Benjamin Bengfort

Scikit-Learn Data Management: Bunches

One large issue I encounter in machine learning development is the need to structure our data on disk in a way that we can load into Scikit-Learn in a repeatable fashion for continued analysis. My proposal is to use the sklearn.datasets.base.Bunch object to load the data into data and target attributes, similar to how Scikit-Learn’s toy datasets are structured. Using this object to manage our data will mirror the native API and allow us to easily copy and paste code that demonstrates classifiers and techniques with the built-in datasets. Importantly, this API will also allow us to communicate to other developers and our future selves exactly how to use the data. ...
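A minimal sketch of such a loader, assuming a hypothetical on-disk layout with one subdirectory per category and one text file per document (the layout, path, and function name are illustrative, not from the post):

```python
# Hypothetical corpus loader that returns a Bunch mirroring the toy datasets.
import os
from sklearn.datasets.base import Bunch   # sklearn.utils.Bunch in newer releases

def load_corpus(root):
    data, target, target_names = [], [], []
    for label in sorted(os.listdir(root)):
        category = os.path.join(root, label)
        if not os.path.isdir(category):
            continue
        target_names.append(label)
        for name in os.listdir(category):
            with open(os.path.join(category, name)) as f:
                data.append(f.read())
            target.append(len(target_names) - 1)

    # Expose the data and target attributes just like the built-in datasets.
    return Bunch(data=data, target=target, target_names=target_names,
                 DESCR="corpus loaded from {}".format(root))

dataset = load_corpus("corpus/")           # hypothetical path
print(len(dataset.data), dataset.target_names)
```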

April 19, 2016 · 5 min · 1020 words · Benjamin Bengfort

Lessons in Discrete Event Simulation

Part of my research involves the creation of large-scale distributed systems, and while we do build and deploy these systems, we find that simulating them for development and research gives us an advantage when trying out new ideas. To that end, I employ discrete event simulation (DES) using Python’s SimPy library to build very large simulations of distributed systems, such as the one I’ve built to inspect consistency patterns in variable-latency, heterogeneous, partition-prone networks: CloudScope. ...
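For readers new to SimPy, here is a tiny illustrative simulation (not CloudScope) showing the basic DES pattern: processes are generators that yield timeout events to an environment, which advances simulated time; the node names and latencies are made up.

```python
# A toy discrete event simulation of nodes emitting messages at random intervals.
import random
import simpy

def replica(env, name, mean_delay):
    """A process standing in for a node that sends messages at random intervals."""
    while True:
        yield env.timeout(random.expovariate(1.0 / mean_delay))
        print("{:6.2f}  {} sends a message".format(env.now, name))

env = simpy.Environment()
for i, delay in enumerate([5, 10, 20]):    # made-up heterogeneous latencies
    env.process(replica(env, "replica-{}".format(i), delay))
env.run(until=50)                          # run for 50 simulated time units
```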

April 15, 2016 · 4 min · 823 words · Benjamin Bengfort

NLTK Corpus Reader for Extracted Corpus

Yesterday I wrote a blog post about extracting a corpus from a directory containing Markdown, such as for a blog that is deployed with Silvrback or Jekyll. In this post, I’ll briefly show how to use the built-in CorpusReader objects in NLTK to stream the data to the segmentation and tokenization preprocessing functions used for analytics. The dataset I’ll be working with is the District Data Labs Blog, in particular the state of the blog as of today. The dataset can be downloaded from the ddl corpus, which also contains the code from this post so you can perform other analytics. ...
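A minimal sketch of the idea, assuming the extracted posts live in a local corpus/ directory (the path and file pattern are assumptions): NLTK’s PlaintextCorpusReader streams the files lazily and applies the library’s built-in sentence segmentation and word tokenization.

```python
# Illustrative only: point a plaintext corpus reader at the extracted posts.
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

CORPUS_ROOT = "corpus/"                     # hypothetical path to the extracted posts
corpus = PlaintextCorpusReader(CORPUS_ROOT, r".*\.md")

print(len(corpus.fileids()))                # number of documents
print(len(corpus.sents()))                  # segmented sentences
print(len(corpus.words()))                  # word tokens
```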

April 11, 2016 · 6 min · 1081 words · Benjamin Bengfort