Compression Benchmarks

One of the projects I’m currently working on is the ingestion of RSS feeds into a Mongo database. It’s been running for the past year, and as of this post has collected 1,575,987 posts for 373 feeds after 8,126 jobs. This equates to about 585GB of raw data, which makes compression a firm requirement for exchanging the data. Recently, @ojedatony1616 downloaded the compressed zip file (53GB) onto a 1TB external hard disk and attempted to decompress it. After three days, he tried to cancel the operation and ended up restarting his computer because it wouldn’t cancel. His approach was simply to double-click the file on OS X, but that got me thinking: it shouldn’t have taken that long, so why did it choke? Inspecting the export logs on the server, I noted that it took 137 minutes to compress the directory; shouldn’t it take about that long to decompress as well? ...

June 7, 2017 · 4 min · 841 words · Benjamin Bengfort
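
The compression question above (does decompression take about as long as compression?) can be explored on a small scale before touching a 53GB archive. The following is a minimal sketch, not the benchmark from the post: it assumes an in-memory payload and the standard library's zlib rather than the zip tooling used on the server, and `time_roundtrip` and the sample data are hypothetical names chosen for illustration.

```python
import time
import zlib


def time_roundtrip(payload: bytes, level: int = 6) -> dict:
    """Time one compress/decompress cycle of an in-memory payload."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    compress_secs = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_secs = time.perf_counter() - start

    # Sanity check: the round trip must reproduce the input exactly.
    if restored != payload:
        raise ValueError("round trip did not reproduce the original payload")

    return {
        "raw_bytes": len(payload),
        "compressed_bytes": len(compressed),
        "ratio": len(payload) / len(compressed),
        "compress_secs": compress_secs,
        "decompress_secs": decompress_secs,
    }


if __name__ == "__main__":
    # Hypothetical, highly repetitive sample; real feed data will compress differently.
    sample = b"<item><title>hypothetical post</title></item>\n" * 50_000
    for key, value in time_roundtrip(sample).items():
        print(f"{key}: {value}")
```

Timings from a sketch like this depend heavily on the data and on disk speed, so they are only a starting point for comparing compress and decompress costs.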

Decorating Nose Tests

I was introduced to an interesting problem today when decorating tests that need to be discovered by the nose runner. By default, nose explores a directory looking for things named test or tests and then executes those functions, classes, modules, etc. as tests. A standard test suite for me looks something like:

    import unittest

    class MyTests(unittest.TestCase):

        def test_undecorated(self):
            """
            assert undecorated works
            """
            self.assertEqual(2+2, 4)

The problem came up when we wanted to decorate a test with some extra functionality, for example loading a fixture: ...

May 22, 2017 · 1 min · 184 words · Benjamin Bengfort
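
Because the nose excerpt above says tests are discovered by name, a decorated test has to keep its `test_` name to be found. One common way to do that is `functools.wraps`; the sketch below assumes that approach and is not necessarily the solution the post arrives at. The `with_fixture` decorator and its dictionary fixture are hypothetical.

```python
import functools
import unittest


def with_fixture(fixture):
    """Hypothetical decorator: attach a copy of `fixture` to the test case."""
    def decorator(func):
        # functools.wraps copies __name__ and __doc__ onto the wrapper, so
        # name-based discovery still sees "test_decorated", not "wrapper".
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            self.fixture = dict(fixture)
            return func(self, *args, **kwargs)
        return wrapper
    return decorator


class MyTests(unittest.TestCase):

    def test_undecorated(self):
        """assert undecorated works"""
        self.assertEqual(2 + 2, 4)

    @with_fixture({"answer": 4})
    def test_decorated(self):
        """assert decorated works"""
        self.assertEqual(2 + 2, self.fixture["answer"])
```

Without the `functools.wraps` line, the method object would report the name `wrapper`, which is the kind of mismatch a name-based runner can trip over.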

In Process Cacheing

I have had some recent discussions regarding cacheing to improve application performance that I wanted to share. Most of the time those conversations go something like this: “have you heard of Redis?” I’m fascinated by the fact that an independent, distributed key-value store has won the market to this degree. However, as I’ve pointed out in these conversations, cacheing is a hierarchy (heck, even the processor has varying levels of cacheing). Especially when considering micro-service architectures that require extremely low latency responses, cacheing should be a critical part of the design, not just a bolt-on afterthought! ...

May 17, 2017 · 5 min · 877 words · Benjamin Bengfort
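
To illustrate the lowest level of the hierarchy described in the cacheing excerpt above, here is a minimal sketch of an in-process cache placed in front of a slower lookup. It assumes Python's `functools.lru_cache`, which may not be the mechanism the post uses; `slow_lookup`, `cached_lookup`, and the `CALLS` counter are hypothetical stand-ins for a call to a remote store.

```python
import functools

# Counts how often the "remote" lookup is actually performed.
CALLS = {"count": 0}


def slow_lookup(key: str) -> str:
    """Stand-in for a network round trip to an external key-value store."""
    CALLS["count"] += 1
    return key.upper()


@functools.lru_cache(maxsize=1024)
def cached_lookup(key: str) -> str:
    """Serve repeated keys from process memory; fall through on a miss."""
    return slow_lookup(key)
```

Repeated calls with the same key are answered from memory, and `cached_lookup.cache_info()` reports hits and misses. An in-process cache like this is private to one process, so invalidation and memory limits still need a deliberate design.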

Unique Values in Python: A Benchmark

An interesting question came up in the development of Yellowbrick: given a vector of values, what is the quickest way to get the unique values? OK, so maybe this isn’t a terribly interesting question; however, the results surprised us and may surprise you as well. First we’ll do a little background, then I’ll give the results and then discuss the benchmarking method. The problem comes up in Yellowbrick when we want to get the discrete values for a target vector, y, which is common in classification tasks. By getting the unique set of values we know the number of classes, as well as the class names. This information is necessary during visualization because it is vital in assigning colors to individual classes. Therefore in a Visualizer we might have a method as follows: ...

May 2, 2017 · 5 min · 963 words · Benjamin Bengfort
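
The unique-values question above lends itself to a small timing harness. The sketch below is an assumption-laden illustration, not the benchmark from the post: it uses only the standard library (`timeit`, `set`, `dict.fromkeys`) and does not reproduce whichever candidates or results the post reports. All function names are hypothetical.

```python
import timeit


def unique_set(values):
    """Unique values via a set, sorted for a stable order."""
    return sorted(set(values))


def unique_dict(values):
    """Unique values via dict keys, sorted for a stable order."""
    return sorted(dict.fromkeys(values))


def unique_loop(values):
    """Naive baseline: a linear scan of a list for every value."""
    seen = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return sorted(seen)


def benchmark(values, number=100):
    """Return total seconds for `number` runs of each candidate."""
    candidates = (unique_set, unique_dict, unique_loop)
    return {
        fn.__name__: timeit.timeit(lambda: fn(values), number=number)
        for fn in candidates
    }


if __name__ == "__main__":
    # Hypothetical target vector with three class labels.
    y = ["a", "b", "c"] * 10_000
    for name, seconds in benchmark(y).items():
        print(f"{name}: {seconds:.4f}s")
```

Relative timings depend on the vector's size and the number of distinct classes, so it is worth running the harness on data shaped like the real target vector.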

Measuring Throughput

Part of my research is taking me down a path where I want to measure the number of reads and writes from a client to a storage server. A key metric that we’re looking for is throughput — the number of accesses per second that a system supports. As I discovered in a very simple test to get some baseline metrics, even this simple metric can have some interesting complications. ...

April 28, 2017 · 5 min · 965 words · Benjamin Bengfort
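
Throughput as defined in the excerpt above, accesses per second, reduces to a count divided by elapsed time. The following is a minimal single-threaded sketch under that definition; it assumes an in-memory dictionary in place of a real storage server, and `measure_throughput` is a hypothetical helper, not code from the post.

```python
import time


def measure_throughput(operation, n_ops: int) -> float:
    """Run `operation(i)` n_ops times and return operations per second."""
    start = time.perf_counter()
    for i in range(n_ops):
        operation(i)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a coarse clock on a tiny run.
    if elapsed <= 0:
        return float("inf")
    return n_ops / elapsed


if __name__ == "__main__":
    store = {}  # stand-in for the storage server

    def write(i):
        store[i] = i

    print(f"{measure_throughput(write, 100_000):.0f} writes/sec")
```

A loop like this measures one client issuing requests back to back; with several concurrent clients, or with network latency in the path, the same simple ratio becomes harder to interpret.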

OAuth Tokens on the Command Line

This week I discovered I had a problem with my Google Calendar — events accidentally got duplicated or deleted and I needed a way to verify that my primary calendar was correct. Rather than painstakingly go through the web interface and spot check every event, I instead wrote a Go console program using the Google Calendar API to retrieve events and save them in a CSV so I could inspect them all at once. This was great, and very easy using Google’s Go libraries for their APIs, and the quick start was very handy. ...

April 20, 2017 · 4 min · 719 words · Benjamin Bengfort

Gmail Notifications with Python

I routinely have long-running scripts (e.g. for a data processing task) and want to know when they’re complete. It seems like it should be simple for me to add in a little snippet of code that will send an email using Gmail to notify me, right? Unfortunately, it isn’t quite that simple for a lot of reasons, including security, attachment handling, configuration, etc. In this snippet, I’ve attached my constantly copied-and-pasted notify() function, written into a command line script for easy sending on the command line. ...

April 17, 2017 · 2 min · 398 words · Benjamin Bengfort
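
The author's notify() function is not shown in the Gmail excerpt above, so the following is only a hedged sketch of the general idea using the standard library. It assumes Gmail's SMTP endpoint at smtp.gmail.com on port 465 over SSL and credentials supplied by the caller (typically an app-specific password); `build_message` and this version of `notify` are hypothetical and omit attachment handling.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Assemble a plain-text notification email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def notify(msg: EmailMessage, username: str, password: str,
           host: str = "smtp.gmail.com", port: int = 465) -> None:
    """Send the message over an SSL connection (assumed Gmail defaults)."""
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(username, password)
        server.send_message(msg)
```

Keeping message construction separate from sending makes the first half easy to check without network access; credentials are better read from the environment or a config file than hard-coded in the script.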