Posts

Pretty Print Directories

It feels like there are many questions like this one on Stack Overflow: Representing Directory & File Structure in Markdown Syntax, basically asking “how can we represent a directory structure in text in a pleasant way?” I too use these types of text representations in slides, blog posts, books, etc. It would be very helpful if I had an automatic way of doing this so I didn’t have to create it from scratch. ...
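As a rough illustration of what such an automatic approach could look like, here is a minimal sketch that walks a directory and emits one text line per entry using box-drawing connectors. The function names `tree_lines` and `render_tree` are hypothetical, chosen for this example, and are not taken from the post itself.

```python
import os


def tree_lines(root, prefix=""):
    """Yield one text line per entry beneath `root`, depth first."""
    entries = sorted(os.listdir(root))
    for index, name in enumerate(entries):
        last = index == len(entries) - 1
        # The last entry in a directory gets a corner, the others a tee.
        connector = "└── " if last else "├── "
        yield prefix + connector + name
        path = os.path.join(root, name)
        if os.path.isdir(path):
            # Keep the vertical rule going only if siblings follow.
            extension = "    " if last else "│   "
            yield from tree_lines(path, prefix + extension)


def render_tree(root):
    """Return the whole tree as one string, headed by the root's name."""
    header = os.path.basename(os.path.abspath(root))
    return "\n".join([header] + list(tree_lines(root)))
```

Calling `print(render_tree("."))` would print the tree for the current directory; sorting the entries keeps the output stable between runs.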

Interview - Ben Bengfort of District Data Labs

We talk to Benjamin Bengfort about his Data Day Seattle talks, District Data Labs, and Ben’s popular O’Reilly books.

Visualizing the Model Selection Process

Machine learning is the hacker art of describing the features of instances that we want to make predictions about, then fitting the data that describes those instances to a model form. Applied machine learning has come a long way from its beginnings in academia, and with tools like Scikit-Learn, it’s easier than ever to generate operational models for a wide variety of applications. Thanks to the ease and variety of the tools in Scikit-Learn, the primary job of the data scientist is model selection. Model selection involves performing feature engineering, hyperparameter tuning, and algorithm selection. These dimensions of machine learning often lead computer scientists towards automatic model selection via optimization (maximization) of a model’s evaluation metric. However, the search space is large, and grid search approaches to machine learning can easily lead to failure and frustration. Human intuition is still essential to machine learning, and visual analysis in concert with automatic methods can allow data scientists to steer model selection towards better-fitted models, faster. In this talk, we will discuss interactive visual methods for better understanding, steering, and tuning machine learning models. ...

Data Product Architectures: Seattle Data Day

Data products derive their value from data and generate new data in return; as a result, machine learning techniques must be applied to their architecture and their development. Machine learning fits models to make predictions on unknown inputs and must be generalizable and adaptable. As such, fitted models cannot exist in isolation; they must be operationalized and user-facing so that applications can benefit from the new data, respond to it, and feed it back into the data product. Data product architectures are therefore life cycles, and understanding the data product life cycle will enable architects to develop robust, failure-free workflows and applications. In this talk we will discuss the data product life cycle and explore how to engage a model build, evaluation, and selection phase with an operation and interaction phase. Following the lambda architecture, we will investigate wrapping a central computational store for speed and querying, as well as incorporating a discussion of monitoring, management, and data exploration for hypothesis-driven development. From web applications to big data appliances, this architecture serves as a blueprint for handling data services of all sizes! ...

Color Map Utility

Many of us are spoiled by matplotlib’s colormaps, which allow you to specify a string or object name of a color map (e.g. Blues), then simply pass in a range of nearly continuous values that are spread along the color map. However, using these color maps for categorical or discrete values (like the colors of nodes) can pose challenges, as the colors may not be distinct enough for the representation you’re looking for. ...
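One way to get clearly separated colors for categories is to space hues evenly around the color wheel rather than sampling a continuous colormap. The sketch below does this with the standard library's `colorsys` module. It is a swapped-in technique for illustration, not matplotlib's colormap machinery and not necessarily the utility the post describes; the function name `distinct_colors` and its default saturation and value are assumptions.

```python
import colorsys


def distinct_colors(n, saturation=0.65, value=0.9):
    """Return `n` hex color strings with evenly spaced hues."""
    colors = []
    for i in range(n):
        # Hue runs over [0, 1); dividing by n avoids repeating the first hue.
        r, g, b = colorsys.hsv_to_rgb(i / n, saturation, value)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(
                round(r * 255), round(g * 255), round(b * 255)
            )
        )
    return colors


# Hypothetical usage: one color per node category.
categories = ["input", "hidden", "output"]
palette = dict(zip(categories, distinct_colors(len(categories))))
```

Even hue spacing works well for a handful of categories; with many categories, neighboring hues become hard to tell apart and varying saturation or value as well may help.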

Visualizing Normal Distributions

Normal distributions are the backbone of random number generation for simulation. By selecting a mean (μ) and standard deviation (σ) you can generate simulated data representative of the types of models you’re trying to build (and certainly better than simple uniform random number generators). However, you might already be able to tell that selecting μ and σ is a little backward! Typically these metrics are computed from data, not used to describe data. As a result, utilities for tuning the behavior of your random number generators are simply not discussed. ...
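To make the "backward" direction concrete, the following sketch picks μ and σ first, draws samples with the standard library's `random.gauss`, and then computes the sample mean and standard deviation to check how closely the simulated data matches the chosen parameters. The helper names `simulate_normal` and `summarize` are hypothetical and are not from the post.

```python
import random
import statistics


def simulate_normal(mu, sigma, n, seed=None):
    """Draw `n` samples from a normal distribution with the given mu and sigma."""
    rng = random.Random(seed)  # a private generator keeps runs reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


def summarize(samples):
    """Return the sample mean and sample standard deviation."""
    return statistics.mean(samples), statistics.stdev(samples)


# Choose the parameters first, then see what data they produce.
samples = simulate_normal(mu=10.0, sigma=2.0, n=10_000, seed=42)
mean, stdev = summarize(samples)
```

With enough samples the computed mean and standard deviation should land near the chosen μ and σ, which gives a quick sanity check when tuning a generator for a simulation.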

Background Work with Goroutines on a Timer

As I’m moving deeper into my PhD, I’m getting into more Go programming for the systems that I’m building. One thing that I’m constantly doing is trying to create a background process that runs forever and does some work at an interval. Concurrency in Go is native, and therefore the use of threads and parallel processing is very simple, syntax-wise. However, I am still running into problems whose solutions I wanted to make sure I recorded here. ...