Visualizing the Model Selection Process

Description Machine learning is the hacker art of describing the features of instances that we want to make predictions about, then fitting the data that describes those instances to a model form. Applied machine learning has come a long way from its beginnings in academia, and with tools like Scikit-Learn, it’s easier than ever to generate operational models for a wide variety of applications. Thanks to the ease and variety of the tools in Scikit-Learn, the primary job of the data scientist is model selection. Model selection involves performing feature engineering, hyperparameter tuning, and algorithm selection. These dimensions of machine learning often lead computer scientists towards automatic model selection via optimization (maximization) of a model’s evaluation metric. However, the search space is large, and grid search approaches to machine learning can easily lead to failure and frustration. Human intuition is still essential to machine learning, and visual analysis in concert with automatic methods can allow data scientists to steer model selection towards better-fitted models, faster. In this talk, we will discuss interactive visual methods for better understanding, steering, and tuning machine learning models. ...

July 23, 2016 · 1 min · 184 words · Benjamin Bengfort

Data Product Architectures: Seattle Data Day

Description Data products derive their value from data and generate new data in return; as a result, machine learning techniques must be applied to their architecture and their development. Machine learning fits models to make predictions on unknown inputs and must be generalizable and adaptable. As such, fitted models cannot exist in isolation; they must be operationalized and user-facing so that applications can benefit from the new data, respond to it, and feed it back into the data product. Data product architectures are therefore life cycles, and understanding the data product life cycle will enable architects to develop robust, failure-free workflows and applications. In this talk we will discuss the data product life cycle and explore how to engage a model build, evaluation, and selection phase with an operation and interaction phase. Following the lambda architecture, we will investigate wrapping a central computational store for speed and querying, as well as incorporating a discussion of monitoring, management, and data exploration for hypothesis-driven development. From web applications to big data appliances, this architecture serves as a blueprint for handling data services of all sizes! ...

July 21, 2016 · 1 min · 186 words · Benjamin Bengfort

Color Map Utility

Many of us are spoiled by the use of matplotlib’s colormaps, which allow you to specify a string or object name of a color map (e.g. Blues) and then simply pass in a range of nearly continuous values, which are spread along the color map. However, using these color maps for categorical or discrete values (like the colors of nodes) can pose challenges, as the colors may not be distinct enough for the representation you’re looking for. ...
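The excerpt stops before the post's own utility appears, so the following is only a rough sketch of the general idea: for categorical values, pick colors that are spaced far apart instead of sampling neighbors on a continuous map. It uses the standard library's `colorsys` module rather than matplotlib, and the helper name `distinct_colors` and its default saturation and value are assumptions made for illustration, not the author's code.

```python
import colorsys


def distinct_colors(n, saturation=0.65, value=0.9):
    """Return n hex color strings with evenly spaced hues.

    Hypothetical helper: spreads hues around the HSV color wheel so that
    each category gets a visibly different color.
    """
    if n <= 0:
        return []
    colors = []
    for i in range(n):
        # Hue i/n walks once around the color wheel in n equal steps.
        r, g, b = colorsys.hsv_to_rgb(i / n, saturation, value)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(
                round(r * 255), round(g * 255), round(b * 255)
            )
        )
    return colors


# One color per category, e.g. five groups of nodes in a graph.
node_colors = distinct_colors(5)
```

Hex strings like these are accepted anywhere matplotlib takes a color, so the list can be indexed by category id when drawing nodes.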

July 15, 2016 · 3 min · 502 words · Benjamin Bengfort

Visualizing Normal Distributions

Normal distributions are the backbone of random number generation for simulation. By selecting a mean (μ) and standard deviation (σ) you can generate simulated data representative of the types of models you’re trying to build (and certainly better than simple uniform random number generators). However, you might already be able to tell that selecting μ and σ is a little backward! Typically these metrics are computed from data, not used to describe data. As a result, utilities for tuning the behavior of your random number generators are simply not discussed. ...
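As a small illustration of choosing μ and σ up front and then checking what the generator produces, here is a minimal sketch using only Python's standard library. The function name `simulate_normal`, the seed, and the example values μ = 50 and σ = 5 are assumptions chosen for the example; they do not come from the post.

```python
import random
import statistics


def simulate_normal(mu, sigma, n=10_000, seed=42):
    """Draw n samples from a normal distribution with mean mu and
    standard deviation sigma, using a seeded generator for repeatability."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Pick the parameters first, then inspect the simulated data.
samples = simulate_normal(mu=50.0, sigma=5.0)

# The sample statistics should land close to the chosen parameters.
sample_mean = statistics.mean(samples)
sample_stdev = statistics.stdev(samples)

# Roughly 68% of normally distributed values fall within one sigma of mu.
within_one_sigma = sum(1 for x in samples if abs(x - 50.0) <= 5.0) / len(samples)
```

Comparing the sample mean, the sample standard deviation, and the one-sigma fraction against the chosen parameters is a quick sanity check before plotting anything.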

June 27, 2016 · 2 min · 370 words · Benjamin Bengfort

Background Work with Goroutines on a Timer

As I’m moving deeper into my PhD, I’m getting into more Go programming for the systems that I’m building. One thing that I’m constantly doing is trying to create a background process that runs forever and does some work at an interval. Concurrency in Go is native, and therefore the use of threads and parallel processing is very simple, syntax-wise. However, I am still solving problems that I wanted to make sure I recorded here. ...

June 26, 2016 · 2 min · 256 words · Benjamin Bengfort

Converting NetworkX to Graph-Tool

This week I discovered graph-tool, a Python library for network analysis and visualization that is implemented in C++ with Boost. As a result, it can quickly and efficiently perform manipulations and statistical analyses of graphs, and draw them in a visually pleasing style. It’s like using Python with the performance of C++, and I was rightly excited: It's a bear to get set up, but once you do, things get pretty nice. Moving my network viz over to it now! ...

June 23, 2016 · 3 min · 576 words · Benjamin Bengfort

Natural Language Processing with NLTK and Gensim

Description Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python, Gensim, and the Natural Language Toolkit (NLTK). ...

May 30, 2016 · 2 min · 252 words · Benjamin Bengfort