Generic JSON Serialization with Go

This post is just a reminder as I work through handling JSON data with Go. Go provides first-class JSON support through its standard library encoding/json package. The interface is simple, primarily through the json.Marshal and json.Unmarshal functions, which are analogous to typed versions of Python's json.dumps and json.loads. Type safety is the trick, however, and generally speaking you define a struct to serialize and deserialize as follows:

    type Person struct {
        Name   string `json:"name,omitempty"`
        Age    int    `json:"age,omitempty"`
        Salary int    `json:"-"`
    }

    op := &Person{Name: "John Doe", Age: 42}
    data, _ := json.Marshal(op)

    var np Person
    json.Unmarshal(data, &np)

So this is all well and good, until you start wanting to just send around arbitrary data. Luckily the json package will allow you to do that using reflection to load data into a map[string]interface{}, i.e. a dictionary whose keys are strings and whose values are any arbitrary type (anything that implements the empty interface, that is, has zero or more methods, which all Go types do). So you might see code like this: ...
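As a rough sketch of the generic decoding described above (this is not the code elided from the excerpt), the example below unmarshals a JSON object into a map[string]interface{} and reads values back with type assertions. The helper name decodeGeneric and the sample payload are assumptions made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeGeneric is a hypothetical helper for this sketch: it unmarshals a
// JSON object into a generic map without requiring a predefined struct.
func decodeGeneric(data []byte) (map[string]interface{}, error) {
	var obj map[string]interface{}
	if err := json.Unmarshal(data, &obj); err != nil {
		return nil, err
	}
	return obj, nil
}

func main() {
	// Sample payload invented for this example.
	data := []byte(`{"name": "John Doe", "age": 42, "tags": ["a", "b"]}`)

	obj, err := decodeGeneric(data)
	if err != nil {
		fmt.Println("could not decode:", err)
		return
	}

	// Values come back as interface{}, so a type assertion is needed to use
	// them. JSON numbers decode to float64 by default, not int.
	if name, ok := obj["name"].(string); ok {
		fmt.Println("name:", name)
	}
	if age, ok := obj["age"].(float64); ok {
		fmt.Println("age:", age)
	}
}
```

The trade-off is that the compiler can no longer check field names or types, so every access needs a checked type assertion like the ones above.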

January 18, 2017 · 1 min · 209 words · Benjamin Bengfort

Yielding Functions for Iteration in Go

It is very common for me to design code that expects functions to return an iterable context, particularly because I have been developing in Python with the yield statement. The yield statement allows functions to “return” the execution context to the caller while still maintaining state, such that the caller can hand control back to the function and continue to iterate. It does this by actually returning a generator, an iterable object constructed from the local state of the closure. ...
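One common way to approximate a Python-style generator in Go is to send values on a channel from a goroutine and let the caller range over it. The sketch below illustrates that idea only; it is not necessarily the approach the full post takes, and the function name countTo is hypothetical.

```go
package main

import "fmt"

// countTo is a hypothetical generator-style function: it "yields" the
// integers 1..n by sending them on a channel from a goroutine, then closes
// the channel so the caller's range loop terminates.
func countTo(n int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 1; i <= n; i++ {
			out <- i // blocks until the caller receives the value
		}
	}()
	return out
}

func main() {
	// The caller iterates much like a Python for loop over a generator.
	for v := range countTo(3) {
		fmt.Println(v)
	}
}
```

Note that if the caller stops receiving before the channel is closed, the sending goroutine in this sketch stays blocked, so a real implementation would likely need a way to signal cancellation.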

December 22, 2016 · 3 min · 573 words · Benjamin Bengfort

Modifying an Image's Aspect Ratio

When making slides, I generally like to use Flickr to search for images that are licensed via Creative Commons to use as backgrounds. My slide deck tools of choice are either Reveal.js or Google Slides. Both tools allow you to specify an image as a background for the slide, but for Google Slides in particular, if the aspect ratio of the image doesn’t match the aspect ratio of the slide deck, then weird things can happen. ...

September 13, 2016 · 2 min · 389 words · Benjamin Bengfort

Serializing GraphML

This is mostly a post of annoyance. I’ve been working with graphs in Python via NetworkX and trying to serialize them to GraphML for use in Gephi and graph-tool. Unfortunately the following error is really starting to get on my nerves: networkx.exception.NetworkXError: GraphML writer does not support <class 'datetime.datetime'> as data values. Also it doesn’t support <type NoneType> or list or dict or … So I have to do something about it: ...

September 9, 2016 · 1 min · 130 words · Benjamin Bengfort

Parallel Enqueue and Workers

I was recently asked about the parallelization of both the enqueuing of tasks and their processing. This is a tricky subject because there are a lot of factors that come into play. For example, do you have two parallel phases, e.g. a map and a reduce phase that need to be synchronized, or is there some sort of data parallelism that requires multiple tasks to be applied to the data (e.g. a Storm-style topology)? While there are a lot of tools for parallel batch processing of large datasets, how do you take care of simple problems with large datasets (say hundreds of gigabytes) on a single machine with a quad core or hyperthreading multiprocessor? ...

September 7, 2016 · 3 min · 602 words · Benjamin Bengfort

Parallel NLP Preprocessing

A common source of natural language corpora is the web, usually in the form of HTML documents. However, in order to actually build models on the natural language, the structured HTML needs to be transformed into units of discourse that can then be used for learning. In particular, we need to strip away extraneous material such as navigation or advertisements, targeting exactly the content we’re looking for. Once done, we need to split paragraphs into sentences, sentences into tokens, and assign part-of-speech tags to each token. The preprocessing therefore transforms HTML documents into a list of paragraphs, which are themselves lists of sentences, which are lists of (token, tag) tuples. ...

August 12, 2016 · 3 min · 615 words · Benjamin Bengfort

Pretty Print Directories

It feels like there are many questions like this one on Stack Overflow: Representing Directory & File Structure in Markdown Syntax, basically asking “how can we represent a directory structure in text in a pleasant way?” I too use these types of text representations in slides, blog posts, books, etc. It would be very helpful if I had an automatic way of doing this so I didn’t have to create it from scratch. ...

August 1, 2016 · 2 min · 387 words · Benjamin Bengfort