Computing Reading Speed

Ashley and I have been going over the District Data Labs Blog trying to figure out how to make it more accessible to readers (who are at various levels) and how to encourage writers to contribute. To that end, she’s been exploring other blogs to see if we can put up multiple forms of content: long form tutorials (the bulk of what’s there) and shorter idea articles, possibly even as short as the posts I put on my dev journal. One interesting suggestion she had was to mark the reading time of each post, something that the Longreads Blog does. This may give readers a better sense of the time commitment and help them engage more easily. ...

October 28, 2016 · 4 min · 659 words · Benjamin Bengfort
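As a rough sketch of the reading-time idea, the estimate can be reduced to a word count divided by an assumed reading speed. The function name `reading_minutes` and the default of 200 words per minute are illustrative assumptions, not values taken from the post.

```python
import math
import re


def reading_minutes(text, wpm=200):
    """Return (word_count, estimated_minutes) for a piece of text.

    wpm is an assumed average reading speed; tune it for your audience.
    """
    # Count whitespace-delimited tokens as words.
    words = len(re.findall(r"\S+", text))
    # Round up so that a short post still reports at least one minute.
    return words, max(1, math.ceil(words / wpm))
```

Rounding up is a design choice: it avoids reporting "0 min" for very short posts, at the cost of slightly overstating the time for posts that sit just past a minute boundary.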

Dynamics in Graph Analysis: Adding Time as a Structure for Visual and Statistical Insight

I gave this talk twice, at PyData DC on October 24, 2016 and at PyData Carolinas on September 15, 2016. Both videos are below if you feel like figuring out which presentation was better! Description: Network analyses are powerful methods for both visual analytics and machine learning but can suffer as their complexity increases. By embedding time as a structural element rather than a property, we will explore how time series and interactive analysis can be improved on graph structures. Primarily we will look at decomposition in NLP-extracted concept graphs using NetworkX and Graph Tool. ...

October 24, 2016 · 2 min · 311 words · Benjamin Bengfort

Modifying an Image's Aspect Ratio

When making slides, I generally like to use Flickr to search for images that are licensed via Creative Commons to use as backgrounds. My slide deck tools of choice are either Reveal.js or Google Slides. Both tools allow you to specify an image as a background for the slide, but for Google Slides in particular, if the aspect ratio of the image doesn’t match the aspect ratio of the slide deck, then weird things can happen. ...

September 13, 2016 · 2 min · 389 words · Benjamin Bengfort
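One way to make an image match a slide's aspect ratio is a centered crop. The sketch below only computes the crop box; the function name `crop_box` and the 16:9 default are assumptions for illustration, and the post itself may take a different approach.

```python
def crop_box(width, height, target_w=16, target_h=9):
    """Return (left, top, right, bottom) of a centered crop with the target ratio.

    Integer arithmetic is used throughout to avoid floating point rounding.
    """
    if width * target_h > height * target_w:
        # Image is too wide: keep the full height and trim the sides.
        new_w = height * target_w // target_h
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Image is too tall (or already matches): keep the full width.
    new_h = width * target_h // target_w
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)
```

The returned tuple follows the (left, top, right, bottom) convention that many imaging libraries accept for cropping, so it can be passed on to whichever tool does the actual pixel work.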

Serializing GraphML

This is mostly a post of annoyance. I’ve been working with graphs in Python via NetworkX and trying to serialize them to GraphML for use in Gephi and graph-tool. Unfortunately, the following error is really starting to get on my nerves: networkx.exception.NetworkXError: GraphML writer does not support <class 'datetime.datetime'> as data values. Also, it doesn’t support <type NoneType> or list or dict or … So I have to do something about it: ...

September 9, 2016 · 1 min · 130 words · Benjamin Bengfort
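A minimal sketch of one possible workaround for the GraphML error: convert each attribute dictionary to plain values before writing. The helper name `graphml_safe` and the conversion rules (ISO strings for dates, JSON strings for containers, dropping `None`) are assumptions, not necessarily the fix the post arrives at.

```python
import json
from datetime import date, datetime


def graphml_safe(attrs):
    """Return a copy of an attribute dict holding only str, int, float, or bool."""
    safe = {}
    for key, value in attrs.items():
        if value is None:
            # Drop missing values instead of writing a placeholder.
            continue
        if isinstance(value, (datetime, date)):
            safe[key] = value.isoformat()
        elif isinstance(value, (list, tuple, dict)):
            # Containers become JSON strings; nested odd types fall back to str().
            safe[key] = json.dumps(value, default=str)
        elif isinstance(value, (str, int, float, bool)):
            safe[key] = value
        else:
            safe[key] = str(value)
    return safe
```

With a NetworkX graph, the same function could be applied to every node and edge data dictionary before calling `networkx.write_graphml`. Note that the conversion is lossy: a reader of the file has to know which strings were originally dates or containers.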

Parallel Enqueue and Workers

I was recently asked about the parallelization of both the enqueuing of tasks and their processing. This is a tricky subject because there are a lot of factors that come into play. For example, do you have two parallel phases (e.g. a map and a reduce phase) that need to be synchronized, or is there some sort of data parallelism that requires multiple tasks to be applied to the data (e.g. a Storm-style topology)? While there are a lot of tools for parallel batch processing of large datasets, how do you take care of simple problems with large datasets (say hundreds of gigabytes) on a single machine with a quad core or hyperthreading multiprocessor? ...

September 7, 2016 · 3 min · 602 words · Benjamin Bengfort
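To make the shape of the problem concrete, here is a thread-based sketch in which several producers enqueue tasks while several workers drain the same queue. It uses threads and `queue.Queue` only to stay small and portable; the function names and the squaring "work" are placeholders, and CPU-bound jobs on a multicore machine would more likely use processes with the same overall structure.

```python
import queue
import threading


def enqueue(tasks, chunk):
    """Producer: push every item of one chunk onto the shared task queue."""
    for item in chunk:
        tasks.put(item)


def work(tasks, results):
    """Worker: process tasks until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        results.put(item * item)  # placeholder for the real computation


def run(chunks, n_workers=4):
    tasks, results = queue.Queue(), queue.Queue()
    producers = [threading.Thread(target=enqueue, args=(tasks, c)) for c in chunks]
    workers = [threading.Thread(target=work, args=(tasks, results)) for _ in range(n_workers)]
    for thread in producers + workers:
        thread.start()
    for thread in producers:
        thread.join()
    # All tasks are queued; send one sentinel per worker so each one exits.
    for _ in workers:
        tasks.put(None)
    for thread in workers:
        thread.join()
    # Workers have finished, so the results queue can be drained safely.
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)
```

Sentinels are only sent after every producer has joined, which is the synchronization point between the enqueue phase and worker shutdown.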

Parallel NLP Preprocessing

A common source of natural language corpora is the web, usually in the form of HTML documents. However, in order to actually build models on the natural language, the structured HTML needs to be transformed into units of discourse that can then be used for learning. In particular, we need to strip away extraneous material such as navigation or advertisements, targeting exactly the content we’re looking for. Once done, we need to split paragraphs into sentences, sentences into tokens, and assign part-of-speech tags to each token. The preprocessing therefore transforms HTML documents into a list of paragraphs, which are themselves lists of sentences, which are lists of (token, tag) tuples. ...

August 12, 2016 · 3 min · 615 words · Benjamin Bengfort
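The nested output described in this excerpt (paragraphs, then sentences, then (token, tag) tuples) can be illustrated with a small sketch. The regular expressions here are naive stand-ins for real sentence and word tokenizers, and `dummy_tagger` is a hypothetical placeholder; an actual pipeline would pass in a trained part-of-speech tagger instead.

```python
import re


def preprocess(paragraphs, tagger):
    """Turn paragraph strings into lists of sentences of (token, tag) tuples."""
    return [
        [
            # Tokens are runs of word characters or single punctuation marks.
            tagger(re.findall(r"\w+|[^\w\s]", sentence))
            # Naive split: whitespace that follows ., !, or ?
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip())
            if sentence
        ]
        for paragraph in paragraphs
    ]


def dummy_tagger(tokens):
    """Placeholder tagger: labels every token as unknown."""
    return [(token, "UNK") for token in tokens]
```

Because each paragraph is processed independently, this kind of function is a natural unit to hand to a pool of workers, one document or paragraph at a time.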

Pretty Print Directories

It feels like there are many questions like this one on Stack Overflow: Representing Directory & File Structure in Markdown Syntax, basically asking “how can we represent a directory structure in text in a pleasant way?” I too use these types of text representations in slides, blog posts, books, etc. It would be very helpful if I had an automatic way of doing this so I didn’t have to create it from scratch. ...

August 1, 2016 · 2 min · 387 words · Benjamin Bengfort
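A minimal sketch of automatic directory printing, assuming the familiar box-drawing style of the `tree` command. The function name `tree` and the sorting of entries by name are choices made for this example, not details from the post.

```python
import os


def tree(root, prefix=""):
    """Return a list of text lines that draw the directory structure under root."""
    lines = []
    entries = sorted(os.listdir(root))
    for index, name in enumerate(entries):
        last = index == len(entries) - 1
        lines.append(prefix + ("└── " if last else "├── ") + name)
        path = os.path.join(root, name)
        if os.path.isdir(path):
            # Continue the vertical rule only while siblings remain below.
            lines.extend(tree(path, prefix + ("    " if last else "│   ")))
    return lines
```

Calling `print("\n".join(tree(".")))` writes the structure of the current directory, which can then be pasted into slides or a post.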