Posts

Message Latency: Ping vs. gRPC

Building distributed systems means passing messages between devices over a network connection. My research specifically considers networks that have extremely variable latencies or that can be partition-prone. This led me to the natural question, “how variable are real-world networks?” To get real numbers, I built a simple echo protocol using Go and gRPC called Orca. I ran Orca for a few days and got some latency measurements as I traveled around with my laptop. Orca does a lot of work, including GeoIP lookups, IP address resolution, and database queries and storage. This post, however, is not about Orca. The latencies I was getting were very high relative to the round-trip latencies reported by the simple ping command, which uses ICMP echo requests. ...

Computing Reading Speed

Ashley and I have been going over the District Data Labs Blog trying to figure out how to make it more accessible to readers (who are at various levels) and to encourage writers to contribute. To that end, she’s been exploring other blogs to see if we can put up multiple forms of content: long-form tutorials (the bulk of what’s there) and shorter idea articles, possibly even as short as the posts I put on my dev journal. One interesting suggestion she had was to mark the reading time of each post, something that the Longreads Blog does. This may give readers a better sense of the time commitment and help them engage more easily. ...
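A reading-time estimate of this kind can be sketched as a word count divided by a reading speed. The snippet below is only an illustration: the 250 words-per-minute figure and the `reading_time` name are assumptions, not values taken from the post.

```python
import math
import re

# Assumed average reading speed in words per minute; tune for your audience.
WORDS_PER_MINUTE = 250


def reading_time(text, wpm=WORDS_PER_MINUTE):
    """Return the estimated reading time of text in whole minutes (at least 1)."""
    # Count runs of word characters as words; markup should be stripped first.
    words = re.findall(r"\w+", text)
    # Round up so a post never claims to take less time than it does.
    return max(1, math.ceil(len(words) / wpm))
```

Rounding up and clamping to one minute keeps very short posts from showing "0 min read".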

Dynamics in Graph Analysis: Adding Time as a Structure for Visual and Statistical Insight

I gave this talk twice: at PyData DC on October 24, 2016 and at PyData Carolinas on September 15, 2016. Both videos are below if you feel like figuring out which presentation was better! PyData DC; PyData Carolinas; Slides. Description: Network analyses are powerful methods for both visual analytics and machine learning but can suffer as their complexity increases. By embedding time as a structural element rather than a property, we will explore how time series and interactive analysis can be improved on graph structures. Primarily we will look at decomposition in NLP-extracted concept graphs using NetworkX and graph-tool. ...

Modifying an Image's Aspect Ratio

When making slides, I generally like to use Flickr to search for images that are licensed via Creative Commons to use as backgrounds. My slide deck tools of choice are either Reveal.js or Google Slides. Both tools allow you to specify an image as a background for the slide, but for Google Slides in particular, if the aspect ratio of the image doesn’t match the aspect ratio of the slide deck, then weird things can happen. ...
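One way to avoid the mismatch is to crop the image to the slide's aspect ratio before uploading it. The sketch below computes the largest centered crop box for a target ratio; the `crop_box` name and the 16:9 default are assumptions for illustration, not necessarily the approach the post takes.

```python
def crop_box(width, height, target_w=16, target_h=9):
    """Return (left, top, right, bottom) of the largest centered crop
    whose aspect ratio is target_w:target_h."""
    # Compare ratios by cross-multiplying to stay in integer arithmetic.
    if width * target_h > height * target_w:
        # Image is too wide: keep the full height and trim the sides.
        new_w = height * target_w // target_h
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Image is too tall (or already matches): keep the full width.
    new_h = width * target_h // target_w
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)
```

The returned tuple uses the (left, upper, right, lower) ordering that Pillow's `Image.crop` expects, so it could be passed to that method directly if Pillow is the imaging library in use.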

Serializing GraphML

This is mostly a post of annoyance. I’ve been working with graphs in Python via NetworkX and trying to serialize them to GraphML for use in Gephi and graph-tool. Unfortunately the following error is really starting to get on my nerves: networkx.exception.NetworkXError: GraphML writer does not support <class 'datetime.datetime'> as data values. Also it doesn’t support <type NoneType> or list or dict or … So I have to do something about it: ...
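A common workaround is to coerce unsupported attribute values into strings before writing the graph. The following is a minimal sketch using only the standard library; the helper names `graphml_safe` and `clean_attrs` are hypothetical, and the choices made here (ISO 8601 for dates, an empty string for None, JSON for containers) are assumptions rather than the post's actual solution.

```python
import json
from datetime import date, datetime


def graphml_safe(value):
    """Coerce a value into a type a GraphML writer can serialize."""
    if isinstance(value, (bool, int, float, str)):
        return value                      # already supported
    if isinstance(value, (datetime, date)):
        return value.isoformat()          # e.g. "2016-10-24T09:30:00"
    if value is None:
        return ""                         # assumption: empty string stands in for None
    # Lists, dicts, and anything else: store as a JSON string.
    return json.dumps(value, default=str)


def clean_attrs(attrs):
    """Return a copy of an attribute dict with every value made GraphML-safe."""
    return {key: graphml_safe(val) for key, val in attrs.items()}
```

With NetworkX, something like this could be applied to each node and edge attribute dictionary before calling `nx.write_graphml`; whoever reads the file then has to parse the date and JSON strings back.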

Parallel Enqueue and Workers

I was recently asked about the parallelization of both the enqueuing of tasks and their processing. This is a tricky subject because there are a lot of factors that come into play. For example, do you have two parallel phases, e.g. a map and a reduce phase that need to be synchronized, or is there some sort of data parallelism that requires multiple tasks to be applied to the data (e.g. a Storm-style topology)? While there are a lot of tools for parallel batch processing of large datasets, how do you take care of simple problems with large datasets (say hundreds of gigabytes) on a single machine with a quad-core or hyperthreaded processor? ...
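The basic shape of parallel enqueueing plus parallel workers can be sketched with a bounded queue shared by producer and consumer threads. This is an illustrative pattern only, not the post's implementation: `run_pipeline` and its parameters are hypothetical names, and threads are used purely to keep the example short. In CPython, CPU-bound tasks would need the same pattern built on `multiprocessing` to use several cores.

```python
import queue
import threading

_DONE = object()  # sentinel telling a worker to stop


def run_pipeline(items, task, n_producers=2, n_workers=4):
    """Enqueue items from several producers while several workers apply task."""
    tasks = queue.Queue(maxsize=100)  # bounded so producers cannot outrun workers
    results, lock = [], threading.Lock()

    def produce(chunk):
        for item in chunk:
            tasks.put(item)           # blocks while the queue is full

    def work():
        while True:
            item = tasks.get()
            if item is _DONE:
                break
            out = task(item)
            with lock:                # guard the shared results list
                results.append(out)

    items = list(items)
    chunks = [items[i::n_producers] for i in range(n_producers)]
    producers = [threading.Thread(target=produce, args=(c,)) for c in chunks]
    workers = [threading.Thread(target=work) for _ in range(n_workers)]
    for t in producers + workers:
        t.start()
    for t in producers:
        t.join()                      # all real tasks are now enqueued
    for _ in workers:
        tasks.put(_DONE)              # one sentinel per worker
    for t in workers:
        t.join()
    return results                    # order is not guaranteed
```

The sentinels are enqueued only after every producer has finished, which is the synchronization point between the enqueue phase and worker shutdown.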

Parallel NLP Preprocessing

A common source of natural language corpora is the web, usually in the form of HTML documents. However, in order to actually build models on the natural language, the structured HTML needs to be transformed into units of discourse that can then be used for learning. In particular, we need to strip away extraneous material such as navigation or advertisements, targeting exactly the content we’re looking for. Once done, we need to split paragraphs into sentences, sentences into tokens, and assign part-of-speech tags to each token. The preprocessing therefore transforms HTML documents into a list of paragraphs, which are themselves lists of sentences, which are lists of (token, tag) tuples. ...
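The target data structure can be illustrated with a standard-library-only sketch. Everything here is a stand-in: the regex sentence splitter and tokenizer are deliberately naive, `placeholder_tag` assigns a dummy "UNK" tag where a real part-of-speech tagger (for example from NLTK) would go, and all names are hypothetical.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs, self._buffer, self._in_p = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = " ".join("".join(self._buffer).split())  # normalize whitespace
            if text:
                self.paragraphs.append(text)
            self._buffer, self._in_p = [], False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def placeholder_tag(token):
    """Stand-in for a real part-of-speech tagger."""
    return (token, "UNK")


def preprocess(html):
    """HTML -> list of paragraphs -> lists of sentences -> lists of (token, tag)."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    document = []
    for para in parser.paragraphs:
        sentences = re.split(r"(?<=[.!?])\s+", para)        # naive sentence split
        document.append([
            [placeholder_tag(tok) for tok in re.findall(r"\w+|[^\w\s]", sent)]
            for sent in sentences if sent
        ])
    return document
```

Because each document is processed independently, a function like `preprocess` could be mapped over a corpus by a pool of worker processes.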