Contributing a Multiprocess Memory Profiler

In this post I want to catalog the process of an open source contribution I was part of, which added a feature to the memory profiler Python library by Fabian Pedregosa and Philippe Gervais. It’s a quick story to tell, but it took over a year to complete, and I learned a lot from the process. I hope the story is revealing, particularly to first-time contributors, and shows that even folks who have been doing this for a long time still have to find ways to approach collaboration positively in an open source environment. I also think it’s a fairly standard example of how contributions work in practice, and perhaps this story will help us all think about how to better approach the pull request process. ...

March 20, 2017 · 7 min · 1409 words · Benjamin Bengfort

Pseudo Merkle Tree

A Merkle tree is a data structure in which every non-leaf node is labeled with the hash of its child nodes. This makes Merkle trees particularly useful for comparing large data structures quickly and efficiently. Given trees a and b, if their root hashes differ, then some part of the trees below the roots differs (if the root hashes are identical, the trees are almost certainly identical too). You can then proceed in a breadth-first fashion, pruning nodes with identical hashes to directly identify the differences. ...
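The comparison described above can be sketched in Go. This is a minimal illustration rather than the code from the post: the `Node` type, the `leaf` helper, the choice of SHA-256, and the assumption that both trees share the same shape are all made up for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Node is a hypothetical tree node: leaves carry Data, interior nodes carry Children.
type Node struct {
	Name     string
	Data     []byte
	Children []*Node
	Hash     string
}

// leaf is a small helper (an assumption of this sketch) for building leaf nodes.
func leaf(name, data string) *Node {
	return &Node{Name: name, Data: []byte(data)}
}

// ComputeHash labels every node bottom-up: a leaf hashes its data and an
// interior node hashes the concatenation of its children's hashes.
func (n *Node) ComputeHash() string {
	h := sha256.New()
	if len(n.Children) == 0 {
		h.Write(n.Data)
	} else {
		for _, c := range n.Children {
			h.Write([]byte(c.ComputeHash()))
		}
	}
	n.Hash = hex.EncodeToString(h.Sum(nil))
	return n.Hash
}

// Diff walks both trees breadth first and prunes any pair of nodes whose
// hashes match. It returns the names of the deepest nodes that differ.
// Both trees must already be hashed and are assumed to have the same shape.
func Diff(a, b *Node) []string {
	var diffs []string
	queue := [][2]*Node{{a, b}}
	for len(queue) > 0 {
		x, y := queue[0][0], queue[0][1]
		queue = queue[1:]
		if x.Hash == y.Hash {
			continue // identical subtrees: nothing below needs to be visited
		}
		if len(x.Children) == 0 || len(x.Children) != len(y.Children) {
			diffs = append(diffs, x.Name)
			continue
		}
		for i := range x.Children {
			queue = append(queue, [2]*Node{x.Children[i], y.Children[i]})
		}
	}
	return diffs
}

func main() {
	a := &Node{Name: "root", Children: []*Node{
		{Name: "left", Children: []*Node{leaf("a", "1"), leaf("b", "2")}},
		{Name: "right", Children: []*Node{leaf("c", "3"), leaf("d", "4")}},
	}}
	b := &Node{Name: "root", Children: []*Node{
		{Name: "left", Children: []*Node{leaf("a", "1"), leaf("b", "2")}},
		{Name: "right", Children: []*Node{leaf("c", "3"), leaf("d", "changed")}},
	}}
	a.ComputeHash()
	b.ComputeHash()
	fmt.Println(Diff(a, b))
}
```

Because the two "left" subtrees hash to the same value, the walk never descends into them; only the "right" branch is expanded to find the changed leaf.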

March 16, 2017 · 3 min · 485 words · Benjamin Bengfort

Using Select in Go

Ask a Go programmer what makes Go special and they will immediately say “concurrency is baked into the language”. Go’s concurrency model is one of communication (as opposed to locks) and so concurrency primitives are implemented using channels. In order to synchronize across multiple channels, Go provides the select statement. A common pattern for me has become to use a select to manage broadcast work (either in a publisher/subscriber model or a fanout model) by initializing goroutines and passing them directional channels for synchronization and communication. In the example below, I create a buffered channel for output (so that the workers don’t block waiting for the receiver to collect data), a channel for errors (first error kills the program) and a timer to update the state of my process on a regular basis. The select blocks until one of the channels has a message ready and then continues processing. By keeping the select in a for loop, I can continually read off the channels until I’m done. ...
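The post's own example is not included in this excerpt, so here is a minimal sketch of the pattern it describes. The `worker` and `Collect` functions, the `defaultTick` constant, and the squaring workload are hypothetical; this version also returns the first error to the caller instead of killing the program.

```go
package main

import (
	"fmt"
	"time"
)

// defaultTick is a hypothetical interval for progress updates.
const defaultTick = 100 * time.Millisecond

// worker receives send-only (directional) channels: one for its result and
// one for errors. It squares its input and rejects negative values.
func worker(v int, out chan<- int, errc chan<- error) {
	if v < 0 {
		errc <- fmt.Errorf("negative input: %d", v)
		return
	}
	out <- v * v
}

// Collect fans out one goroutine per input and gathers results with select.
func Collect(inputs []int, tick time.Duration) (int, error) {
	// Buffered so workers never block waiting for the receiver, even if
	// Collect returns early on the first error.
	out := make(chan int, len(inputs))
	errc := make(chan error, len(inputs))

	ticker := time.NewTicker(tick)
	defer ticker.Stop()

	for _, v := range inputs {
		go worker(v, out, errc)
	}

	sum, received := 0, 0
	for received < len(inputs) {
		select {
		case r := <-out:
			sum += r
			received++
		case err := <-errc:
			return sum, err // the first error stops collection
		case <-ticker.C:
			fmt.Printf("progress: %d/%d results\n", received, len(inputs))
		}
	}
	return sum, nil
}

func main() {
	sum, err := Collect([]int{1, 2, 3, 4}, defaultTick)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("sum of squares:", sum)
}
```

Keeping the select inside the for loop is what lets a single goroutine service results, errors, and timer ticks in whatever order they arrive.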

March 8, 2017 · 2 min · 386 words · Benjamin Bengfort

Benchmarking Secure gRPC

A natural question to ask after the previous post is “how much overhead does security add?” So I’ve benchmarked the three methods discussed: mutual TLS, server-side TLS, and no encryption. The results are below: Here are the numeric results for one of the runs:

    BenchmarkMutualTLS-8     200    9331850 ns/op
    BenchmarkServerTLS-8     300    5004505 ns/op
    BenchmarkInsecure-8     2000    1179252 ns/op
    PASS
    ok      github.com/bbengfort/sping    7.364s

Here is the code for the benchmarking for reference: ...

March 5, 2017 · 1 min · 162 words · Benjamin Bengfort

Secure gRPC with TLS/SSL

One of the primary requirements for the systems we build is something we call the “minimum security requirement”. Although our systems are not designed specifically for high-security applications, they must use minimum standards of encryption and authentication. For example, it seems obvious to me that a web application that stores passwords or credit card information would encrypt its data on disk on a per-record basis with a salted hash. In the same way, a distributed system must be able to handle encrypted blobs, encrypt all inter-node communication, and authenticate and sign all messages. This adds some overhead to the system, but the cost of overhead is far smaller than the cost of a breach, and if minimum security is the baseline then the overhead is just an accepted part of doing business. ...

March 3, 2017 · 10 min · 2128 words · Benjamin Bengfort

Synchronizing Structs for Safe Concurrency in Go

Go is built for concurrency: it provides language features that allow developers to embed complex concurrency patterns into their applications. These language features can be intuitive and a lot of safety is built in (for example, a race detector), but developers still need to be aware of the interactions between the various threads in their programs. In any shared memory system the biggest concern is synchronization: ensuring that separate goroutines operate in the correct order and that no race conditions occur. The primary way to handle synchronization is the use of channels. Channels synchronize execution by forcing a send on an unbuffered channel to block until the value is received. In this way, channels act as a barrier: a goroutine cannot progress while it is blocked on the channel, which enforces a specific ordering of execution, namely the order in which goroutines arrive at the barrier. ...
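A minimal sketch of the barrier behavior described above, assuming unbuffered channels. The `Run` function and the step names are hypothetical and chosen only to make the ordering visible.

```go
package main

import "fmt"

// Run hands control back and forth between a worker goroutine and the caller
// using an unbuffered channel, so the shared slice is only ever touched by
// one goroutine at a time and the steps are recorded in a fixed order.
func Run() []string {
	var steps []string
	turn := make(chan struct{}) // unbuffered: a send blocks until it is received
	done := make(chan struct{})

	go func() {
		steps = append(steps, "worker: step 1")
		turn <- struct{}{} // blocks until the caller receives
		<-turn             // blocks until the caller hands control back
		steps = append(steps, "worker: step 3")
		close(done)
	}()

	<-turn // wait for the worker to finish step 1
	steps = append(steps, "main: step 2")
	turn <- struct{}{} // hand control back to the worker
	<-done             // wait for the worker to finish step 3
	return steps
}

func main() {
	for _, s := range Run() {
		fmt.Println(s)
	}
}
```

Each append is ordered by a channel operation, so no mutex is needed around the slice in this sketch.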

February 21, 2017 · 4 min · 852 words · Benjamin Bengfort

Fixed vs. Variable Length Chunking

FluidFS and other file systems break large files into recipes of hash-identified blobs of binary data. Blobs can then be replicated with far more ease than a single file, as well as streamed from disk in a memory-safe manner. Blobs are treated as single, independent units, so the underlying data store doesn’t grow as files are duplicated. Finally, blobs can be encrypted individually and provide more opportunities for privacy. ...
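The idea of a recipe of hash-identified blobs can be sketched as follows. This is not FluidFS code: the `Store` type, its `Put` and `Get` methods, and the use of SHA-256 are assumptions for illustration, and only the fixed-length case is shown.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Store is a hypothetical in-memory blob store keyed by content hash.
type Store map[string][]byte

// Put splits data into fixed-size chunks, stores each chunk under its
// SHA-256 hash, and returns the recipe (the ordered list of hashes).
// chunkSize must be greater than zero.
func (s Store) Put(data []byte, chunkSize int) []string {
	var recipe []string
	for start := 0; start < len(data); start += chunkSize {
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[start:end])
		key := hex.EncodeToString(sum[:])
		if _, ok := s[key]; !ok {
			// Copy the chunk so the store does not alias the caller's slice.
			s[key] = append([]byte(nil), data[start:end]...)
		}
		recipe = append(recipe, key)
	}
	return recipe
}

// Get reassembles the original bytes by following the recipe in order.
func (s Store) Get(recipe []string) []byte {
	var out []byte
	for _, key := range recipe {
		out = append(out, s[key]...)
	}
	return out
}

func main() {
	store := Store{}
	recipe := store.Put([]byte("aaaabbbbaaaacc"), 4)
	store.Put([]byte("aaaabbbbaaaacc"), 4) // a duplicate file adds no new blobs
	fmt.Println("chunks in recipe:", len(recipe))
	fmt.Println("blobs in store:", len(store))
	fmt.Println("round trip:", string(store.Get(recipe)))
}
```

Identical chunks hash to the same key, which is why storing a duplicate file (or a repeated chunk within one file) does not grow the store.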

February 8, 2017 · 3 min · 443 words · Benjamin Bengfort