Performance is key when building streaming gRPC services. When you’re trying to maximize throughput (e.g. messages per second) benchmarking is essential to understanding where the bottlenecks in your application are.

However, as a start, you can pretty much guarantee that one bottleneck is going to be the serialization (marshaling) and deserialization (unmarshaling) of protocol buffer messages.

We have a use case where the server does not need all of the information in the message in order to process the message. E.g. we have header information such as IDs and client information that the server does need to update as part of processing. The other part of the message is data that needs to be saved to disk and does not have to be unmarshaled until it’s read. However, our protocol buffer schema right now is “flat” — meaning that all fields whether they are required for processing or not are defined by a single protocol buffer message.

So we thought - could we break the flat protocol buffer message into two parts with one part wrapping the other? E.g. the outer message contains just the information the server needs for processing and the inner message remains marshalled until it is needed? Would this increase the performance of marshaling and unmarshaling?

The answer surprised me — yes, sort of. In the figure below, smaller throughput is better (e.g. it is faster):

Serialization Benchmark

The event size in this case is the size of the inner message which is just bytes. When the event size is “small” (less than 16KiB) then having a wrapped message outperforms the flat message for both marshaling and unmarshaling. However, as the inner message gets larger, serialization of the flat message gets faster.

My hypothesis for this is that the serializing a fixed-size outer wrapper is a constant cost; but the serializer does still have to read the entire data field into memory. At some point the time that reading the data field into memory takes starts to outweigh the benefits of the wrapper object.

Also, don’t get me wrong; overall it will take longer to serialize the entire message. The client will have to first serialize the inner message, then the outer message, which will take longer on the client side; and when reading the inner message will have to be deserialized again. However, having the client do the work does increase the throughput of the server, so it’s worth it to us.

Also, in the case of unmarshaling when you have nested types, the number of allocs falls by the number of nested types in the inner struct - another bonus!

The full code for this benchmark can be found here:

More detailed results:

Tiny/FlatMarshal-10         	 1945454	       607.1 ns/op	    1536 B/op	       1 allocs/op
Tiny/WrappedMarshal-10      	 2869138	       418.6 ns/op	    1536 B/op	       1 allocs/op
XSmall/FlatMarshal-10       	 1202126	       919.8 ns/op	    4864 B/op	       1 allocs/op
XSmall/WrappedMarshal-10    	 1773110	       797.0 ns/op	    4864 B/op	       1 allocs/op
Small/FlatMarshal-10        	 1000000	      1217 ns/op	    9472 B/op	       1 allocs/op
Small/WrappedMarshal-10     	 1000000	      1076 ns/op	    9472 B/op	       1 allocs/op
Medium/FlatMarshal-10       	  331611	      3792 ns/op	   40960 B/op	       1 allocs/op
Medium/WrappedMarshal-10    	  335064	      3512 ns/op	   40960 B/op	       1 allocs/op
Large/FlatMarshal-10        	   18306	     64584 ns/op	 1474560 B/op	       1 allocs/op
Large/WrappedMarshal-10     	   16376	     69056 ns/op	 1474561 B/op	       1 allocs/op
XLarge/FlatMarshal-10       	    7090	    167709 ns/op	 5251074 B/op	       1 allocs/op
XLarge/WrappedMarshal-10    	    6192	    174831 ns/op	 5251075 B/op	       1 allocs/op
Tiny/FlatUnmarshal-10       	 1000000	      1102 ns/op	    2280 B/op	      24 allocs/op
Tiny/WrappedUnmarshal-10    	 1783885	       691.9 ns/op	    2008 B/op	      14 allocs/op
XSmall/FlatUnmarshal-10     	  794436	      1528 ns/op	    5352 B/op	      24 allocs/op
XSmall/WrappedUnmarshal-10  	 1256288	       942.3 ns/op	    5464 B/op	      14 allocs/op
Small/FlatUnmarshal-10      	  631819	      1936 ns/op	    9448 B/op	      24 allocs/op
Small/WrappedUnmarshal-10   	  986050	      1243 ns/op	   10072 B/op	      14 allocs/op
Medium/FlatUnmarshal-10     	  320212	      3653 ns/op	   34024 B/op	      24 allocs/op
Medium/WrappedUnmarshal-10  	  322702	      3714 ns/op	   41560 B/op	      14 allocs/op
Large/FlatUnmarshal-10      	   21088	     52541 ns/op	 1475816 B/op	      24 allocs/op
Large/WrappedUnmarshal-10   	   19327	     56723 ns/op	 1475160 B/op	      14 allocs/op
XLarge/FlatUnmarshal-10     	    8012	    131589 ns/op	 5252329 B/op	      24 allocs/op
XLarge/WrappedUnmarshal-10  	    8575	    146534 ns/op	 5251674 B/op	      14 allocs/op