In my current experimental setup, each process is a single instance of a sample, from start to finish. This means that I need to aggregate results across multiple processes that run concurrently. Moreover, I may need to aggregate those results across machines.
One of the most compact formats for storing results is CSV. This was my first approach, and it had some benefits, including:
- small file sizes
- readability
- CSV files can just be concatenated together
The problems were:
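- headers become very difficult to manage, especially once files are concatenated or the set of fields changes between results versions
- everything is a string, so there are no int or float types without parsing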
The headers problem is really the biggest one, since I need future me to be able to read the results files and understand what’s going on in them. I therefore opted instead for the .jsonl format, where each line is a JSON object and objects are newline-delimited. Though far more verbose than CSV, it avoids the headers problem, because every record carries its own field names, and it allows me to aggregate different results versions with ease. Again, I can just concatenate the results from different files together.
This is becoming so common in my Go code that I wrote a simple function for it. It takes the path to append to and the JSON value (an interface{}) as input, and appends the marshaled data to disk.
Now my current worry is atomic appends from multiple processes (is this possible?!). I was hoping that the file system would lock the file between writes, but I’m not sure it does; see the question “Is file append atomic in UNIX?”. Anyway, more on that later.