Gold Master Testing
Automatically Validating Millions of Data Points
Eyelids closed, gold sun shines on
The world’s coated in the gold Krylon.
–Macklemore and Ryan Lewis, featuring Eighty4 Fly
At Code Climate, we feel it’s critical to deliver dependable, accurate static analysis results. To do so, we employ a variety of quality assurance techniques, including unit tests, acceptance tests, manual testing, and incremental rollouts. They’re all valuable, but they still left us with too much risk of introducing hard-to-detect bugs. To fill the gap, we’ve added a new tactic to our arsenal: gold master testing.
Gold master testing refers to capturing the result of a process, and then comparing future runs against the saved “gold master” (or known good) version to discover unexpected changes. For us, that means running full Code Climate analyses of a number of open source repos, and validating every aspect of the result. We only started doing this last week, but it’s already caught some hard-to-detect bugs that we otherwise may not have discovered until code hit production.
Why gold master testing?
Gold master testing is common when working with legacy code. Rather than trying to specify all of the logical paths through an untested module, you can feed it a varied set of inputs and turn the outputs into automatically verifying tests. There’s no guarantee the outputs are correct in this case, but at least you can be sure they don’t change (which, in some systems, is even more important).
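To make that concrete, here is a minimal sketch of the legacy-code flavor of this idea, sometimes called a characterization test. Everything in it is hypothetical (LegacyPricer, the fixture paths), and the expected outputs are assumed to have been captured from a known-good run rather than specified by hand:

    require "minitest/autorun"
    require "json"

    class LegacyPricerGoldMasterTest < Minitest::Test
      def test_outputs_have_not_changed
        # Inputs and expected outputs were captured from a known-good run.
        inputs   = JSON.parse(File.read("fixtures/pricer_inputs.json"))
        expected = JSON.parse(File.read("fixtures/pricer_gold_master.json"))

        actual = inputs.map { |input| LegacyPricer.new(input).price }

        # Any difference means behavior changed -- not necessarily that it broke.
        assert_equal expected, actual
      end
    end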
For us, given that we have a relatively reliable and comprehensive set of unit tests for our analysis code, the situation is a bit different. In short, we find gold master testing valuable because of three key factors:
- The inputs to and outputs from our analysis are extremely detailed. There are a huge number of syntactic structures in Ruby, and we derive a ton of information from them.
- Our analysis depends on external code that we do not control but want to update from time to time (e.g. RubyParser).
- We are extremely sensitive to any changes in results. For example, even a tiny variance in our detection of complex methods across the 20k repositories we analyze would ripple into changes of class ratings, resulting in incorrect metrics being delivered to our customers.
These add up to mean that traditional unit and acceptance testing is necessary but not sufficient. We use unit and acceptance tests to provide faster results and more localized detection of regressions, but we use our gold master suite (nicknamed Krylon) to sanity check our results against a dozen or so repositories before deploying changes.
How to implement gold master testing
The high level plan is pretty straightforward:
- Choose (or randomly generate, using a known seed) a set of inputs for your module or program.
- Run the inputs through a known-good version of the system, persisting the output.
- When testing a change, run the same inputs through the new version of the system and flag any output variation.
- For each variation, have a human determine whether or not the change is expected and desirable. If it is, update the persisted gold master records.
The devil is in the details, of course. In particular, because the outputs of our system are non-trivial (in our case, a set of MongoDB documents spanning multiple collections), persisting them was a little tricky. We could keep them in MongoDB, of course, but that would not make them as accessible to humans (and tools like diff and GitHub) as a plain-text format like JSON would. So I wrote a little bit of code to dump records out as JSON:
dir = "krylon/#{slug}"
repo_id = Repo.create!(url: "git://github.com/#{slug}")
run_analysis(repo_id)
FileUtils.mkdir_p(dir)
%w[smells constants etc.].each do |coll|
File.open("#{dir}/#{coll}.json", "w") do |f|
docs = db[coll].find(repo_id: repo_id).map do |doc|
round_floats(doc.except(*ignored_fields))
end
sorted_docs = JSON.parse(docs.sort_by(&:to_json).to_json)
f.puts JSON.pretty_generate(sorted_docs)
end
end
Then there is the matter of comparing the results of a test run against the gold master. Ruby has a lot of built-in functionality that makes this relatively easy, but it took a few tries to get a harness set up properly. We ended up with something like this:
dir = "krylon/#{slug}"
repo_id = Repo.create!(url: "git://github.com/#{slug}")
run_analysis(repo_id)
%w[smells constants etc.].each do |coll|
actual_docs = db[coll].find(repo_id: repo_id).to_a
expected_docs = JSON.parse(File.read("#{dir}/#{coll}.json"))
actual_docs.each do |actual|
actual = JSON.parse(actual.to_json).except(*ignored_fields)
if (index = expected_docs.index(actual))
# Delete the match so it can only match one time
expected_docs.delete_at(index)
else
puts "Unable to find match:"
puts JSON.pretty_generate(JSON.parse(actual.to_json))
puts
puts "Expected:"
puts JSON.pretty_generate(JSON.parse(expected_docs.to_json))
raise
end
end
if expected_docs.empty?
puts " PASS #{coll} (#{actual_docs.count} docs)"
else
puts "Expected not empty after search. Remaining:"
puts JSON.pretty_generate(JSON.parse(expected_docs.to_json))
raise
end
end
All of this is invoked by a couple of Rake tasks:
    rake krylon:save[brynary/rack-test]      # Save the results to disk
    rake krylon:validate[brynary/rack-test]  # Validate against the gold master
Our CI system runs the rake krylon:validate task. If it fails, someone on the Code Climate team reviews the results, and either fixes an issue or uses rake krylon:save to update the gold master.
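For reference, here is a rough sketch of how tasks like these could be wired up in a Rakefile. The Krylon module and its save/validate methods are assumed names standing in for the dump and comparison code above, and the :environment dependency assumes a Rails-style setup; this is not the actual Code Climate implementation:

    namespace :krylon do
      desc "Save analysis results to disk as the gold master"
      task :save, [:slug] => :environment do |_t, args|
        Krylon.save(args[:slug])      # runs the dump code shown earlier
      end

      desc "Validate analysis results against the gold master"
      task :validate, [:slug] => :environment do |_t, args|
        Krylon.validate(args[:slug])  # raises if any document differs
      end
    end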
Gotchas
In building Krylon, we ran into a few issues. They were all pretty simple to fix, but I’ll list them here to hopefully save someone some time:
- Floats – Floating point numbers cannot be reliably compared using the equality operator. We took the approach of rounding them to two decimal places, and that has been working so far (a sketch of one possible rounding helper follows this list).
- Timestamps – Columns like created_at and updated_at will vary every time your code runs. We just exclude them.
- Record IDs – Same as above.
- Non-deterministic ordering of hash keys and arrays – This took a bit more time to track down. Sometimes Code Climate would generate hashes or arrays, but the order of those data structures was undefined and variable. We had two choices: update the Krylon validation code to allow this, or make them deterministic. We went with updating the production code to be deterministic with respect to order because it was simple.
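As promised above, here is a minimal sketch of what a helper like round_floats (used in the dump code earlier) could look like. The two-decimal rounding and the recursion over nested hashes and arrays are assumptions based on the description, not the actual implementation:

    # Recursively round every Float to two decimal places so that tiny
    # floating point variations do not show up as gold master differences.
    def round_floats(value)
      case value
      when Float then value.round(2)
      when Hash  then value.each_with_object({}) { |(k, v), h| h[k] = round_floats(v) }
      when Array then value.map { |v| round_floats(v) }
      else value
      end
    end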
Wrapping up
Gold master testing is not a substitute for unit tests and acceptance tests. However, it can be a valuable tool in your toolbox for dealing with legacy systems, as well as certain specialized cases. It’s a fancy name, but implementing a basic system took less than a day and began yielding benefits right away. Like us, you can start with something simple and rough, and iterate on it down the road.