What is a Focus Document?
The Factory 2.0 team produces a confusing number of documents. The first round was about the Problem Statements we were trying to solve. Let’s retroactively call them Problem Documents. The Focus Documents (like this one) focus on some system or some aspect of our solutions that cuts across different problems. The content here doesn’t fit cleanly in one problem statement document, which is why we broke it out.
Background on ResultsDB
- ResultsDB is a database for storing results. Unsurprising!
- It is a passive system; it doesn’t actively do anything.
- It has an HTTP REST interface. You POST new results to it and GET them out.
- It was written by Josef Skladanka of Tim Flink’s Fedora QA team.
- It was originally written as a part of the larger Taskotron system, but we’re using it independently here.
What problems can we solve with it?
In formal Factory 2.0 problem statement terms, this helps us solve the Serialization and Automation problems directly, and indirectly all of the problems that depend on those two.
Beyond that, let’s look at fragmentation. The goal of Central CI in Red Hat was to consolidate all of the fragmentation around various CI solutions. This was a success in terms of operational and capex costs -- instead of 100 different CI systems running on 100 different servers sitting under 100 different desks, we have one Central CI infrastructure backed by OpenStack serving on-demand Jenkins masters. Win. A side-effect of this has been that teams can request and configure their own Jenkins masters, without commonality in their configuration. While teams are incentivized to move to a common test execution tool (Jenkins), there’s no common way to organize jobs and results. While we reduced fragmentation at one level, it remains untouched at another. People speak of this as the problem of “the fourteen Jenkins masters” of Platform QE.
Beyond Jenkins, some PnT DevOps tools perform tasks that are QE-esque yet are not part of the Central CI infrastructure. Notably, the Errata Tool directly runs jobs like covscan, rpmgrill, rpmdiff, and TPS/insanity that are unnecessarily tied to the “release checklist” phase of the workflow. They could benefit from the common infrastructure of Central CI.
One option could be to attempt to corral all of the various dev and QE groups into getting onto the same platform and configuring their jobs the same way. That’s a possibility, but there is a high cost to achieving that level of social coordination.
Instead, we intend to use resultsdb and a small number of messagebus hooks to insulate consuming services from the details of job execution.
Wait! Why not an ELK stack?
ELK is cool, and people are putting data in it anyways. Why bother standing up resultsdb if some or all of this data is going to be in ELK?
- ELK has a schema that is very unopinionated. You can store anything in it. This is attractive, because there is a low barrier to entry for getting stuff in. When it comes time for scripts to query for results, we worry that we’ll encounter unforeseen costs as we have to handle innumerable undocumented variations in the data: heterogeneous data.
- On the other hand, resultsdb is actually quite opinionated about its schema. You must fit the mold. This is good only as long as that schema remains simple.
- We support teams in Red Hat populating ELK instances and using them. However, we want those teams to get that information on to the message bus first and use that feed to populate ELK. We can then consume the same event feed to populate resultsdb. Different storage tools for different purposes. (We can furthermore protect ourselves from future bit-rot in either storage tool if we rely on the bus for our feed abstraction.)
This can be summarized in the following mantra: “ELK is for humans. Resultsdb is for machines.”
Getting data out of resultsdb
Resultsdb, unsurprisingly, stores results. A result must be associated with a testcase, which is just a namespaced name (for example, general.rpmlint). It must also be associated with an item, which you can think of as the unique name of a build artifact produced by some RCM tool: the nevra of an rpm is a typical value for the item field, indicating that a particular result is associated with a particular rpm.
Take a look at some examples of queries to the Fedora QA production instance of Taskotron, to get an idea of what this thing can store:
- A list of known testcases
- Information on the
- All known results for the
dist.depcheck results associated with builds
dist.rpmlint results associated with the
- All results of any testcase associated with that same build
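Queries like the ones listed above come down to GET requests with filter parameters. The following is a minimal sketch of how a script might build such a URL. The base URL is a placeholder, and the `/results` path and the `testcases` and `item` parameter names are assumptions about the resultsdb API that should be checked against the instance you are querying.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the address of a real resultsdb instance.
BASE_URL = "https://resultsdb.example.com/api/v2.0"


def results_url(testcase=None, item=None, base_url=BASE_URL):
    """Build a GET URL that filters results by testcase name and/or item.

    The parameter names ("testcases", "item") are assumptions for this sketch.
    """
    params = {}
    if testcase:
        params["testcases"] = testcase
    if item:
        params["item"] = item
    query = urlencode(params)
    return f"{base_url}/results" + (f"?{query}" if query else "")


# Example: all dist.rpmlint results for one (made-up) build.
print(results_url(testcase="dist.rpmlint", item="example-1.0-1.fc24"))
```

Leaving out `testcase` and passing only `item` would correspond to the last query in the list: all results of any testcase for one build.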
For the release checklist
For the Errata Tool problems described in the introduction, we need to:
- Set up Jenkins jobs that do exactly what the Errata Tool processes do today: rpmgrill, covscan, rpmdiff, TPS/Insanity. Ondrej Hudlicky is working on this. Those jobs need to:
  - Be triggered by appropriate message bus events (build complete, dist-git commit, etc.)
  - Publish to the bus using the CI-Metrics format, driven by Jiri Canderle.
- We need to ingest data from the bus about those jobs, and store that in resultsdb. The Factory 2.0 team will be working on that.
- We also need to write and stand up an accompanying waiverdb service that allows overriding an immutable result in resultsdb.
  - Should have an audit trail to track who waived and when.
  - May need an approval workflow, i.e. a waiver requested by person A then approved or disapproved by person B (with comments about why).
  - May need waivers to be related to a purpose somehow. We may want to waive a result for an advisory, or for a cloud image, or for one product but not another. Some research should go into thinking about how best to do this. Referring to PDC’s product/release keys may be a good candidate here.
- The Errata Tool needs to be modified to refer to resultsdb’s stored results instead of its own.
- We can decommission Errata Tool’s scheduling and storage of QE-esque activities.
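To make the waiver idea concrete, here is a minimal sketch of how a waiver could sit alongside an immutable result rather than modify it. Nothing here reflects an actual waiverdb design: the `Result` and `Waiver` classes, their field names, the "WAIVED" outcome, and the `scope` key (standing in for something like a product/release key) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Result:
    """A stored result; frozen because results are never edited."""
    id: int
    testcase: str
    item: str
    outcome: str  # e.g. "PASSED" or "FAILED"


@dataclass(frozen=True)
class Waiver:
    """A separate record that overrides a result without changing it."""
    result_id: int
    waived_by: str        # audit trail: who waived
    waived_at: datetime   # audit trail: when
    comment: str
    scope: Optional[str] = None  # hypothetical purpose key; None means everywhere


def effective_outcome(result, waivers, scope=None):
    """Return "WAIVED" if a matching waiver applies, else the stored outcome."""
    if result.outcome == "PASSED":
        return result.outcome
    for waiver in waivers:
        if waiver.result_id == result.id and waiver.scope in (None, scope):
            return "WAIVED"
    return result.outcome


failed = Result(1, "dist.rpmdiff", "example-1.0-1.fc24", "FAILED")
waiver = Waiver(1, "alice", datetime(2017, 1, 1, tzinfo=timezone.utc),
                "known false positive", scope="product-a")
print(effective_outcome(failed, [waiver], scope="product-a"))
print(effective_outcome(failed, [waiver], scope="product-b"))
```

Keeping the waiver as its own record means the original result stays available for auditing, and the same failure can be waived for one purpose but not another.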
Note that in Fedora, the Bodhi Updates System already works along these lines to gate updates on their resultsdb status. A subset of testcases is declared as required. However, if a testcase is failing erroneously, a developer must change the requirements associated with the update to get it out the door. This is silly. Writing and deploying something like waiverdb will make that much more straightforward.
Note also that the fedimg tool, used to upload newly composed images to AWS, currently has no gating in place at all. It uploads everything. While talking about how we actually want to introduce gating into its workflow, it was proposed that it should query the cloud-specific test executor called autocloud. Our answer here should be no. Autocloud should store its results in resultsdb, and fedimg should consult resultsdb to know if an image is good or not. This insulates fedimg’s code from the details of autocloud and enables us to more flexibly change out QE methods and tools in the future.
For rebuild automation
For Fedora Modularity, we know we need to build and deploy tools to automate rebuilds. In order to avoid unnecessary rebuilds of Tier 2 and Tier 3 artifacts, we’ll want to first ensure that Tier 1 artifacts are “good”. The rebuild tooling we design will need to:
- Refer to resultsdb to gather testcase results. It should not query test-execution systems directly for the reasons mentioned above.
- Have configurable policy. Resultsdb gives us access to all test results. Do we block rebuilds if one test fails? How do we introduce new experimental tests while not blocking the rebuild process? A constrained subset of the total set of testcases should be used on a per-product/per-component basis to define the rebuild criteria: a policy.
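The policy idea above can be sketched as a function that looks only at a required subset of testcases and ignores everything else. This is an illustration under assumptions, not a proposed implementation: the function name, the shape of the `outcomes` mapping, and the testcase names are all made up for the example.

```python
def evaluate_policy(outcomes, required):
    """Decide whether an artifact is "good" under a policy.

    outcomes: dict mapping testcase name -> latest outcome for the artifact.
    required: set of testcase names the policy cares about.
    Returns (is_good, unsatisfied), where unsatisfied lists required
    testcases that are missing or did not pass.
    """
    unsatisfied = sorted(t for t in required if outcomes.get(t) != "PASSED")
    return (not unsatisfied, unsatisfied)


# Hypothetical outcomes for one Tier 1 artifact.
outcomes = {
    "dist.rpmlint": "PASSED",
    "dist.depcheck": "FAILED",
    "experimental.newcheck": "FAILED",  # new test, not yet in any policy
}

# A policy that requires only rpmlint lets the rebuild proceed.
print(evaluate_policy(outcomes, {"dist.rpmlint"}))

# A stricter policy blocks it and reports why.
print(evaluate_policy(outcomes, {"dist.rpmlint", "dist.depcheck"}))
```

Because experimental tests are simply left out of `required`, they can report into resultsdb without blocking rebuilds. A required testcase with no result at all is treated as unsatisfied, which is one possible choice among several.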
Putting data in resultsdb
- Resultsdb receives new results by way of an HTTP POST.
- In Fedora, the Taskotron system puts results directly into resultsdb.
- Internally, we’ll need a level of indirection due to the social coordination issue described above. Any QE process that wants to have its results stored in resultsdb (and therefore be considered in PnT DevOps rebuild and release processes) will need to publish to the unified message bus or the CI-bus using the CI-Metrics format, driven by Jiri Canderle.
- The Factory 2.0 team will write, deploy and maintain a service that listens for those messages, formats them appropriately, and stores them in resultsdb.
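The listening service described above boils down to two steps: translate a bus message into a result, then POST it. The sketch below shows both without touching the network. The message fields (`job_name`, `status`, `build_nvr`, `job_url`) are invented for illustration and are not the CI-Metrics format; the URL is a placeholder, and the payload layout is an assumption about what resultsdb expects.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; substitute a real resultsdb instance.
RESULTSDB_URL = "https://resultsdb.example.com/api/v2.0/results"


def message_to_result(message):
    """Translate a (hypothetical) bus message into a result payload."""
    return {
        "testcase": {"name": message["job_name"]},
        "outcome": "PASSED" if message["status"] == "SUCCESS" else "FAILED",
        "data": {"item": message["build_nvr"]},
        "ref_url": message.get("job_url"),
    }


def build_post(payload, url=RESULTSDB_URL):
    """Prepare (but do not send) the HTTP POST carrying the result."""
    body = json.dumps(payload).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


# A made-up message, as the listener might receive it from the bus.
message = {
    "job_name": "ci.example.functional",
    "status": "SUCCESS",
    "build_nvr": "example-1.0-1.fc24",
    "job_url": "https://jenkins.example.com/job/example/1/",
}
request = build_post(message_to_result(message))
print(request.get_method(), request.full_url)
```

Sending the prepared request would be a call to `urllib.request.urlopen(request)`. Keeping the translation in its own function means only that function has to change when the message format does.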
What data on the bus?
- For our MVP, the target is to consume the CI-Metrics data feed coming out of Platform QE, but long-term we don’t want to be limited to just Platform.
- The ship-shift initiative out of CI-ops looks like a very promising source of information. They will publish events about Jenkins job completion to the bus, and a ship-shift worker will pick up that event and archive the job metadata and artifacts into elasticsearch and cold storage.
- Observe that the most expensive part of this project will be “herding the cats”, getting all of the owners of all of the Jenkins masters to start publishing events about their jobs.
- We want to drive the resultsdb-updater process using the same data feed produced for ship-shift, which means we will only have to solve that coordination problem once. This further enables us to integrate CI activity across all of the engineering organizations, not just Platform.
- Write up a description of how to translate TAP or xUnit into resultsdb’s expected format.
  - We won’t expect any test runners to actually do this themselves. The Factory 2.0 service that listens on the bus will do it for them. Still, it will be useful to write down here (the request comes from Ari).
  - Tim linked to https://bitbucket.org/fedoraqa/resultsdb_api in a comment above, which is useful here.
- Write about handling results for manual tests. It may make sense for the Errata Tool to gate on those (and show % progress when the gate is closed?) This would take us closer to eliminating the manual handoff from QE to RCM in the release checklist.
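As a starting point for the xUnit translation mentioned above, here is a rough sketch that turns the testcases in an xUnit/JUnit-style XML report into one result per testcase. The output layout, the `namespace` prefix, and the decision to drop skipped tests are assumptions made for this example, not an agreed mapping.

```python
import xml.etree.ElementTree as ET


def xunit_to_results(xml_text, item, namespace):
    """Map each <testcase> in an xUnit report to a result-like dict.

    Testcases with a <failure> or <error> child become FAILED; testcases
    with a <skipped> child are omitted (an arbitrary choice for this sketch).
    """
    root = ET.fromstring(xml_text)
    results = []
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue
        failed = (case.find("failure") is not None
                  or case.find("error") is not None)
        results.append({
            "testcase": {"name": f"{namespace}.{case.get('name')}"},
            "outcome": "FAILED" if failed else "PASSED",
            "data": {"item": item},
        })
    return results


# A small made-up report.
REPORT = """
<testsuite name="example" tests="3">
  <testcase classname="suite" name="install"/>
  <testcase classname="suite" name="upgrade"><failure message="boom"/></testcase>
  <testcase classname="suite" name="remove"><skipped/></testcase>
</testsuite>
"""

for result in xunit_to_results(REPORT, "example-1.0-1.fc24", "ci.example"):
    print(result["testcase"]["name"], result["outcome"])
```

A TAP translation would follow the same pattern, with each `ok` / `not ok` line taking the place of a `<testcase>` element.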