Zuul v3 - What's Coming

tl;dr - At last week's OpenStack PTG, the OpenStack Infra team ran the first Zuul v3 job, so it's time to start getting everybody ready for what's coming.

Don't Panic! Awesome changes are coming, but you are NOT on the hook for rewriting all of your project's gate jobs or anything crazy like that. Now grab a seat by the fire and pour yourself a drink while I spin a yarn about days gone by and days yet to come.

First, some background

The OpenStack Infra team has been hard at work for quite a while on a new version of Zuul (where by 'quite a while' I mean that Jim Blair and I had our first Zuul v3 design whiteboarding session in 2014). As you might be able to guess given the amount of time, there are some big things coming that will have a real and visible impact on the OpenStack community and beyond. Since we have a running Zuul v3 now^[1], it seemed like the time to start getting folks up to speed on what to expect.

There is other deep-dive information on architecture and rationale if you're interested^[2], but for now we'll focus on what's relevant for end users. We're also going to start sending out a bi-weekly "Status of Zuul v3" email to the openstack-dev@lists.openstack.org mailing list ... so stay tuned!

Important Note: This post includes some code snippets, but v3 is still a work in progress. We know of at least one breaking change that is coming to the config format, so please treat this not as a tutorial, but as a conceptual overview. Syntax is subject to change.

The Big Ticket Items

While there are a bunch of changes behind the scenes, there is a reasonably tractable number of user-facing differences:

Self-testing In-Repo Job Config
Ansible Job Content
First-class Multi-node Jobs
Improved Job Reuse
Support for non-OpenStack Code and Node Systems
and Much, Much More

Self-testing In-Repo Job Config

This is probably the biggest deal. There are a lot of OpenStack devs (around 2k in Ocata) and a lot of repositories (1689). There are a lot fewer folks on the project-config-core team, who are the ones who review all of the job config changes (please everyone thank Andreas Jaeger next time you see him). That's not awesome.

Self-testing in-repo job config is awesome.

Many systems out there these days have an in-repo job config system. Travis CI has had it since day one, and Jenkins has recently added support for a Jenkinsfile inside of git repos. With Zuul v3, we'll have it too.

Once we roll out v3 to everyone, as a supplement to jobs defined in our central config repositories, each project will be able to add a .zuul.yaml file to their own repo:


- job:
    name: my_awesome_job
    nodes:
      - name: controller
        label: centos-7

- project:
    name: openstack/awesome_project
    check:
      jobs:
        - my_awesome_job

It's a small file, but there is a lot going on, so let's unpack it.

First we define a job to run. It's named my_awesome_job and it needs one node. That node will be named controller and will be based on the centos-7 base node in nodepool.

In the next section, we say that we want to run that job in the check pipeline, which in OpenStack is defined as the jobs that run when patchsets are proposed.

And it's also self-testing!

Everyone knows the fun game of writing a patch to the test jobs, getting it approved, then hoping it works once it starts running. With Zuul v3 in-repo jobs, if there is a change to job definitions in a proposed patch, that patch will be tested with those changes applied. And since it's Zuul, Depends-On footers are honored as well - so iteration on getting a test job right becomes just like iterating on any other patch or sequence of patches.

Ansible Job Content

The job my_awesome_job isn't very useful if it doesn't define any content. That's done in the repo as well, in playbooks/my_awesome_job.yaml:


- hosts: controller
  tasks:
    - name: Run make tests
      shell: make distcheck

As previously mentioned, the job content is now defined in Ansible rather than using our Jenkins Job Builder tool. This playbook is going to run tasks on a host called controller which you may remember we requested in the job definition. On that host, it will run make distcheck. Pretty much anything you can do in Ansible, you can do in a Zuul job now, and the playbooks should also be re-usable outside of a testing context.

First-class Multi-node Jobs

The previous example was for running a job on a single node. What if you want to do multi-node?


- job:
    name: my_awesome_job
    nodes:
      - name: controller
        label: ubuntu-xenial
      - name: compute
        label: centos-7

- project:
    name: openstack/awesome_project
    check:
      jobs:
        - my_awesome_job

As you may have surmised, nodes is a list, so you can have more than one. Then, since Ansible is naturally multi-node aware, you use that to write the multi-node content:


- hosts: controller
  tasks:
    - name: Install Keystone
      shell: pip install {{ zuul.git_root }}/openstack/keystone
- hosts: compute
  tasks:
    - name: Install Nova
      shell: pip install {{ zuul.git_root }}/openstack/nova
- hosts: all
  tasks:
    - name: Install CloudKitty
      shell: pip install {{ zuul.git_root }}/openstack/cloudkitty

That will install Keystone on controller, Nova on compute and CloudKitty on both.

Improved Job Reuse

In our current system, because of some details about how Jenkins works and the fact that our CI system used to be based on Jenkins, we have a ton of templated jobs that lead both to magically long job names and a bunch of cargo culting of job definitions.

In the new system, a lot of the duplication goes away. So instead of having gate-nova-python27 and gate-swift-python27 and gate-keystone-python27, there will just be a job called "python27" and each of the projects can use it. Similarly, for more complex job content like devstack-gate, since Ansible is a fully-fledged system on its own that was designed for modularity and re-use, we can compose things into roles that take parameters and can be reused without copy/paste.

ssssh! In fact, the python27 job will almost certainly be a job that uses an extremely small playbook that itself uses a role called tox. But also, the tox role, the python27 playbook and the python27 job definition will all be things we define centrally in a standard library of pieces, so as a user of the system you should be able to just choose to run "python27" and not worry about it - unless you want to dig in and learn more.
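To make that concrete, here is a minimal sketch of what such a playbook might look like. The role name tox comes from the description above; the file path playbooks/python27.yaml, the tox_envlist parameter and its py27 value are assumptions made for illustration, and like everything else in this post the details are subject to change.

```yaml
# playbooks/python27.yaml (hypothetical path)
- hosts: all
  roles:
    # "tox" is the centrally defined role mentioned above.
    # tox_envlist is an assumed parameter name, not a confirmed interface.
    - role: tox
      tox_envlist: py27
```

A hypothetical python35 playbook would then differ only in the value it passes to the role, which is the kind of reuse that templated job definitions made awkward.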

Support for non-OpenStack Code and Node Systems

Zuul was originally written to support the OpenStack project, but since then more and more people have become interested in running Zuul. Since we wrote it the first time to solve our problems of extremely massive scale, we didn't put a ton of effort into making it easily consumable elsewhere. That hasn't stopped people, and there are tons of Zuul installations out there ... but that doesn't mean life is easy for those folks. With Zuul v3 we've also been explicitly focused on making it much more easily reconsumable.

Part of supporting friends in other communities means embracing support of tools that OpenStack does not use. The fine folks at GoodData wrote a set of patches adding support for GitHub which they have been using for a while. We'll be landing those, which should allow us to add jobs to the system that check things like "will this pull request to pip break devstack". There is also work from the CentOS community via a tool called linch-pin that we're looking at incorporating into Nodepool that should allow creating build nodes on any system Ansible knows how to talk to. Those features are intended to follow quickly on after we get OpenStack migrated.

What's Next?

Zuul on Zuul
Infra Repos
Job Conversion Script
OpenStack Migration

We currently have a Zuul v3 running against changes to the Zuul repo. We're using it to iterate on job content and other features. There is a change coming to the job definition syntax to allow job dependencies to be a graph instead of just a tree, which will be fairly invasive, so we're keeping the affected surface area small until that's ready.

Once we're happy with how things are running, we'll move the rest of the Infra repos over, probably in chunks. Although Infra test jobs are typically a bit different than the jobs in the rest of OpenStack, we do have enough representative examples that we should be able to work out the kinks before we throw things at other folks. (shade and nodepool both do integration testing on devstack-gate jobs, for instance)

While we work on Infra migration, we'll be developing a conversion script to convert the existing jobs. A good portion of that will be fully automatable. For instance, mapping everyone's gate-{project}-python27 to a reference to the python27 job is easy for a computer to do. However, there's still a ton of snowflake jobs that we'll likely wind up just converting the content of as is and then iterating on refactoring to be more efficient or improved over time.
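As an illustration of the easy case, the conversion for a single project might look roughly like the before and after below. The first document is a simplified entry in the general shape of today's central layout file; the second reuses the provisional v3 syntax from earlier in this post. Both are sketches using openstack/nova as the example, not output from the actual conversion script, which is still to be written.

```yaml
# Before: simplified v2-style entry in the central layout file
projects:
  - name: openstack/nova
    check:
      - gate-nova-python27
---
# After: v3-style project stanza (provisional syntax, may change)
- project:
    name: openstack/nova
    check:
      jobs:
        - python27
```

The project name moves out of the job name and stays only in the project stanza, which is why this kind of rewrite is mechanical.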

Then the Big Day will arrive. When the conversion script is as good as we can get it and we're satisfied with stability of the job language, security and scalability, there will be a Big Bang cutover of all of the rest of OpenStack. If all goes well, most developers should mostly just notice that a bunch of job names got shorter and that it's a user named Zuul commenting on patches. Folks who have patches to project-config in-flight at that time will need to rework patches, but the conversion script should hopefully make that a minimal burden.

and Much, Much More

There are far too many new and exciting things in Zuul v3 to cover in a single post, and many of the topics (such as Ansible Jobs, or Job Inheritance and Reuse) are deep topics we can dive into over time. The long and short of it is that Zuul v3 is coming soon to an OpenStack Infra near you, so expect more and more communication about what that means over the next few months.

Notes

[1] OpenStack is not running Zuul v3 in production at the moment. We have an instance running and only responding to events from the Zuul v3 repo. As of the time of this writing, OpenStack is still running 2.5 in production. Believe me - when we hit production, you'll know it.
[2] Links to deeper information: