Test It Like You Deploy It

Monty Taylor

http://inaugust.com/talks/test-it-like-you-deploy-it.html

twitter: @e_monty

Who am I?

Office of Technology

Zuul

Ansible

Who am I?

Technical Committee

Developer Infrastructure Core Team

Former Foundation Board of Directors

PTL of shade project

What are we going to talk about?

  • Zuul
  • Ansible

What is Zuul?

  • Multi-cloud, scalable, elastic CI/CD engine
  • Validation of speculative future states
  • Single-use VM build nodes - safely run tests that need root
  • Multi-node builds
  • Multi-repo projects
  • Native support for gating configuration

Terminology

  • Periodic: jobs run in response to a timer
  • Post: jobs run after a change
  • Check: jobs run when someone proposes a change
  • Gate: jobs run between change approval and landing
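
As a rough illustration of the four terms above, the sketch below maps an event to the pipeline that would react to it. The event names and the `pipeline_for` helper are hypothetical and made up for this example; they are not Zuul's configuration syntax.

```python
# Hypothetical event names for illustration only; real Zuul pipelines
# are defined in configuration, not in a Python dict.
PIPELINE_FOR_EVENT = {
    "timer": "periodic",         # a timer fired
    "change-merged": "post",     # a change has landed
    "change-proposed": "check",  # someone proposed a change
    "change-approved": "gate",   # approved, but not yet landed
}


def pipeline_for(event: str) -> str:
    """Return the name of the pipeline that runs jobs for the given event."""
    return PIPELINE_FOR_EVENT[event]
```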

Why?

The original OpenStack use case

  • Fully automated gated commits
  • Full end-to-end integration tests from scratch for every commit
  • Massive scale

OpenStack Scale by the numbers

  • 2 KJPH (kilo-jobs / hour)
  • 2376 arbitrary developers
  • 1474 git repositories
  • 11727 Jobs
  • Merge 10k Changes / Month

Ansible has received 13171 PRs (changes), has merged 8190 of them, and has 37788 commits in its entire lifetime.

Multi Repository Speculative Execution

  • Zuul constructs speculative states as-if a change were merged
  • Tests future states without landing those changes first
  • The as-if spans multiple repos
  • In the Gate pipeline, speculative changes are put into a virtual serial queue, then tested in parallel as-if each change combination in front of them had landed
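
The virtual serial queue can be pictured with a minimal sketch, assuming changes are plain strings and the base branch is called `master`. The function name `speculative_states` is hypothetical; this is not Zuul's implementation, only the idea that each change is tested on top of everything queued ahead of it.

```python
from typing import List


def speculative_states(queue: List[str], base: str = "master") -> List[List[str]]:
    """For each change in the gate queue, return the state it is tested
    against: the base branch, every change ahead of it, and the change itself.

    All of these states can be built and tested in parallel, because none
    of them requires an earlier change to have actually landed.
    """
    states = []
    for position in range(len(queue)):
        states.append([base] + queue[: position + 1])
    return states


# Changes may come from different repos; the "as-if" spans all of them.
gate_queue = ["neutron:A", "shade:B", "nova:C"]
for state in speculative_states(gate_queue):
    print(" + ".join(state))
```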

Multi-Repo Dependencies


commit 30039f04109efa2263aba6eb302a29bd8d5e8f53
Author: Monty Taylor
Date:   Mon Sep 26 14:14:15 2016 -0500

    Add simple field for disabled flavors

    When we were equalizing out the old silly names from the new pretty
    ones, we missed disabled.

    Change-Id: I4cbf5f7c27f640c566460c18951ab9030aae84e4
    Depends-On: I523e0ab6e376f5ff6205b1cc1748aa6d546919cb
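
The `Change-Id` and `Depends-On` footers in the commit message above are ordinary text, so they are easy to pull out. The following is a minimal sketch, assuming footers are written one per line as `Key: value`; the `parse_footers` helper is hypothetical and is not how Zuul itself reads them.

```python
import re
from typing import Dict, List

# Match "Change-Id: <id>" or "Depends-On: <id>" on a line of its own,
# allowing the indentation that `git log` adds to commit messages.
FOOTER_RE = re.compile(
    r"^[ \t]*(Change-Id|Depends-On):[ \t]*(\S+)[ \t]*$", re.MULTILINE
)


def parse_footers(message: str) -> Dict[str, List[str]]:
    """Collect Change-Id and Depends-On values from a commit message."""
    footers: Dict[str, List[str]] = {"Change-Id": [], "Depends-On": []}
    for key, value in FOOTER_RE.findall(message):
        footers[key].append(value)
    return footers


MESSAGE = """Add simple field for disabled flavors

    Change-Id: I4cbf5f7c27f640c566460c18951ab9030aae84e4
    Depends-On: I523e0ab6e376f5ff6205b1cc1748aa6d546919cb
"""
```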
      

Multi-Repo Dependencies - an example

  • shade library runs functional tests against openstack
  • neutron change breaks shade tests (more later)
  • neutron fix is proposed
  • shade change Depends-On proposed neutron change
  • shade tests are run as-if neutron change has landed
  • shade change cannot land until neutron change lands
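
The landing rule in the last bullet can be sketched as a simple set check. The change names and the `can_land` helper are hypothetical, chosen only to mirror the shade and neutron example; this is not Zuul's code.

```python
from typing import Dict, Set


def can_land(change: str, depends_on: Dict[str, Set[str]], landed: Set[str]) -> bool:
    """A change may land only after every change it depends on has landed."""
    return depends_on.get(change, set()) <= landed


# The shade change carries a Depends-On pointing at the neutron fix.
DEPENDS_ON = {"shade-change": {"neutron-fix"}}

print(can_land("shade-change", DEPENDS_ON, landed=set()))            # neutron fix not merged yet
print(can_land("shade-change", DEPENDS_ON, landed={"neutron-fix"}))  # neutron fix merged
```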

Isn't this supposed to prevent neutron from breaking shade?

  • shade and neutron do not share a gating relationship
  • shade has, by choice, a test that tests against master of OpenStack
  • Such a test is 'risky' to shade devs
  • shade devs desire the risk - can work around breaks, or submit bugs

Not Specific to OpenStack

  • "Gate" and "Check" are configurations
  • 50+ OpenStack Vendors use Zuul for "3rd Party CI" of drivers
  • BMW and Wikimedia use Zuul

Pluggable

  • Triggers
  • Reporters
  • Node Providers
  • Execution content

Zuul v2

  • In production for OpenStack for 4 years
  • What most people run
  • Triggers: Gerrit, Periodic
  • Reporters: Gerrit, Email, MySQL
  • Node Providers: Elastic OpenStack Nodepool, Static servers
  • Jobs executed by Jenkins

Zuul v2.5

  • In use only by OpenStack (on purpose)
  • Replaced Jenkins with Ansible
  • Jobs still written using JJB - playbooks generated on the fly

http://logs.openstack.org/09/352209/7/check/gate-networking-ovn-python35/0fae86c/_zuul_ansible/

Why Replace Jenkins?

Not for lack of trying

  • OpenStack started on Jenkins (actually, on Hudson, remember that?)
  • We funded the Jenkins JClouds Plugin
  • We did deep dev in the Gerrit Trigger Plugin
  • We maintain the SCP artifact plugin (added console log support)
  • We added a 0mq notification plugin
  • We added the Gearman Worker plugin, which allowed us to grow to 8 Masters/1000 slaves
  • We wrote Jenkins Job Builder

Before I answer that ...

The world is a better place with Jenkins existing

Jenkins Problems

Security

  • don't run WebUI on the internet
  • ssh slave plugin - it's possible for a slave to run arbitrary code on the master

Stability

  • almost every Jenkins upgrade has broken us

Scalability

  • Jenkins has global mutexes, especially in plugins
  • Extra large cloud server could handle ~100 concurrent jobs
  • We ran 8 Jenkins Masters with slaves sharded across them

Overkill

  • we only used it as a remote shell execution engine

We know a better engine for remote execution

Zuul v3

Test things like you deploy them

  • Intended for broad use
  • Jobs written in and executed with Ansible
  • GitHub Support
  • Multi-node build clusters as first class resource
  • Self-testing tests
  • Multi-Tenant
  • In-repo config
  • Fully support Bare Metal, VMs and Containers
  • AWS, GCE, MacStadium and Static Node Provider Support
  • Docker and Kubernetes Build Resources

Ansible execution

Merge the mergers and the launchers

  • Constructs relevant repo states and inventory
  • Push repos to test nodes
  • Execute playbook with inventory
  • Playbook can be defined centrally or in each repo
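
The "inventory, then playbook" steps above can be sketched as follows. This is a minimal illustration, assuming an INI-style Ansible inventory with one `ansible_host` entry per node; the node addresses, file names, and both helper functions are hypothetical, and Zuul's own executor does considerably more than this.

```python
from typing import Dict, List


def build_inventory(nodes: Dict[str, str]) -> str:
    """Render an INI-style Ansible inventory: one line per node name,
    with ansible_host set to that node's address."""
    lines = [
        f"{name} ansible_host={address}"
        for name, address in sorted(nodes.items())
    ]
    return "\n".join(lines) + "\n"


def playbook_command(playbook: str, inventory_path: str) -> List[str]:
    """Build the ansible-playbook command line for a playbook and inventory."""
    return ["ansible-playbook", "-i", inventory_path, playbook]


# Hypothetical addresses from the documentation range 192.0.2.0/24.
print(build_inventory({"controller": "192.0.2.10", "db": "192.0.2.11"}))
print(playbook_command("run.yaml", "inventory.ini"))
```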

In repo config

  • Minimal config at start - other than "where are my repos"
  • 'trusted' config repos
  • config per-repo
  • At start, zuul asks the fleet of merger-launchers to calculate config

Test tests before you land them

  • Config is in git. Zuul does multi-repo git depends.
  • v3 adds in-tree test definitions
  • Speculative state includes speculative state of test jobs
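
One way to picture the last bullet: the job definitions used to test a change are the landed definitions with the change's own edits laid on top. The sketch below assumes job config is a plain dict keyed by job name; the `speculative_jobs` helper and the job names are hypothetical and do not reflect Zuul's actual config model.

```python
from typing import Dict


def speculative_jobs(landed: Dict[str, dict], proposed: Dict[str, dict]) -> Dict[str, dict]:
    """Return the job definitions as if the proposed change had merged.

    The landed config is left untouched, so a bad job definition in a
    proposed change only affects the test run for that change.
    """
    merged = dict(landed)
    merged.update(proposed)
    return merged


landed = {"unit": {"timeout": 600}}
proposed = {"unit": {"timeout": 1200}, "docs": {"timeout": 300}}
print(speculative_jobs(landed, proposed))
```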

First class multi-node support


- job:
    name: ursula
    nodes:
    - name: controller
      image: fedora-26
    - name: db
      image: rhel-7
      

Active node requests

  • zuul requests named resources
  • resources are created/checked out of pool
  • at finish, checked back in
  • Single-use environments delete on check in
  • Multi-use environments become available for subsequent jobs
  • Request non-node resources, such as a Kubernetes cluster
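
The check-out and check-in life cycle above can be sketched with a small in-memory pool. The `Node` and `Pool` classes are hypothetical and exist only for this illustration; they assume that a request with no matching multi-use node creates a fresh single-use one, which is a simplification of what a real node provider does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    label: str
    single_use: bool


class Pool:
    def __init__(self, multi_use_nodes: List[Node]):
        self.available = list(multi_use_nodes)
        self.deleted: List[str] = []
        self._created = 0

    def check_out(self, label: str) -> Node:
        """Hand out a matching multi-use node, or create a single-use one."""
        for node in self.available:
            if node.label == label:
                self.available.remove(node)
                return node
        self._created += 1
        return Node(f"{label}-{self._created}", label, single_use=True)

    def check_in(self, node: Node) -> None:
        """Single-use nodes are deleted; multi-use nodes become available again."""
        if node.single_use:
            self.deleted.append(node.name)
        else:
            self.available.append(node)
```

A job would call `check_out` once per named node it needs, run, and then call `check_in` for each node when it finishes.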

Status

Focus

  • OpenStack next week, and the hard problems that brings
  • The extra-hard cases are handled. So are the simple ones, but Zuul is complex to run if you only have the simple use cases

Next up

  • Get it ready for Ansible project
  • Making it truly suitable for teams other than OpenStack Infra to run
  • Making the easy tasks simple

For More Information

Thank you!

http://inaugust.com/talks/test-it-like-you-deploy-it.html

twitter: @e_monty