Infrastructure/Rethinking staging and development

Infrastructure machines can be classified into three environments depending on their use for developing, testing, and deploying changes. We need to redesign how we manage these three environments as the current method is fraught with problems.

Definitions of the three environments

Production

Production is what our users will see when they come to visit the Fedora Project. The other servers are replications of the production environment to some extent. At times, changes to production will be extremely locked down as the Fedora websites and services need to be in a known stable configuration for serving Fedora releases or other important needs.

Staging

Testing new releases of packages and minor development of apps (changes destined to be hotfixes, etc.) are done on staging. Staging is used both to evaluate changes to an application before they go into production and to evaluate how those changes will affect other services in production.

Development

Major development of the web applications we code for is done in the dev environment, as is testing of new services that we want to deploy. This includes hosts like ask01.dev, app01.dev, and publictest01. By their nature, these boxes have to be open to people that we don't know as well, so that they can experiment with the features they are working on.

Differentiating hosts

Staging hosts have .stg in their domain name.

Development hosts currently either have .dev in their domain name or have a hostname starting with publictest. We are planning on phasing out the publictest machines in favour of .dev machines.

Production hosts are all other hosts.
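The naming rules above can be sketched as a small classifier. This is only an illustration of the convention, not an existing tool: the function name classify_host is hypothetical, and it assumes that .stg and .dev appear as whole labels in the domain part of the name.

```python
def classify_host(fqdn: str) -> str:
    """Return 'staging', 'development', or 'production' for a host name.

    Assumption: .stg and .dev show up as complete dot-separated labels
    after the hostname, e.g. app01.stg.<domain> or ask01.dev.<domain>.
    """
    labels = fqdn.lower().rstrip(".").split(".")
    hostname, domain_labels = labels[0], labels[1:]

    # Staging hosts have .stg in their domain name.
    if "stg" in domain_labels:
        return "staging"

    # Development hosts have .dev in their domain name or a hostname
    # starting with publictest.
    if "dev" in domain_labels or hostname.startswith("publictest"):
        return "development"

    # Production hosts are all other hosts.
    return "production"
```

Checking staging before development means a name that somehow carried both labels would be treated as staging, which is the more restrictive of the two.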

Permissions

Staging hosts mirror the permissions of their production counterparts. However, the machines get their account information from the stg fas server instead of the production fas server. One place where this may come in handy is when we want to let a developer of one of our web apps log in and experiment with a new release on the staging app servers. In that case, they can be added to the sysadmin-web group on the staging instance only. There are limits to this, however. We cannot give developers increased access to the staging fas servers, for instance, as they hold a copy of the data that is used in production and would therefore give people access to the password hashes and to information protected under our privacy policy.

.dev hosts are intended to give developers more freedom. Developers of an application typically have sudo on the boxes that they are using there.

Separation of power

The publictest and app01.dev hosts suffer from having multiple groups with sudo access on each box. We've decided to change things so that each service is given its own host to develop on. That service's group and sysadmin-main will have sudo (and possibly login will be restricted to them as well) while other groups will not. This means that a compromise of the host will not necessarily propagate to other groups. We do have to be mindful that sysadmin-main has wide powers, however, so a compromise of a dev host may still have wide-reaching consequences if those people have logged into the box.

Limit where passwords are typed

The dev hosts should not have passwords typed into them. We want to deny someone easy access to passwords should they compromise the box (note that they'll still be able to get password hashes, just not the plaintext passwords). We can go to password-less sudo now and look at using OTP later (OTP still has the issue of giving an attacker a single valid password that they can intercept and potentially use to do something they wish instead of what the user intended).

Time span

Staging hosts are intended to stay up for long periods. However, they can be rebuilt periodically as everything should be in puppet. Rebuilding on a schedule may help keep the staging hosts in sync with production, as hosts tend to drift while they are being used: extra packages and tools get installed to debug issues, new packages pull in dependencies that are left behind when the packages are later removed, etc.

Development hosts should be rebuilt much more frequently than they currently are. They should also be shut down when not in use, and many of them should be time limited. None of these procedures are currently done well. Perhaps we should have someone in charge of this and try to find ways that relatively new users (not sysadmin-main levels of trust) can create and destroy the guests. Then we can have a group of people who take care of scheduling rebuilds, notifying users, creating and destroying the guests, etc. We could also look at having a cron job that shuts down a guest whenever no one is logged in, and at giving the developers using that guest a tool that can turn the guest back on.
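A minimal sketch of the cron job idea, assuming that "no one logged in" can be judged from the output of the standard who command and that the job runs as root on the guest. The function names and the dry_run flag are hypothetical, and the tool for turning a guest back on is not covered here.

```python
import subprocess


def logged_in_users(who_output: str) -> set:
    """Parse `who` output; the first field on each line is the user name."""
    return {line.split()[0] for line in who_output.splitlines() if line.strip()}


def should_shut_down(who_output: str) -> bool:
    """True when `who` reports no login sessions at all."""
    return not logged_in_users(who_output)


def shutdown_if_idle(dry_run: bool = True) -> bool:
    """Check for login sessions and halt the guest if there are none.

    Returns True when the guest is idle. With dry_run=True (the default)
    nothing is shut down; the decision is only reported to the caller.
    """
    who = subprocess.run(["who"], capture_output=True, text=True, check=True)
    idle = should_shut_down(who.stdout)
    if idle and not dry_run:
        # Assumption: an immediate halt is acceptable for an idle dev guest.
        subprocess.run(["shutdown", "-h", "now"], check=True)
    return idle
```

A cron entry would call shutdown_if_idle(dry_run=False) at whatever interval we settle on. A real job would probably also want a grace period after boot so that a freshly started guest is not halted before its developer has logged in.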

Having guests shut down means that they are not there for an attacker to probe, attack, or use while they are powered down. Currently, an attacker has a long period of time to attempt to break in to a publictest machine that is seeing little use for its intended purpose and then use it for their own purposes. Since the machine is seeing little use, we would likely not notice that the box was compromised for a long time.

Syncing environments

Staging and production need to be synced in several ways. Staging needs to mirror production as closely as possible, so we need to be able to take changes from production and push them into staging. When doing this, there will be some things that must always be different in staging, and we don't want to have to remember those every time. Going the other way, some new features may be implemented in staging over an extended period of time (for instance, migrating app servers to RHEL6). When that happens, we need to keep accumulating changes in staging and then move them over to production later.

Git branches deemed harmful

Our current method is to use separate git branches for staging and production (master). However, this hasn't proven to work that well in practice. We are able to merge from production to staging reasonably well, but cherry-picking changes from staging into master is not supported very well by git. Git has no way to track when we take individual commits that were deployed in staging and apply them to production. To get this effect, we'd have to make a feature branch for each change we make, apply it to staging, and then apply it to production. That is a lot of work when most of our changes are single commits (and something extra to remember, since not all of our services have a staging instance). Not everything in staging is drawn from the staging branch either. We have node definitions, for instance, which will not overlap between staging and production because the staging node has .stg. in its domain name. Having these in both the staging and production branches is confusing.

So we want to get away from using git branches. But what do we switch to?

One branch

We'll have one branch for our definitions. The staging configs and modules will be in a separate hierarchy from the production ones. The staging server definitions may be in the new hierarchy or they may live in the current manifests directory (the nodes won't conflict as they all have .stg in their domain names).

Conditionals and templates

For differences between staging and production that are intended to be permanent, we want to use a conditional on whether something is being used in production or staging. Both puppet templates and puppet manifests support conditionals. For small changes, we can have templates where the conditional selects between an option for stg and an option for production. For larger changes, the puppet manifest can contain a conditional to select a different file on staging as opposed to production. Note that this is somewhat dangerous, as we may tend to put things into the staging version and then forget them when we move the change into production. diff -r won't show the differences because the changes will be in separate files.
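One way to reduce the risk of forgotten changes in separate staging files would be a small helper that pairs each staging variant with its production file and diffs the two. The sketch below assumes a purely hypothetical naming convention in which the staging variant sits next to the production file with .stg appended (for example foo.conf and foo.conf.stg); our repo does not necessarily use that layout, so treat the suffix and the function names as placeholders.

```python
import difflib
from pathlib import Path

# Hypothetical convention: "<name>.stg" is the staging variant of "<name>".
STG_SUFFIX = ".stg"


def staging_pairs(root: Path):
    """Yield (production_file, staging_file) pairs found under root."""
    for stg_file in sorted(root.rglob("*" + STG_SUFFIX)):
        if stg_file.name == STG_SUFFIX:
            continue  # a bare ".stg" file has no production name to pair with
        prod_file = stg_file.with_name(stg_file.name[: -len(STG_SUFFIX)])
        # Staging files without a production counterpart are skipped.
        if stg_file.is_file() and prod_file.is_file():
            yield prod_file, stg_file


def diff_pair(prod_file: Path, stg_file: Path) -> str:
    """Return a unified diff from the production file to its staging variant."""
    prod_lines = prod_file.read_text().splitlines(keepends=True)
    stg_lines = stg_file.read_text().splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            prod_lines, stg_lines, fromfile=str(prod_file), tofile=str(stg_file)
        )
    )
```

Looping over staging_pairs(Path("modules")) and printing diff_pair(prod, stg) for each pair would give a reviewer a single report of every intentional and accidental difference, which is the view diff -r cannot provide when the two versions live in differently named files.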

We need to also look into what we want our combined repo to look like. Right now

Freeze policy

The three environments have different policies about what can be changed at what points in time. This seems to work well at the moment and doesn't need changes.
