December 02, 2020
Dustin Mitchell
Taskcluster's DB (Part 3) - Online Migrations
This is part 3 of a deep-dive into the implementation details of Taskcluster’s backend data stores. If you missed the first two, see part 1 and part 2 for the background, as we’ll jump right in here!
Big Data
A few of the tables holding data for Taskcluster contain tens or hundreds of millions of rows. That’s not what the cool kids mean when they say “Big Data”, but it’s big enough that migrations take a long time. Most changes to Postgres tables take a full lock on that table, preventing other operations from occurring while the change takes place. The duration of the operation depends on lots of factors: not just the data already in the table, but also the kind of other operations going on at the same time.
The usual approach is to schedule a system downtime to perform time-consuming database migrations, and that’s just what we did in July. By running it against a clone of the production database, we determined that we could perform the migration completely in six hours. It turned out to take a lot longer than that. Partly, this was because we missed some things when we shut the system down, and left some concurrent operations running on the database. But by the time we realized that things were moving too slowly, we were near the end of our migration window and had to roll back. The time-consuming migration was version 20 - migrate queue_tasks, and it had been estimated to take about 4.5 hours.
When we rolled back, the DB was at version 19, but the code running the Taskcluster services corresponded to version 12. Happily, we had planned for this situation, and the redefined stored functions described in part 2 bridged the gap with no issues.
Patch-Fix
Our options were limited: scheduling another extended outage would have been difficult. We didn’t solve all of the mysteries of the poor performance, either, so we weren’t confident in our prediction of the time required.
The path we chose was to perform an “online migration”. I wrote a custom migration script to accomplish this. Let’s look at how that worked.
The goal of the migration was to rewrite the queue_task_entities table into a tasks table, with a few hundred million rows.
The idea with the online migration was to create an empty tasks table (a very quick operation), then rewrite the stored functions to write to tasks while reading from both tables. Then a background task can move rows from the queue_task_entities table to the tasks table without blocking concurrent operations. Once the old table is empty, it can be removed and the stored functions rewritten to address only the tasks table.
A few things made this easier than it might have been.
Taskcluster’s tasks have a deadline after which they become immutable, typically within one week of the task’s creation.
That means that the task mutation functions can change the task in-place in whichever table they find it in.
The background task only moves tasks with deadlines in the past.
This eliminates any concerns about data corruption if a row is migrated while it is being modified.
A look at the script linked above shows that there were some complicating factors, too – notably, two more tables to manage – but those factors didn’t change the structure of the migration.
With this in place, we ran the replacement migration script, creating the new tables and updating the stored functions. Then a one-off JS script drove migration of post-deadline tasks with a rough ETA calculation. We figured this script would run for about a week, but in fact it was done in just a few days. Finally, we cleaned up the temporary functions, leaving the DB in precisely the state that the original migration script would have generated.
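For illustration, here is a minimal sketch of what such a driver script could look like. The runBatch and countRemaining callbacks are hypothetical stand-ins for the actual stored-function calls; the real script differs in its details.

// Minimal sketch of a one-off migration driver with a rough ETA calculation.
// `runBatch` and `countRemaining` are hypothetical callbacks: the first moves
// up to `size` post-deadline rows into the new table and resolves with the
// number it moved; the second counts the rows still left in the old table.
async function driveMigration({ runBatch, countRemaining, size = 1000 }) {
  const total = await countRemaining();
  const started = Date.now();
  let moved = 0;

  for (;;) {
    const count = await runBatch(size);
    if (count === 0) {
      break;                                  // nothing left with a past deadline
    }
    moved += count;
    const msPerRow = (Date.now() - started) / moved;
    const etaHours = ((total - moved) * msPerRow) / (1000 * 60 * 60);
    console.log(`moved ${moved} of ~${total} rows; rough ETA ${etaHours.toFixed(1)}h`);
  }
}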
Supported Online Migrations
After this experience, we knew we would run into future situations where a “regular” migration would be too slow. Apart from that, we want users to be able to deploy Taskcluster without scheduling downtimes: requiring downtimes will encourage users to stay at old versions, missing features and bugfixes and increasing our maintenance burden.
We devised a system to support online migrations in any migration.
Its structure is pretty simple: after each migration script is complete, the harness that handles migrations calls a _batch stored function repeatedly until it signals that it is complete.
This process can be interrupted and restarted as necessary.
The “cleanup” portion (dropping unnecessary tables or columns and updating stored functions) must be performed in a subsequent DB version.
The harness is careful to call the previous version’s online-migration function before it starts a version’s upgrade, to ensure it is complete. As with the old “quick” migrations, all of this is also supported in reverse to perform a downgrade.
The _batch functions are passed a state parameter that they can use as a bookmark. For example, a migration of the tasks might store the last taskId that it migrated in its state. Then each batch can begin with select .. where task_id > last_task_id, allowing Postgres to use the index to quickly find the next task to be migrated.
When the _batch function indicates that it processed zero rows, the handler calls an _is_completed function. If this function returns false, then the whole process starts over with an empty state.
This is useful for tables where some rows may have been skipped during the migration, such as tasks with deadlines in the future.
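As a rough sketch (not the actual harness code), the control loop looks something like the following, where callBatch and callIsCompleted stand in for invoking the version’s _batch and _is_completed stored functions.

// Sketch of the harness control loop for an online migration. `callBatch`
// and `callIsCompleted` are stand-ins for invoking the DB version's _batch
// and _is_completed stored functions; this is not the real harness code.
async function runOnlineMigration({ callBatch, callIsCompleted, batchSize = 1000 }) {
  for (;;) {
    let state = {};                            // the bookmark passed between batches
    for (;;) {
      const { count, state: nextState } = await callBatch(batchSize, state);
      if (count === 0) {
        break;                                 // this pass over the table is done
      }
      state = nextState;                       // e.g. { last_task_id: '...' }
    }
    if (await callIsCompleted()) {
      return;                                  // nothing was skipped; we are finished
    }
    // otherwise start over with an empty state to pick up rows skipped earlier
  }
}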
Testing
An experienced engineer is, at this point, boggling at the number of ways this could go wrong! There are lots of points at which a migration might fail or be interrupted, and the operators might then begin a downgrade. Perhaps that downgrade is then interrupted, and the migration re-started! A stressful moment like this is the last time anyone wants surprises, but these are precisely the circumstances that are easily forgotten in testing.
To address this, and to make such testing easier, we developed a test framework that defines a suite of tests for all manner of circumstances. In each case, it uses callbacks to verify proper functionality at every step of the way. It tests both the “happy path” of a successful migration and the “unhappy paths” involving failed migrations and downgrades.
In Practice
The impetus to actually implement support for online migrations came from some work that Alex Lopez has been doing to change the representation of worker pools in the queue.
This requires rewriting the tasks table to transform the provisioner_id and worker_type columns into a single, slash-separated task_queue_id column.
The pull request is still in progress as I write this, but already serves as a great practical example of an online migration (and online downgrade, and tests).
Summary
As we’ve seen in this three-part series, Taskcluster’s data backend has undergone a radical transformation this year, from a relatively simple NoSQL service to a full Postgres database with sophisticated support for ongoing changes to the structure of that DB.
In some respects, Taskcluster is no different from countless other web services abstracting over a data-storage backend. Indeed, Django provides robust support for database migrations, as do many other application frameworks. One factor that sets Taskcluster apart is that it is a “shipped” product, with semantically-versioned releases which users can deploy on their own schedule. Unlike for a typical web application, we – the software engineers – are not “around” for the deployment process, aside from the Mozilla deployments. So, we must make sure that the migrations are well-tested and will work properly in a variety of circumstances.
We did all of this with minimal downtime and no data loss or corruption. This involved thousands of lines of new code written, tested, and reviewed; a new language (SQL) for most of us; and lots of close work with the Cloud Operations team to perform dry runs, evaluate performance, and debug issues. It couldn’t have happened without the hard work and close collaboration of the whole Taskcluster team. Thanks to the team, and thanks to you for reading this short series!
October 30, 2020
Dustin Mitchell
Taskcluster's DB (Part 2) - DB Migrations
This is part 2 of a deep-dive into the implementation details of Taskcluster’s backend data stores. Check out part 1 for the background, as we’ll jump right in here!
Azure in Postgres
As of the end of April, we had all of our data in a Postgres database, but the data was pretty ugly. For example, here’s a record of a worker as recorded by worker-manager:
partition_key | testing!2Fstatic-workers
row_key | cc!2Fdd~ee!2Fff
value | {
"state": "requested",
"RowKey": "cc!2Fdd~ee!2Fff",
"created": "2015-12-17T03:24:00.000Z",
"expires": "3020-12-17T03:24:00.000Z",
"capacity": 2,
"workerId": "ee/ff",
"providerId": "updated",
"lastChecked": "2017-12-17T03:24:00.000Z",
"workerGroup": "cc/dd",
"PartitionKey": "testing!2Fstatic-workers",
"lastModified": "2016-12-17T03:24:00.000Z",
"workerPoolId": "testing/static-workers",
"__buf0_providerData": "eyJzdGF0aWMiOiJ0ZXN0ZGF0YSJ9Cg==",
"__bufchunks_providerData": 1
}
version | 1
etag | 0f6e355c-0e7c-4fe5-85e3-e145ac4a4c6c
To reap the goodness of a relational database, this data should be stored in a “normal”[*] table: distinct columns, nice data types, and a lot less redundancy.
All access to this data is via some Azure-shaped stored functions, which are also not amenable to the kinds of flexible data access we need:
- <tableName>_load - load a single row
- <tableName>_create - create a new row
- <tableName>_remove - remove a row
- <tableName>_modify - modify a row
- <tableName>_scan - return some or all rows in the table
[*] In the normal sense of the word – we did not attempt to apply database normalization.
Database Migrations
So the next step, which we dubbed “phase 2”, was to migrate this schema to one more appropriate to the structure of the data.
The typical approach is to use database migrations for this kind of work, and there are lots of tools for the purpose. For example, Alembic and Django both provide robust support for database migrations – but they are both in Python.
The only mature JS tool is knex, and after some analysis we determined that it both lacked features we needed and brought a lot of additional features that would complicate our usage. It is primarily a “query builder”, with basic support for migrations. Because we target Postgres directly, and because of how we use stored functions, a query builder is not useful. And the migration support in knex, while effective, does not support the more sophisticated approaches to avoiding downtime outlined below.
We elected to roll our own tool, allowing us to get exactly the behavior we wanted.
Migration Scripts
Taskcluster defines a sequence of numbered database versions. Each version corresponds to a specific database schema, which includes the structure of the database tables as well as stored functions. The YAML file for each version specifies a script to upgrade from the previous version, and a script to downgrade back to that version. For example, an upgrade script might add a new column to a table, with the corresponding downgrade dropping that column.
version: 29
migrationScript: |-
  begin
    alter table secrets add column last_used timestamptz;
  end
downgradeScript: |-
  begin
    alter table secrets drop column last_used;
  end
So far, this is a pretty normal approach to migrations. However, a major drawback is that it requires careful coordination around the timing of the migration and deployment of the corresponding code. Continuing the example of adding a new column, if the migration is deployed first, then the existing code may execute INSERT queries that omit the new column. If the new code is deployed first, then it will attempt to read a column that does not yet exist.
There are workarounds for these issues: in this example, adding a default value for the new column in the migration, or writing the queries such that they are robust to a missing column. Such queries are typically spread around the codebase, though, and it can be difficult to ensure (by testing, of course) that they all operate correctly.
In practice, most uses of database migrations are in continuously-deployed applications – a single website or application server, where the developers of the application control the timing of deployments. That allows a great deal of control, and changes can be spread out over several migrations that occur in rapid succession.
Taskcluster is not continuously deployed – it is released in distinct versions which users can deploy on their own cadence. So we need a way to run migrations when upgrading to a new Taskcluster release, without breaking running services.
Stored Functions
Part 1 mentioned that all access to data is via stored functions. This is the critical point of abstraction that allows smooth migrations, because stored functions can be changed at runtime.
Each database version specifies definitions for stored functions, either introducing new functions or replacing the implementation of existing functions.
So the version: 29 YAML above might continue with
methods:
  create_secret:
    args: name text, value jsonb
    returns: ''
    body: |-
      begin
        insert
        into secrets (name, value, last_used)
        values (name, value, now());
      end
  get_secret:
    args: name text
    returns: record
    body: |-
      begin
        update secrets
        set last_used = now()
        where secrets.name = get_secret.name;
        return query
        select name, value from secrets
        where secrets.name = get_secret.name;
      end
This redefines two existing functions to operate properly against the new table.
The functions are redefined in the same database transaction as the migrationScript above, meaning that any calls to create_secret or get_secret will immediately begin populating the new column.
A critical rule (enforced in code) is that the arguments and return type of a function cannot be changed.
To support new code that references the last_used value, we add a new function:
  get_secret_with_last_used:
    args: name text
    returns: record
    body: |-
      begin
        update secrets
        set last_used = now()
        where secrets.name = get_secret_with_last_used.name;
        return query
        select name, value, last_used from secrets
        where secrets.name = get_secret_with_last_used.name;
      end
Another critical rule is that DB migrations must be applied fully before the corresponding version of the JS code is deployed.
In this case, that means code that uses get_secret_with_last_used is deployed only after the function is defined.
All of this can be thoroughly tested in isolation from the rest of the Taskcluster code, both unit tests for the functions and integration tests for the upgrade and downgrade scripts. Unit tests for redefined functions should continue to pass, unchanged, providing an easy-to-verify compatibility check.
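To make the deployment-ordering rule concrete, here is a hedged sketch of how new service code might call the new function. The db.fns calling convention and the error handling are assumptions for illustration, not the service’s actual code.

// Hedged illustration only: assumes a `db.fns` convention for invoking stored
// functions, and that the function returns zero or one rows.
async function getSecretWithLastUsed(db, name) {
  const rows = await db.fns.get_secret_with_last_used(name);
  if (rows.length === 0) {
    throw new Error(`no such secret: ${name}`);
  }
  const [{ value, last_used }] = rows;
  return { value, lastUsed: last_used };
}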
Phase 2 Migrations
The migrations from Azure-style tables to normal tables are, as you might guess, a lot more complex than this simple example. Among the issues we faced:
- Azure-entities uses a multi-field base64 encoding for many data types that must be decoded (such as __buf0_providerData / __bufchunks_providerData in the example above)
- Partition and row keys are encoded using a custom variant of urlencoding that is remarkably difficult to implement in pl/pgsql
- Some columns (such as secret values) are encrypted.
- Postgres generates slightly different ISO8601 timestamps from JS’s Date.toJSON()
We split the work of performing these migrations across the members of the Taskcluster team, supporting each other through the tricky bits, in a rather long but ultimately successful “Postgres Phase 2” sprint.
0042 - secrets phase 2
Let’s look at one of the simpler examples: the secrets service.
The migration script creates a new secrets table from the data in the secrets_entities table, using Postgres’s JSON function to unpack the value column into “normal” columns.
The database version YAML file carefully redefines the Azure-compatible DB functions to access the new secrets table. This involves unpacking function arguments from their JSON formats, re-packing JSON blobs for return values, and even some light parsing of the condition string for the secrets_entities_scan function.
It then defines new stored functions for direct access to the normal table. These functions are typically similar, and more specific to the needs of the service. For example, the secrets service only modifies secrets in an “upsert” operation that replaces any existing secret of the same name.
Step By Step
To achieve an extra layer of confidence in our work, we landed all of the phase-2 PRs in two steps. The first step included migration and downgrade scripts and the redefined stored functions, as well as tests for those. But critically, this step did not modify the service using the table (the secrets service in this case). So the unit tests for that service use the redefined stored functions, acting as a kind of integration-test for their implementations. This also validates that the service will continue to run in production between the time the database migration is run and the time the new code is deployed. We landed this step on GitHub in such a way that reviewers could see a green check-mark on the step-1 commit.
In the second step, we added the new, purpose-specific stored functions and rewrote the service to use them. In services like secrets, this was a simple change, but some other services saw more substantial rewrites due to more complex requirements.
Deprecation
Naturally, we can’t continue to support old functions indefinitely: eventually they would be prohibitively complex or simply impossible to implement. Another deployment rule provides a critical escape from this trap: Taskcluster must be upgraded at most one major version at a time (e.g., 36.x to 37.x). That provides a limited window of development time during which we must maintain compatibility.
Defining that window is surprisingly tricky, but essentially it’s two major revisions. Like the software engineers we are, we packaged up that tricky computation in a script, and include the lifetimes in some generated documentation.
What’s Next?
This post has hinted at some of the complexity of “phase 2”. There are lots of details omitted, of course!
But there’s one major detail that got us in a bit of trouble.
In fact, we were forced to roll back during a planned migration – not an engineer’s happiest moment.
The queue_tasks_entities and queue_artifacts_entities tables were just too large to migrate in any reasonable amount of time.
Part 3 will describe what happened, how we fixed the issue, and what we’re doing to avoid having the same issue again.
October 28, 2020
Dustin Mitchell
Taskcluster's DB (Part 1) - Azure to Postgres
This is a deep-dive into some of the implementation details of Taskcluster. Taskcluster is a platform for building continuous integration, continuous deployment, and software-release processes. It’s an open source project that began life at Mozilla, supporting the Firefox build, test, and release systems.
The Taskcluster “services” are a collection of microservices that handle distinct tasks: the queue coordinates tasks; the worker-manager creates and manages workers to execute tasks; the auth service authenticates API requests; and so on.
Azure Storage Tables to Postgres
Until April 2020, Taskcluster stored its data in Azure Storage tables, a simple NoSQL-style service similar to AWS’s DynamoDB. Briefly, each Azure table is a list of JSON objects with a single primary key composed of a partition key and a row key. Lookups by primary key are fast and parallelize well, but scans of an entire table are extremely slow and subject to API rate limits. Taskcluster was carefully designed within these constraints, but that meant that some useful operations, such as listing tasks by their task queue ID, were simply not supported. Switching to a fully-relational datastore would enable such operations, while easing deployment of the system for organizations that do not use Azure.
Always Be Migratin’
In April, we migrated the existing deployments of Taskcluster (at that time all within Mozilla) to Postgres. This was a “forklift migration”, in the sense that we moved the data directly into Postgres with minimal modification. Each Azure Storage table was imported into a single Postgres table of the same name, with a fixed structure:
create table queue_tasks_entities(
  partition_key text,
  row_key text,
  value jsonb not null,
  version integer not null,
  etag uuid default public.gen_random_uuid()
);
alter table queue_tasks_entities add primary key (partition_key, row_key);
The importer we used was specially tuned to accomplish this import in a reasonable amount of time (hours). For each known deployment, we scheduled a downtime to perform this migration, after extensive performance testing on development copies.
We considered options to support a downtime-free migration. For example, we could have built an adapter that would read from Postgres and Azure, but write to Postgres. This adapter could support production use of the service while a background process copied data from Azure to Postgres.
This option would have been very complex, especially in supporting some of the atomicity and ordering guarantees that the Taskcluster API relies on. Failures would likely lead to data corruption and a downtime much longer than the simpler, planned downtime. So, we opted for the simpler, planned migration. (we’ll revisit the idea of online migrations in part 3)
The database for Firefox CI occupied about 350GB. The other deployments, such as the community deployment, were much smaller.
Database Interface
All access to Azure Storage tables had been via the azure-entities library, with a limited and very regular interface (hence the _entities suffix on the Postgres table name).
We wrote an implementation of the same interface, but with a Postgres backend, in taskcluster-lib-entities.
The result was that none of the code in the Taskcluster microservices changed.
Not changing code is a great way to avoid introducing new bugs!
It also limited the complexity of this change: we only had to deeply understand the semantics of azure-entities, and not the details of how the queue service handles tasks.
Stored Functions
As the taskcluster-lib-entities README indicates, access to each table is via five stored database functions:
- <tableName>_load - load a single row
- <tableName>_create - create a new row
- <tableName>_remove - remove a row
- <tableName>_modify - modify a row
- <tableName>_scan - return some or all rows in the table
Stored functions are functions defined in the database itself, that can be redefined within a transaction. Part 2 will get into why we made this choice.
Optimistic Concurrency
The modify function is an interesting case. Azure Storage has no notion of a “transaction”, so the azure-entities library uses an optimistic-concurrency approach to implement atomic updates to rows. This uses the etag column, which changes to a new value on every update, to detect and retry concurrent modifications.
While Postgres can do much better, we replicated this behavior in taskcluster-lib-entities, again to limit the changes made and avoid introducing new bugs.
A modification looks like this in Javascript:
await task.modify(task => {
  if (task.status !== 'running') {
    task.status = 'running';
    task.started = now();
  }
});
For those not familiar with JS notation, this is calling the modify method on a task, passing a modifier function which, given a task, modifies that task. The modify method calls the modifier and tries to write the updated row to the database, conditioned on the etag still having the value it did when the task was loaded. If the etag does not match, modify re-loads the row to get the new etag, and tries again until it succeeds.
The effect is that updates to the row occur one-at-a-time.
This approach is “optimistic” in the sense that it assumes no conflicts, and does extra work (retrying the modification) only in the unusual case that a conflict occurs.
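A rough sketch of this retry loop is below. It is not the actual taskcluster-lib-entities implementation; loadRow and updateRowIfEtag are hypothetical stand-ins for the underlying stored-function calls.

// Sketch of the optimistic-concurrency retry loop described above; `loadRow`
// and `updateRowIfEtag` are hypothetical stand-ins for the underlying
// *_load and *_modify stored-function calls.
async function modifyWithRetry({ loadRow, updateRowIfEtag, modifier }) {
  for (;;) {
    const { value, etag } = await loadRow();   // current row plus its etag
    const modified = { ...value };
    modifier(modified);                        // the caller mutates the copy in place
    if (await updateRowIfEtag(modified, etag)) {
      return modified;                         // write applied; the etag still matched
    }
    // etag mismatch: another writer got there first, so reload and try again
  }
}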
What’s Next?
At this point, we had fork-lifted Azure tables into Postgres and no longer required an Azure account to run Taskcluster. However, we hadn’t yet seen any of the benefits of a relational database:
- data fields were still trapped in a JSON object (in fact, some kinds of data were hidden in base64-encoded blobs)
- each table still only had a single primary key, and queries by any other field would still be prohibitively slow
- joins between tables would also be prohibitively slow
Part 2 of this series of articles will describe how we addressed these issues. Then part 3 will get into the details of performing large-scale database migrations without downtime.
August 24, 2020
Pete Moore
ZX Spectrum +4 - kind of
The ZX Spectrum +2A was my first computer, and I really loved it. On it, I learned to program (badly), and learned how computers worked (or so I thought). I started writing my first computer programs in Sinclair BASIC, and tinkered a little with Z80 machine code (I didn’t have an assembler, but I did have a book that documented the opcodes and operand encodings for each of the Z80 assembly mnemonics).
Fast forward 25 years, and I found myself middle aged, having spent my career thus far as a programmer, but never writing any assembly (let alone machine code), and having lost touch with the low level computing techniques that attracted me to programming in the first place.
So I decided to change that, and start a new project. My idea was to adapt the original Spectrum 128K ROM from Z80 machine code to 64-bit ARM assembly, running on the Raspberry Pi 3B (which I happened to own). The idea was not to create an emulator (there are plenty of those around), but instead to create a new “operating system” (probably “monitor program” would be a more accurate term) that had roughly the same feature set as the original Spectrum computers, but designed to run on modern hardware, at faster speeds, with higher resolution, more sophisticated sound, etc.
What I loved about the original Spectrum was the ease with which you could learn to program, and the simplicity of the platform. You did not need to study computer science for 30 years to understand it. That isn’t true of modern operating systems; they are much more complex. I wanted to create something simple and intuitive again, that would provide a similar computing experience. Said another way, I wanted to create something with the sophistication of the original Spectrum, but that would run on modern hardware. Since it was meant to be an evolution of the Spectrum, I decided to call it the ZX Spectrum +4 (since the ZX Spectrum +3 was the last Spectrum that was ever made).
Well, it is a work-in-progress, and has been a lot of fun to write so far. Please feel free to get involved with the project, and leave a comment, or open an issue or pull request against the repository. I think I have a fair bit of work to do, but it is doable. The original Spectrum ROMs were 16KB each, so there is 32KB of Z80 machine code and tables to translate, but given that instructions are variable length (typically 1, 2, or 3 bytes) there are probably something like 15,000 instructions to translate, which could be a year or two of hobby time, given my other commitments. Or less, if other people get involved! :-)
The github repo can be found here.
Open Source Alternative to ntrights.exe
If you wish to modify LSA policy programmatically on Windows, the ntrights.exe utility from the Windows 2000 Server Resource Kit may help you.
But if you need to ship a product that uses it, you may wish to consider an
open source tool to avoid any licensing issues.
Needing to do a similar thing myself, I’ve written the ntr open source utility for this purpose. It contains both a standalone executable (like ntrights.exe) and a Go library interface.
The project is on github here.
I hope you find it useful!
May 06, 2020
Dustin Mitchell
Debugging Docker Connection Reset by Peer
(this post is co-written with @imbstack and cross-posted on his blog)
Symptoms
At the end of January this year the Taskcluster team was alerted to networking issues in a user’s tasks. The first report involved ETIMEDOUT but later on it became clear that the more frequent issue involved ECONNRESET in the middle of downloading artifacts necessary to run the tests in the tasks. It seemed it was only occurring on downloads from Google (https://dl.google.com) on our workers running in GCP, and only with relatively large artifacts. This led us to initially blame some bit of infrastructure outside of Taskcluster but eventually we found the issue to be with how Docker was handling networking on our worker machines.
Investigation
The initial stages of the investigation were focused on exploring possible causes of the error and on finding a way to reproduce the error.
Investigation of an intermittent error in a high-volume system like this is slow and difficult work. It’s difficult to know if an intervention fixed the issue just because the error does not recur. And it’s difficult to know if an intervention did not fix the issue, as “Connection reset by peer” can be due to transient network hiccups. It’s also difficult to gather data from production systems as the quantity of data per failure is unmanageably high.
We explored a few possible causes of the issue, all of which turned out to be dead ends.
- Rate Limiting or Abuse Prevention - The TC team has seen cases where downloads from compute clouds were limited as a form of abuse prevention. Like many CI processes, the WPT jobs download Chrome on every run, and it’s possible that a series of back-to-back tasks on the same worker could appear malicious to an abuse-prevention device.
- Outages of the download server - This was unlikely, given Google’s operational standards, but worth exploring since the issues seemed limited to dl.google.com.
- Exhaustion of Cloud NAT addresses - Resource exhaustion in the compute cloud might have been related. This was easily ruled out with the observation that workers are not using Cloud NAT.
At the same time, several of us were working on reproducing the issue in more controlled circumstances. This began with interactive sessions on Taskcluster workers, and soon progressed to a script that reproduced the issue easily on a GCP instance running the VM image used to run workers. An important observation here was that the issue only reproduced inside of a docker container: downloads from the host worked just fine. This seemed to affect all docker images, not just the image used in WPT jobs.
At this point, we were able to use Taskcluster itself to reproduce the issue at scale, creating a task group of identical tasks running the reproduction recipe. The “completed” tasks in that group are the successful reproductions.
Armed with quick, reliable reproduction, we were able to start capturing dumps of the network traffic. From these, we learned that the downloads were failing mid-download (tens of MB into a ~65MB file). We were also able to confirm that the error is, indeed, a TCP RST segment from the peer.
Searches for similar issues around this time found a blog post entitled “Fix a random network Connection Reset issue in Docker/Kubernetes”, which matched our issue in many respects. It’s a long read, but the summary is that conntrack, which is responsible for maintaining NAT tables in the Linux kernel, sometimes gets mixed up and labels a valid packet as INVALID. The default configuration of iptables forwarding rules is to ignore INVALID packets, meaning that they fall through to the default ACCEPT for the FILTER table. Since the port is not open on the host, the host replies with an RST segment. Docker containers use NAT to translate between the IP of the container and the IP of the host, so this would explain why the issue only occurs in a Docker container.
We were, indeed, seeing INVALID packets as revealed by conntrack -S, but there were some differences from our situation, so we continued investigating. In particular, in the blog post the connection errors were seen in the opposite direction, and involved a local server for which the author had added some explicit firewall rules.
Since we hypothesized that NAT was involved, we captured packet traces both inside the Docker container and on the host interface, and combined the two. The results were pretty interesting! In the dump output below, 74.125.195.136 is dl.google.com, 10.138.0.12 is the host IP, and 172.17.0.2 is the container IP. 10.138.0.12 is a private IP, suggesting that there is an additional layer of NAT going on between the host IP and the Internet, but this was not the issue.
A “normal” data segment looks like
22:26:19.414064 ethertype IPv4 (0x0800), length 26820: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752
22:26:19.414076 ethertype IPv4 (0x0800), length 26818: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752
here the first line is outside the container and the second line is inside the container; the SNAT translation has rewritten the host IP to the container IP. The sequence numbers give the range of bytes in the segment, as an offset from the initial sequence number, so we are almost 34MB into the download (from a total of about 65MB) at this point.
We began by looking at the end of the connection, when it failed.
A
22:26:19.414064 ethertype IPv4 (0x0800), length 26820: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752
22:26:19.414076 ethertype IPv4 (0x0800), length 26818: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752
B
22:26:19.414077 ethertype IPv4 (0x0800), length 2884: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 34355910:34358726, ack 789, win 265, options [nop,nop,TS val 2940395383 ecr 3057320821], length 2816
C
22:26:19.414091 ethertype IPv4 (0x0800), length 56: 10.138.0.12.60790 > 74.125.195.136.https: Flags [R], seq 821696165, win 0, length 0
...
X
22:26:19.416605 ethertype IPv4 (0x0800), length 66: 172.17.0.2.60790 > 74.125.195.136.https: Flags [.], ack 35731526, win 1408, options [nop,nop,TS val 3057320829 ecr 2940395388], length 0
22:26:19.416626 ethertype IPv4 (0x0800), length 68: 10.138.0.12.60790 > 74.125.195.136.https: Flags [.], ack 35731526, win 1408, options [nop,nop,TS val 3057320829 ecr 2940395388], length 0
Y
22:26:19.416715 ethertype IPv4 (0x0800), length 56: 74.125.195.136.https > 10.138.0.12.60790: Flags [R], seq 3900322453, win 0, length 0
22:26:19.416735 ethertype IPv4 (0x0800), length 54: 74.125.195.136.https > 172.17.0.2.60790: Flags [R], seq 3900322453, win 0, length 0
Segment (A) is a normal data segment, forwarded to the container.
But (B) has a much lower sequence number, about 1MB earlier in the stream, and it is not forwarded to the docker container.
Notably, (B) is also about 1/10 the size of the normal data segments – we never figured out why that is the case.
Instead, we see an RST segment (C) sent back to dl.google.com.
This situation repeats a few times: normal segment forwarded, late segment dropped, RST segment sent to peer.
Finally, the docker container sends an ACK segment (X) for the segments it has received so far, and this is answered by an RST segment (Y) from the peer, and that RST segment is forwarded to the container. This final RST segment is reasonable from the peer’s perspective: we have already reset its connection, so by the time it gets (X) the connection has been destroyed. But this is the first the container has heard of any trouble on the connection, so it fails with “Connection reset by peer”.
So it seems that the low-sequence-number segments are being flagged as INVALID by conntrack and causing it to send RST segments. That’s a little surprising – why is conntrack paying attention to sequence numbers at all? From this article it appears this is a security measure, helping to protect sockets behind the NAT from various attacks on TCP.
The second surprise here is that such late TCP segments are present. Scrolling back through the dump output, there are many such packets – enough that manually labeling them is infeasible. However, graphing the sequence numbers shows a clear pattern:
Note that this covers only the last 16ms of the connection (the horizontal axis is in seconds), carrying about 200MB of data (the vertical axis is sequence numbers, indicating bytes). The “fork” in the pattern shows a split between the up-to-date segments, which seem to accelerate, and the delayed segments. The delayed segments are only slightly delayed - 2-3ms. But a spot-check of a few sequence ranges in the dump shows that they had already been retransmitted by the time they were delivered. When such late segments were not dropped by conntrack, the receiver replied to them with what’s known as a duplicate ACK, a form of selective ACK that says “I have received that segment, and in fact I’ve received many segments since then.”
Our best guess here is that some network intermediary has added a slight delay to some packets. But since the RTT on this connection is so short, that delay is relatively huge and puts the delayed packets outside of the window where conntrack is willing to accept them. That helps explain why other downloads, from hosts outside of the Google infrastructure, do not see this issue: either they do not traverse the intermediary delaying these packets, or the RTT is long enough that a few ms is not enough to result in packets being marked INVALID.
Resolution
After we posted these results in the issue, our users realized these symptoms looked a lot like a Moby libnetwork bug. We adopted a workaround mentioned there, where we use conntrack to drop invalid packets in iptables rather than trigger RSTs:
iptables -I INPUT -m conntrack --ctstate INVALID -j DROP
The drawbacks of that approach listed in the bug are acceptable for our uses. After baking a new machine image, we tried to reproduce the issue at scale as we had done during the debugging of this issue and were not able to. We updated all of our worker pools to use this image the next day and it seems like we’re now in the clear.
Security Implications
As we uncovered this behavior, there was some concern among the team that this represented a security issue. When conntrack marks a packet as INVALID and it is handled on the host, it’s possible that the same port on the host is in use, and the packet could be treated as part of that connection. However, TCP identifies connections with a “four-tuple” of source IP and port + destination IP and port, and the tuples cannot match: if they did, the remote end would have been unable to distinguish the connection “through” the NAT from the connection terminating on the host. So there is no issue of confusion between connections here.
However, there is the possibility of a denial of service. If an attacker can guess the four-tuple for an existing connection and forge an INVALID packet matching it, the resulting RST would destroy the connection. This is probably only an issue if the attacker is on the same network as the docker host, as otherwise reverse-path filtering would discard such a forged packet.
At any rate, this issue appears to be fixed in more recent distributions.
Thanks
@hexcles, @djmitche, @imbstack, @stephenmcgruer
December 14, 2018
Wander Lairson Costa
Running packet.net images in qemu
For the past months, I have been working on adding Taskcluster support for the packet.net cloud provider. The reason for that is to get faster Firefox for Android CI tests. Tests showed that jobs run up to 4x faster on bare metal machines than on EC2.
I set up 25 machines to run a small subset of the production tasks, and so far results are excellent. The problem is that those machines are up 24/7 and there is no dynamic provisioning. If we need more machines, I have to manually change the terraform script to scale it up. We need a smart way to do that. We are going to build something similar to aws-provisioner. However, we need a custom packet.net image to speed up instance startup.
The problem is that if you can’t ssh into the machine, there is no way to get access to it to see what’s wrong. In this post, I am going to show how you can run a packet image locally with qemu.
You can find documentation about creating custom packet images here and here.
Let’s create a sample image for the post. After you clone the packet-images repo, run:
$ ./tools/build.sh -d ubuntu_14_04 -p t1.small.x86 -a x86_64 -b ubuntu_14_04-t1.small.x86-dev
This creates the image.tar.gz file, which is your packet image.
The goal of this post is not to guide you on creating your custom image; you can refer
to the documentation linked above for this. The goal here is, once you have your
image, how you can run it locally with qemu.
The first step is to create a qemu disk to install the image into:
$ qemu-img create -f raw linux.img 10G
This command creates a raw qemu image. We now need to create a disk partition:
$ cfdisk linux.img
Select dos for the partition table, create a single primary partition and make it bootable. We now need to create a loop device to handle our image:
$ sudo losetup -Pf linux.img
The -f option looks for the first free loop device for attachment to the image file. The -P option instructs losetup to read the partition table and create a loop device for each partition found; this saves us from having to play with disk offsets. Now let’s find our loop device:
$ sudo losetup -l
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 1 1 /var/lib/snapd/snaps/gnome-calculator_260.snap 0 512
/dev/loop8 0 0 1 1 /var/lib/snapd/snaps/gtk-common-themes_818.snap 0 512
/dev/loop6 0 0 1 1 /var/lib/snapd/snaps/core_5662.snap 0 512
/dev/loop4 0 0 1 1 /var/lib/snapd/snaps/gtk-common-themes_701.snap 0 512
/dev/loop11 0 0 1 1 /var/lib/snapd/snaps/gnome-characters_139.snap 0 512
/dev/loop2 0 0 1 1 /var/lib/snapd/snaps/gnome-calculator_238.snap 0 512
/dev/loop0 0 0 1 1 /var/lib/snapd/snaps/gnome-logs_45.snap 0 512
/dev/loop9 0 0 1 1 /var/lib/snapd/snaps/core_6034.snap 0 512
/dev/loop7 0 0 1 1 /var/lib/snapd/snaps/gnome-characters_124.snap 0 512
/dev/loop5 0 0 1 1 /var/lib/snapd/snaps/gnome-3-26-1604_70.snap 0 512
/dev/loop12 0 0 0 0 /home/walac/work/packet-images/linux.img 0 512
/dev/loop3 0 0 1 1 /var/lib/snapd/snaps/gnome-system-monitor_57.snap 0 512
/dev/loop10 0 0 1 1 /var/lib/snapd/snaps/gnome-3-26-1604_74.snap 0 512
We see that our loop device is /dev/loop12. If we look in the /dev directory:
$ ls -l /dev/loop12*
brw-rw---- 1 root 7, 12 Dec 17 10:39 /dev/loop12
brw-rw---- 1 root 259, 0 Dec 17 10:39 /dev/loop12p1
We see that, thanks to the -P option, losetup created the loop12p1 device for the partition we have. It is time to set up the filesystem:
$ sudo mkfs.ext4 -b 1024 /dev/loop12p1
mke2fs 1.44.4 (18-Aug-2018)
Discarding device blocks: done
Creating filesystem with 10484716 1k blocks and 655360 inodes
Filesystem UUID: 2edfe9f2-7e90-4c35-80e2-bd2e49cad251
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729, 204801, 221185, 401409, 663553,
1024001, 1990657, 2809857, 5120001, 5971969
Allocating group tables: done
Writing inode tables: done
Creating journal (65536 blocks): done
Writing superblocks and filesystem accounting information: done
Ok, finally we can mount our device and extract the image to it:
$ mkdir mnt
$ sudo mount /dev/loop12p1 mnt/
$ sudo tar -xzf image.tar.gz -C mnt/
The last step is to install the bootloader. As we are running an Ubuntu image, we will use grub2 for that.
Firstly we need to install grub in the boot sector:
$ sudo grub-install --boot-directory mnt/boot/ /dev/loop12
Installing for i386-pc platform.
Installation finished. No error reported.
Notice we point to the boot directory of our image. Next, we have to generate the grub.cfg file:
$ cd mnt/
$ for i in /proc /dev /sys; do sudo mount -B $i .$i; done
$ sudo chroot .
# cd /etc/grub.d/
# chmod -x 30_os-prober
# update-grub
Generating grub configuration file ...
Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
Found linux image: /boot/vmlinuz-3.13.0-123-generic
Found initrd image: /boot/initrd.img-3.13.0-123-generic
done
We bind mount the /dev, /proc and /sys host mount points inside the Ubuntu image, then chroot into it. Next, to avoid grub creating entries for our host OSes, we disable the 30_os-prober script. Finally we run update-grub and it creates the /boot/grub/grub.cfg file.
Now the only thing left is cleanup:
# exit
$ for i in dev/ sys/ proc/; do sudo umount $i; done
$ cd ..
$ sudo umount mnt/
The commands are self explanatory. Now let’s run our image:
$ sudo qemu-system-x86_64 -enable-kvm -hda /dev/loop12
And that’s it, you now can run your packet image locally!
August 29, 2018
John Ford
Shrinking Go Binaries
A bit of background is that Go binaries are static binaries which have the Go runtime and standard library built into them. This is great if you don't care about binary size but not great if you do.
To reproduce my results, you can do the following:
go get -u -t -v github.com/taskcluster/taskcluster-lib-artifact-go
cd $GOPATH/src/github.com/taskcluster/taskcluster-lib-artifact-go
git checkout 6f133d8eb9ebc02cececa2af3d664c71a974e833
time (go build) && wc -c ./artifact
time (go build && strip ./artifact) && wc -c ./artifact
time (go build -ldflags="-s") && wc -c ./artifact
time (go build -ldflags="-w") && wc -c ./artifact
time (go build -ldflags="-s -w") && wc -c ./artifact
time (go build && upx -1 ./artifact) && wc -c ./artifact
time (go build && upx -9 ./artifact) && wc -c ./artifact
time (go build && strip ./artifact && upx -1 ./artifact) && wc -c ./artifact
time (go build && strip ./artifact && upx --brute ./artifact) && wc -c ./artifact
time (go build && strip ./artifact && upx --ultra-brute ./artifact) && wc -c ./artifact
time (go build && strip && upx -9 ./artifact) && wc -c ./artifact
Since I was removing a lot of debugging information, I figured it'd be worthwhile checking that stack traces are still working. To ensure that I could definitely crash, I decided to panic with an error immediately on program startup.
Even with binary stripping and the maximum compression, I'm still able to get valid stack traces. A reduction from 9mb to 2mb is definitely significant. The binaries are still large, but they're much smaller than what we started out with. I'm curious if we can apply this same configuration to other areas of the Taskcluster Go codebase with similar success, and if the reduction in size is worthwhile there.
I think that using strip and upx -9 is probably the best path forward. This combination provides enough of a benefit over the non-upx options that the time tradeoff is likely worth the effort.
August 28, 2018
John Ford
Taskcluster Artifact API extended to support content verification and improve error detection
Background
At Mozilla, we're developing the Taskcluster environment for doing Continuous Integration, or CI. One of the fundamental concerns in a CI environment is being able to upload and download files created by each task execution. We call them artifacts. For Mozilla's Firefox project, an example of how we use artifacts is that each build of Firefox generates a product archive containing a build of Firefox, an archive containing the test files we run against the browser and an archive containing the compiler's debug symbols which can be used to generate stacks when unit tests hit an error.
The problem
In the old Artifact API, we had an endpoint which generated a signed S3 url that was given to the worker which created the artifact. This worker could upload anything it wanted at that location. This is not to suggest malicious usage, but any errors or early termination of uploads could result in a corrupted artifact being stored in S3 as if it were a correct upload. If you created an artifact with the local contents "hello-world\n", but your internet connection dropped midway through, the S3 object might only contain "hello-w". This went silently uncaught until something much later down the pipeline (hopefully!) complained that the file it got was corrupted. This corruption is the cause of many orange-factor bugs, but we had no way to figure out exactly where the corruption was happening.
Our old API also made artifact handling in tasks very challenging. It would often require a task writer to use one of our client libraries to generate a Taskcluster-Signed-URL and Curl to do uploads. For a lot of cases, this is fraught with hazards. Curl doesn't fail on errors by default (!!!), and Curl doesn't automatically handle "Content-Encoding: gzip" responses without "Accept: gzip", which we sometimes need to serve. It required each user to figure all of this out for themselves, each time they wanted to use artifacts.
We also had a "Completed Artifact" pulse message which wasn't actually conveying anything useful. It would send a message when the artifact was allocated in our metadata tables, not when the artifact was actually complete. We could mark a task as being completed before all of the artifacts were finished being uploaded. In practice, this was avoided by not completing the task before the uploads were done, but that was only a convention.
Our solution
We wanted to address a lot of issues with Taskcluster Artifacts. Specifically, the following issues are ones which we've tackled:
- Corruption during upload should be detected
- Corruption during download should be detected
- Corruption of artifacts should be attributable
- S3 Eventual Consistency error detection
- Caches should be able to verify whether they are caching valid items
- Completed Artifact messages should only be sent when the artifact is actually complete
- Tasks should be unresolvable until all uploads are finished
- Artifacts should be really easy to use
- Artifacts should be able to be uploaded with browser-viewable gzip encoding
Code
Here's the code we wrote for this project:
- https://github.com/taskcluster/remotely-signed-s3 -- A library which wraps the S3 APIs using the lower level S3 REST API and uses the aws4 request signing library
- https://github.com/taskcluster/taskcluster-lib-artifact -- A light wrapper around remotely-signed-s3 to enable JS based uploads and downloads
- https://github.com/taskcluster/taskcluster-lib-artifact-go -- A library and CLI written in Go
- https://github.com/taskcluster/taskcluster-queue/commit/6cba02804aeb05b6a5c44134dca1df1b018f1860 -- The final Queue patch to enable the new Artifact API
Upload Corruption
If an artifact is uploaded with a different set of bytes to those which were expected, we should fail the upload. The S3 V4 signatures system allows us to sign a request's headers, which include an X-Amz-Content-Sha256 and Content-Length header. This means that the request headers we get back from signing can only be used for a request which sets the X-Amz-Content-Sha256 and Content-Length to the values provided at signing. S3 checks that the Sha256 checksum and length of each request's body match the values provided in these headers.
The requests we get from the Taskcluster Queue can only be used to upload the exact file we asked permission to upload. This means that the only set of bytes that will allow the request(s) to S3 to complete successfully will be the ones we initially told the Taskcluster Queue about.
The two main cases we're protecting against here are disk and network corruption. The file ends up being read twice, once to hash and once to upload. Since we have the hash calculated, we can be sure to catch corruption if the two hashes or sizes don't match. Likewise, the possibility of network interruption or corruption is handled because the S3 server will report an error if the connection is interrupted or corrupted before data matching the Sha256 hash exactly is uploaded.
This does not protect against all broken files from being uploaded. This is an important distinction to make. If you upload an invalid zip file, but no corruption occurs once you pass responsibility to taskcluster-lib-artifact, we're going to happily store this defective file, but we're going to ensure that every step down the pipeline gets an exact copy of this defective file.
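As a hedged sketch of the idea, the uploader hashes the file before asking for signed requests. The requestSignedPut call below is a placeholder, not the real Queue or remotely-signed-s3 API.

// Illustrative only: hash the local file first, then ask for request signatures
// bound to exactly that Sha256 and length. `requestSignedPut` is a placeholder
// for the real Queue / remotely-signed-s3 call.
const crypto = require('crypto');
const fs = require('fs');

async function prepareUpload(filename, requestSignedPut) {
  const body = await fs.promises.readFile(filename);
  const sha256 = crypto.createHash('sha256').update(body).digest('hex');

  // Any request built from this signature can only succeed if the bytes sent
  // to S3 have exactly this hash and length.
  const signed = await requestSignedPut({
    contentSha256: sha256,
    contentLength: body.length,
  });
  return { body, signed };
}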
Download Corruption
Like corruption during upload, we could experience corruption or interruptions during downloading. In order to combat this, we set some metadata on the artifacts in S3. We set some extra headers during uploading:- x-amz-meta-taskcluster-content-sha256 -- The Sha256 of the artifact passed into a library -- i.e. without our automatic gzip encoding
- x-amz-meta-taskcluster-content-length -- The number of bytes of the artifact passed into a library -- i.e. without our automatic gzip encoding
- x-amz-meta-taskcluster-transfer-sha256 -- The Sha256 of the artifact as passed over the wire to S3 servers. In the case of identity encoding, this is the same value as x-amz-meta-taskcluster-content-sha256. In the case of Gzip encoding, it is almost certainly not identical.
- x-amz-meta-taskcluster-transfer-length -- The number of bytes of the artifact as passed over the wire to S3 servers. In the case of identity encoding, this is the same value as x-amz-meta-taskcluster-content-length. In the case of Gzip encoding, it is almost certainly not identical.
Important to note is that because these are non-standard headers, verification requires explicit action on the part of the artifact downloader. That's a big part of why we've written supported artifact downloading tools.
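A hedged sketch of what that explicit check might look like, given the response headers and the fully decoded (post-gunzip) body as a Buffer:

// Illustrative verification of a downloaded artifact against the metadata set
// at upload time. `headers` is the S3 response header map and `body` is the
// decoded (post-gunzip) content as a Buffer.
const crypto = require('crypto');

function verifyDownload(headers, body) {
  const expectedSha256 = headers['x-amz-meta-taskcluster-content-sha256'];
  const expectedLength = parseInt(headers['x-amz-meta-taskcluster-content-length'], 10);
  const actualSha256 = crypto.createHash('sha256').update(body).digest('hex');

  if (body.length !== expectedLength || actualSha256 !== expectedSha256) {
    throw new Error('downloaded artifact does not match its recorded sha256/length');
  }
}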
Attribution of Corruption
Corruption is inevitable in a massive system like Taskcluster. What's really important is that when corruption happens we detect it and we know where to focus our remediation efforts. In the new Artifact API, we can zero in on the culprit for corruption.
With the old Artifact API, we don't have any way to figure out if an artifact is corrupted or where that happened. We never know what the artifact was on the build machine, we can't verify corruption in caching systems and when we have an invalid artifact downloaded on a downstream task, we don't know whether it is invalid because the file was defective from the start or if it was because of a bad transfer.
Now, we know that if the Sha256 checksum of a downloaded artifact matches the one recorded at upload time but the file is still invalid, the original file was broken before it was uploaded. We can build caching systems which ensure that the value that they're caching is valid and alert us to corruption. We can track corruption to detect issues in our underlying infrastructure.
Completed Artifact Messages and Task Resolution
Previously, as soon as the Taskcluster Queue stored the metadata about the artifact in its internal tables and generated a signed url for the S3 object, the artifact was marked as completed. This behaviour resulted in a slightly deceptive message being sent. Nobody cares when this allocation occurs, but someone might care about an artifact becoming available.
On a related theme, we also allowed tasks to be resolved before the artifacts were uploaded. This meant that a task could be marked as "Completed -- Success" without actually uploading any of its artifacts. Obviously, we would always be writing workers with the intention of avoiding this error, but having it built into the Queue gives us a stronger guarantee.
We achieved this result by adding a new method to the flow of creating and uploading an artifact and adding a 'present' field in the Taskcluster Queue's internal Artifact table. For those artifacts which are created atomically, and the legacy S3 artifacts, we just set the value to true. For the new artifacts, we set it to false. When you finish your upload, you have to run a complete artifact method. This is sort of like a commit.
In the complete artifact method, we verify that S3 sees the artifact as present, and only once it's completed do we send the artifact-completed message. Likewise, in the complete task method, we ensure that all artifacts have a present value of true before allowing the task to complete.
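Conceptually, a worker's flow looks something like the sketch below; the method names (createArtifact, uploadBody, completeArtifact) and their signatures are stand-ins rather than the exact Queue API.

// Conceptual flow only; the method names and signatures here are stand-ins
// for the real Queue / library calls, not the exact API.
async function publishArtifact({ queue, uploadBody }, taskId, runId, name, body) {
  // 1. Allocate the artifact: the Queue records it with present = false.
  const { requests } = await queue.createArtifact(taskId, runId, name, body);

  // 2. Perform the signed upload(s) to S3.
  await uploadBody(requests, body);

  // 3. "Commit": the Queue verifies S3 sees the object, flips present to true,
  //    and only then sends the artifact-completed message. The task cannot be
  //    resolved until every artifact is present.
  await queue.completeArtifact(taskId, runId, name);
}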
S3 Eventual Consistency and Caching Error Detection
S3 works on an eventual-consistency model for some operations in some regions. Caching systems also have a certain level of tolerance for corruption. We're now able to determine whether the bytes we're downloading are those which we expect. We can now rely on more than an HTTP status code to know whether the request worked.
In both of these cases we can programmatically check whether the download is corrupt and try again as appropriate. In the future, we could even build smarts into our download libraries and tools to ask the caches involved to drop their data, or to try bypassing caches as a last resort.
Artifacts should be easy to use
Right now, if you're working with artifacts directly, you're probably having a hard time. You have to use something like Curl and build URLs or signed URLs. You've probably hit pitfalls like Curl not exiting with an error on a non-200 HTTP status. You're not getting any content verification. Basically, it's hard.
Taskcluster is about enabling developers to do their job effectively. Something as critical to CI usage as artifacts should be simple to use. To that end, we've implemented libraries for interacting with artifacts in Javascript and Go. We've also implemented a Go-based CLI for interacting with artifacts in the build system or shell scripts.
Javascript
The Javascript client uses the same remotely-signed-s3 library that the Taskcluster Queue uses internally. It's a really simple wrapper which provides a put() and get() interface. All of the verification of requests is handled internally, as is decompression of Gzip resources. This was primarily written to enable integration in Docker-Worker directly.
Go
We also provide a Go library for downloading and uploading artifacts. This is intended to be used in the Generic-Worker, which is written in Go. The Go library takes the minimum useful interfaces from the standard I/O library for its inputs and outputs, and uses type assertions to do even more intelligent things with those inputs and outputs that support it.
CLI
For all other users of Artifacts, we provide a CLI tool. This provides a simple interface to interact with artifacts. The intention is to make it available in the path of the task execution environment, so that users can simply call "artifact download --latest $taskId $name --output browser.zip".
Artifacts should allow serving to the browser in Gzip
We want to enable large text files which compress extremely well with Gzip to be rendered by web browsers. An example is displaying and transmitting logs. Because of limitations in S3 around Content-Encoding and its complete lack of content negotiation, we have to decide at upload time whether or not an artifact should be Gzip compressed.
There's an option in the libraries to support automatic Gzip compression of things we're going to upload. We chose Gzip over possibly-better encoding schemes because this is a one-time choice at upload time, so we wanted to make sure that the scheme we used would be broadly implemented.
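For a sense of what that automatic compression amounts to, here's a minimal sketch using Go's standard compress/gzip package; the function name and the idea of returning the compressed bytes for upload are illustrative, not the libraries' real API:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"log"
)

// gzipForUpload compresses a payload once, at upload time, matching the
// one-time Content-Encoding decision described above. A real uploader would
// record both the identity and transfer checksums alongside the object.
func gzipForUpload(payload []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	body := bytes.Repeat([]byte("[task 2018-06-15] test passed\n"), 1000)
	compressed, err := gzipForUpload(body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("identity: %d bytes, gzip: %d bytes\n", len(body), len(compressed))
}
```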
Further Improvements
As always, there are still some things around artifact handling that we'd like to improve upon. For starters, we should work on splitting artifact handling out of our Queue. We've already agreed on a design for how we should store artifacts: it involves moving all of the artifact handling out of the Queue into a different service and having the Queue track only which artifacts belong to each task run.
We're also investigating an idea to store each artifact in the region it is created in. Right now, all artifacts are stored in EC2's US West 2 region. We could have a situation where a build VM and a test VM are running on the same hypervisor in US East 1, but each artifact has to be uploaded and downloaded via US West 2.
Another area we'd like to work on is supporting other clouds. Taskcluster should ideally support whichever cloud provider you'd like to use. We want to support storage providers other than S3, and splitting out the low-level artifact handling gives us a huge maintainability win.
Possible Contributions
We're always open to contributions! A great one that we'd love to see is allowing concurrency of multipart uploads in Go. It turns out that this is a lot more complicated than I'd like it to be in order to support passing in the low-level io.Reader interface. We'd want to do some type assertions to see if the input supports io.ReaderAt, and if not, use a per-goroutine offset and a file mutex to guard around seeking on the file. I'm happy to mentor this project, so get in touch if that's something you'd like to work on.
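To sketch the shape of that contribution (using a hypothetical readPart helper, not the library's actual code), the type assertion and the mutex-guarded fallback might look something like this:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

const partSize = 8 // tiny for demonstration; real parts are megabytes

// readPart reads one part of a multipart upload. If the input supports
// io.ReaderAt, parts can be read concurrently at independent offsets;
// otherwise a mutex serializes access to the shared reader (a full
// implementation would also track the shared offset per goroutine).
func readPart(r io.Reader, mu *sync.Mutex, offset int64) ([]byte, error) {
	buf := make([]byte, partSize)
	if ra, ok := r.(io.ReaderAt); ok { // the type assertion discussed above
		n, err := ra.ReadAt(buf, offset)
		if err == io.EOF {
			err = nil
		}
		return buf[:n], err
	}
	// Fallback: no random access, so reads stay sequential behind the mutex.
	mu.Lock()
	defer mu.Unlock()
	n, err := io.ReadFull(r, buf)
	if err == io.ErrUnexpectedEOF || err == io.EOF {
		err = nil
	}
	return buf[:n], err
}

func main() {
	var mu sync.Mutex
	// bytes.Reader implements io.ReaderAt, so its parts could be read
	// concurrently; io.MultiReader hides that interface and exercises the
	// mutex-guarded path instead.
	ra := bytes.NewReader([]byte("0123456789abcdef"))
	seq := io.MultiReader(strings.NewReader("0123456789abcdef"))

	part, _ := readPart(ra, &mu, 8)
	fmt.Printf("concurrent-capable part: %q\n", part)
	part, _ = readPart(seq, &mu, 0)
	fmt.Printf("sequential part: %q\n", part)
}
```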
Conclusion
This project has been a really interesting one for me. It gave me an opportunity to learn the Go programming language and work with the underlying AWS REST API. It's been an interesting experience after being heads down in Node.js code, and has been a great reminder of how to use static, strongly typed languages. I'd forgotten how nice a real type system is to work with!
Integration into our workers is still ongoing, but I wanted to give an overview of this project to keep everyone in the loop. I'm really excited to see a reduction in the number of corrupted artifacts.
August 22, 2018
Dustin Mitchell
Introducing CI-Admin
A major focus of recent developments in Firefox CI has been putting control of the CI process in the hands of the engineers working on the project. For the most part, that means putting configuration in the source tree. However, some kinds of configuration don’t fit well in the tree. Notably, configuration of the trees themselves must reside somewhere else.
CI-Configuration
This information is collected in the ci-configuration repository.
This is a code-free library containing YAML files describing various aspects of the configuration – currently the available repositories (projects.yml
) and actions.
This repository is designed to be easy to modify by anyone who needs to modify it, through the usual review processes. It is even Phabricator-enabled!
CI-Admin
Historically, we’ve managed this sort of configuration by clicking around in the Taskcluster Tools. The situation is analogous to clicking around in the AWS console to set up a cloud deployment – it works, and it’s quick and flexible. But it gets harder as the configuration becomes more complex, it’s easy to make a mistake, and it’s hard to fix that mistake. Not to mention, the tools UI shows a pretty low-level view of the situation that does not make common questions (“Is this scope available to cron jobs on the larch repo?”) easy to answer.
The devops world has faced down this sort of problem, and the preferred approach is embodied in tools like Puppet or Terraform:
- write down the desired configuration in human-parsable text files
- check it into a repository and use the normal software-development processes (CI, reviews, merges..)
- apply changes with a tool that enforces the desired state
This “desired state” approach means that the tool examines the current configuration, compares it to the configuration expressed in the text files, and makes the necessary API calls to bring the current configuration into line with the text files. Typically, there are utilities to show the differences, partially apply the changes, and so on.
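As a minimal sketch of that desired-state loop (with a made-up Resource type and IDs, not ci-admin's real data model), the core comparison might look like this:

```go
package main

import "fmt"

// Resource is a simplified stand-in for a Taskcluster role, hook, or client.
type Resource struct {
	ID   string
	Body string
}

// reconcile compares the desired resources (from the checked-in text files)
// with the current ones (from the Taskcluster APIs) and returns what a tool
// like ci-admin would create, update, or delete to enforce the desired state.
func reconcile(desired, current map[string]Resource) (create, update, remove []Resource) {
	for id, d := range desired {
		c, exists := current[id]
		switch {
		case !exists:
			create = append(create, d)
		case c.Body != d.Body:
			update = append(update, d)
		}
	}
	for id, c := range current {
		if _, wanted := desired[id]; !wanted {
			remove = append(remove, c)
		}
	}
	return
}

func main() {
	desired := map[string]Resource{
		"Role=repo:hg.mozilla.org/try:*": {ID: "Role=repo:hg.mozilla.org/try:*", Body: "scopes: [...]"},
		"Hook=project-releng/nightly":    {ID: "Hook=project-releng/nightly", Body: "schedule: ..."},
	}
	current := map[string]Resource{
		"Role=repo:hg.mozilla.org/try:*": {ID: "Role=repo:hg.mozilla.org/try:*", Body: "scopes: [old]"},
		"Client=stale/client":            {ID: "Client=stale/client", Body: "..."},
	}
	create, update, remove := reconcile(desired, current)
	fmt.Println("create:", len(create), "update:", len(update), "delete:", len(remove))
}
```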
The ci-configuration
repository contains those human-parsable text files.
The tool to enforce that state is ci-admin
.
It has some generic resource-manipulation support, along with some very Firefox-CI-specific code to do weird things like hashing .taskcluster.yml
.
Making Changes
The current process for making changes is a little cumbersome. In part, that’s intentional: this tool controls the security boundaries we use to separate try from release, so its application needs to be carefully controlled and subject to significant human review. But there’s also some work to do to make it easier (see below).
The process is this:
- make a patch to either or both repos, and get review from someone in the “Build Config - Taskgraph” module
- land the patch
- get someone with the proper access to run
ci-admin apply
for you (probably the reviewer can do this)
Future Plans
Automation
We are in the process of setting up some automation around these repositories. This includes Phabricator, Lando, and Treeherder integration, along with automatic unit test runs on push.
More specific to this project, we also need to check that the current and expected configurations match.
This needs to happen on any push to either repo, but also in between pushes: someone might make a change “manually”, or some of the external data sources (such as the Hg access-control levels for a repo) might change without a commit to the ci-configuration
repo.
We will do this via a Hook that runs ci-admin diff
periodically, notifying relevant people when a difference is found.
These results, too, will appear in Treeherder.
Grants
One of the most intricate and confusing aspects of configuration for Firefox CI is the assignment of scopes to various jobs.
The current implementation uses a cascade of role inheritance and *
suffixes which, frankly, no human can comprehend.
The new plan is to “grant” scopes to particular targets in a file in ci-configuration
.
Each grant will have a clear purpose, with accompanying comments if necessary.
Then, ci-admin
will gather all of the grants and combine them into the appropriate role definitions.
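A minimal sketch of that combining step might look like the following; the Grant fields and example scopes are illustrative, not the final ci-configuration format:

```go
package main

import (
	"fmt"
	"sort"
)

// Grant assigns some scopes to one or more roles, mirroring the grant entries
// planned for ci-configuration (field names here are illustrative only).
type Grant struct {
	Scopes []string
	To     []string // role IDs receiving the scopes
}

// combine folds every grant into per-role scope sets, the way ci-admin would
// before writing the resulting role definitions.
func combine(grants []Grant) map[string][]string {
	sets := map[string]map[string]bool{}
	for _, g := range grants {
		for _, role := range g.To {
			if sets[role] == nil {
				sets[role] = map[string]bool{}
			}
			for _, s := range g.Scopes {
				sets[role][s] = true
			}
		}
	}
	roles := map[string][]string{}
	for role, set := range sets {
		for s := range set {
			roles[role] = append(roles[role], s)
		}
		sort.Strings(roles[role]) // stable output for diffing
	}
	return roles
}

func main() {
	grants := []Grant{
		{Scopes: []string{"queue:create-task:aws-provisioner-v1/gecko-t-*"}, To: []string{"repo:hg.mozilla.org/try:*"}},
		{Scopes: []string{"secrets:get:project/releng/*"}, To: []string{"repo:hg.mozilla.org/mozilla-central:*"}},
	}
	fmt.Println(combine(grants))
}
```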
Worker Configurations
At the moment, the configuration of, say, aws-provisioner-v1/gecko-t-large
is a bit of a mystery.
It’s visible to some people in the AWS-provisioner tool, if you know to look there.
But that definition also contains some secret data, so it is not publicly visible like roles or hooks are.
In the future, we’d like to generate these configurations based on ci-configuration
.
That both makes it clear how a particular worker type is configured (instance type, capacity configuration, regions, disk space, etc.), and allows anyone to propose a modification to that configuration – perhaps to try a new instance type.
Terraform Provider
As noted above, ci-admin
is fairly specific to the needs of Firefox CI.
Other users of Taskcluster would probably want something similar, although perhaps a bit simpler.
Terraform is already a popular tool for configuring cloud services, and supports plug-in “providers”.
It would not be terribly difficult to write a terraform-provider-taskcluster
that can create roles, hooks, clients, and so on.
This is left as an exercise for the motivated user!
Links
June 15, 2018
Dustin Mitchell
Actions as Hooks
You may already be familiar with in-tree actions: they allow you to do things like retrigger, backfill, and cancel Firefox-related tasks.
They implement any “action” on a push that occurs after the initial hg push
operation.
This article goes into a bit of detail about how this works, and a major change we’re making to that implementation.
History
Until very recently, actions worked like this:
First, the decision task (the task that runs in response to a push and decides what builds, tests, etc. to run) creates an artifact called actions.json
.
This artifact contains the list of supported actions and some templates for tasks to implement those actions.
When you click an action button (in Treeherder or the Taskcluster tools, or any UI implementing the actions spec), code running in the browser renders that template and uses it to create a task, using your Taskcluster credentials.
I talk a lot about functionality being in-tree. Actions are yet another example. Actions are defined in-tree, using some pretty straightforward Python code. That means any engineer who wants to change or add an action can do so – no need to ask permission, no need to rely on another engineer’s attention (aside from review, of course).
There’s Always a Catch: Security
Since the beginning, Taskcluster has operated on a fairly simple model: if you can accomplish something by pushing to a repository, then you can accomplish the same thing directly. At Mozilla, the core source-code security model is the SCM level: try-like repositories are at level 1, project (twig) repositories at level 2, and release-train repositories (autoland, central, beta, etc.) are at level 3. Similarly, LDAP users may have permission to push to level 1, 2, or 3 repositories. The current configuration of Taskcluster assigns the same scopes to users at a particular level as it does to repositories.
If you have such permission, check out your scopes in the Taskcluster credentials tool (after signing in). You’ll see a lot of scopes there.
The Release Engineering team has made release promotion an action. This is not something that every user who can push to a level-3 repository – hundreds of people – should be able to do! Since it involves signing releases, this means that every user who can push to a level-3 repository has scopes involved in signing a Firefox release. It's not quite as bad as it seems: there are lots of additional safeguards in place, not least of which is the “Chain of Trust” that cryptographically verifies the origin of artifacts before signing.
All the same, this is something we (and the Firefox operations security team) would like to fix.
In the new model, users will not have the same scopes as the repositories they can push to. Instead, they will have scopes to trigger specific actions on task-graphs at specific levels. Some of those scopes will be available to everyone at that level, while others will be available only to more limited groups. For example, release promotion would be available to the Release Management team.
Hooks
This makes actions a kind of privilege escalation: something a particular user can cause to occur, but could not do themselves.
The Taskcluster-Hooks service provides just this sort of functionality:
a hook creates a task using scopes assigned by a role, without requiring the user calling triggerHook
to have those scopes.
The user must merely have the appropriate hooks:trigger-hook:..
scope.
So, we have added a “hook” kind to the action spec.
The difference from the original “task” kind is that actions.json
specifies a hook to execute, along with well-defined inputs to that hook.
The user invoking the action must have the hooks:trigger-hook:..
scope for the indicated hook.
We have also included some protection against clickjacking, preventing someone with permission to execute a hook from being tricked into executing one maliciously.
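For intuition, here's a minimal sketch of the basic scope-satisfaction check behind triggering a hook (ignoring role expansion, and with made-up hook IDs):

```go
package main

import (
	"fmt"
	"strings"
)

// satisfies implements the basic Taskcluster scope-satisfaction rule: a held
// scope satisfies a required scope if it matches exactly, or if it ends in "*"
// and is a prefix of the requirement. (Role expansion is omitted here.)
func satisfies(held []string, required string) bool {
	for _, s := range held {
		if s == required {
			return true
		}
		if strings.HasSuffix(s, "*") && strings.HasPrefix(required, s[:len(s)-1]) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical hook IDs, just to show the shape of the check.
	userScopes := []string{"hooks:trigger-hook:project-gecko/in-tree-action-1-generic/*"}

	generic := "hooks:trigger-hook:project-gecko/in-tree-action-1-generic/abcdef"
	fmt.Println(satisfies(userScopes, generic)) // true: the user may trigger this hook

	release := "hooks:trigger-hook:project-gecko/in-tree-action-3-release-promotion/abcdef"
	fmt.Println(satisfies(userScopes, release)) // false: reserved for a smaller group
}
```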
Generic Hooks
There are three things we may wish to vary for an action:
- who can invoke the action;
- the scopes with which the action executes; and
- the allowable inputs to the action.
Most of these are configured within the hooks service (using automation, of course). If every action is configured uniquely within the hooks service, then the self-service nature of actions would be lost: any new action would require assistance from someone with permission to modify hooks.
As a compromise, we noted that most actions should be available to everyone who can push to the corresponding repo, have fairly limited scopes, and need not limit their inputs. We call these “generic” actions, and creating a new such action is self-serve. All other actions require some kind of external configuration: allocating the scope to trigger the task, assigning additional scopes to the hook, or declaring an input schema for the hook.
Hook Configuration
The hook definition for an action hook is quite complex: it involves a complex task definition template as well as a large schema for the input to triggerHook
.
For decision tasks, cron tasks, and “old” actions, this is defined in .taskcluster.yml
, and we wanted to continue that with hook-based actions.
But this creates a potential issue: if a push changes .taskcluster.yml
, that push will not automatically update the hooks – such an update requires elevated privileges and must be done by someone who can sanity-check the operation.
To solve this, ci-admin creates hooks based on the .taskcluster.yml
it finds in any Firefox repository, naming each after a hash of the file’s content.
Thus, once a change is introduced, it can “ride the trains”, using the same hash in each repository.
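A minimal sketch of that content-hash naming, assuming a plain SHA-256 of the file (the real ci-admin hashing may differ in detail), looks like this:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashTaskclusterYml shows the general idea of naming a hook after the content
// of .taskcluster.yml: the same file contents produce the same hash in every
// repository, so a change can "ride the trains" under one name.
func hashTaskclusterYml(contents []byte) string {
	sum := sha256.Sum256(contents)
	return fmt.Sprintf("%x", sum)[:32] // truncated for a readable hook name
}

func main() {
	yml := []byte("version: 1\ntasks:\n  - ...\n")
	// Hypothetical hook ID combining an action name and the content hash.
	fmt.Println("in-tree-action-1-generic/" + hashTaskclusterYml(yml))
}
```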
Implementation and Implications
As of this writing, two common actions are operating as hooks: retrigger and backfill. Both are “generic” actions, so the next step is to start to implement some actions that are not generic. Ideally, nobody notices anything here: it is merely an implementation change.
Once all actions have been converted to hooks, we will begin removing scopes from users. This will have a more significant impact: lots of activities such as manually creating tasks (including edit-and-create) will no longer be allowed. We will try to balance the security issues against user convenience here. Some common activities may be implemented as actions (such as creating loaners). Others may be allowed as exceptions (for example, creating test tasks). But some existing workflows may need to change to accommodate this improvement.
We hope to finish the conversion process in July 2018, with that time largely taken up by a slow rollout to accommodate unforeseen implications. When the project is finished, Firefox releases and other sensitive operations will be better-protected, with minimal impact to developers’ existing workflows.
May 21, 2018
Dustin Mitchell
Redeploying Taskcluster: Hosted vs. Shipped Software
The Taskcluster team’s work on redeployability means switching from a hosted service to a shipped application.
A hosted service is one where the authors of the software are also running the main instance of that software. Examples include Github, Facebook, and Mozillians. By contrast, a shipped application is deployed multiple times by people unrelated to the software’s authors. Examples of shipped applications include Gitlab, Joomla, and the Rust toolchain. And, of course, Firefox!
Hosted Services
Operating a hosted service can be liberating. Blog posts describe the joys of continuous deployment – even deploying the service multiple times per day. Bugs can be fixed quickly, either by rolling back to a previous deployment or by deploying a fix.
Deploying new features on a hosted service is pretty easy, too. Even a complex change can be broken down into phases and accomplished without downtime. For example, changing the backend storage for a service can be accomplished by modifying the service to write to both old and new backends, mirroring existing data from old to new, switching reads to the new backend, and finally removing the code to write to the old backend. Each phase is deployed separately, with careful monitoring. If anything goes wrong, rollback to the old backend is quick and easy.
Hosted service developers are often involved with operation of the service, and operational issues can frequently be diagnosed or even corrected with modifications to the software. For example, if a service is experiencing performance issues due to particular kinds of queries, a quick deployment to identify and reject those queries can keep the service up, followed by a patch to add caching or some other approach to improve performance.
Shipped Applications
A shipped application is sent out into the world to be used by other people. Those users may or may not use the latest version, and certainly will not update several times per day (the heroes running Firefox Nightly being a notable exception). So, many versions of the application will be running simultaneously. Some applications support automatic updates, but many users want to control when – and if – they update. For example, upgrading a website built with a CMS like Joomla is a risky operation, especially if the website has been heavily customized.
Upgrades are important both for new features and for bugfixes, including for security bugs. An instance of an application like Gitlab might require an immediate upgrade when a security issue is discovered. However, especially if the deployment is several versions old, that critical upgrade may carry a great deal of risk. Producers of shipped software sometimes provide backported fixes for just this purpose, at least for long term support (LTS) or extended support release (ESR) versions, but this has a substantial cost for the application developers.
Upgrading services like Gitlab or Joomla is made more difficult because there is lots of user data that must remain accessible after the upgrade. For major upgrades, that often requires some kind of migration as data formats and schemas change. In cases where the upgrade spans several major versions, it may be necessary to apply several migrations in order. Tools like Alembic help with this by maintaining and applying step-by-step database migrations.
Taskcluster
Today, Taskcluster is very much a hosted application. There is only one “instance” of Taskcluster in the world, at taskcluster.net. The Taskcluster team is responsible for both development and operation of the service, and also works closely with the Firefox build team as a user of the service.
We want to make Taskcluster a shipped application. As the descriptions above suggest, this is not a simple process. The following sections highlight some of the challenges we are facing.
Releases and Deployment
We currently deploy Taskcluster microservices independently. That is, when we make a change to a service like taskcluster-hooks, we deploy an upgrade to that service without modifying the other services. We often sequence these changes carefully to ensure continued compatibility: we expect only specific combinations of services to run together.
This is a far more intricate process than we can expect users to follow. Instead, we will ship Taskcluster releases comprised of a set of built Docker images and a spec file identifying those images and how they should be deployed. We will test that this particular combination of versions works well together.
Deploying a release involves combining that spec file with some
deployment-specific configuration and some infrastructure information
(implemented via Terraform) to produce a set of
Kubernetes resources for deployment with kubectl
.
Kubernetes and Terraform both have limited support for migration from one
release to another: Terraform will only create or modify changed resources, and
Kubernetes will perform a phased roll-out of any modified resources.
By the way, all of this build-and-release functionality is implemented in the new taskcluster-installer.
Service Discovery
The string taskcluster.net
appears quite frequently in the Taskcluster source
code. For any other deployment, that hostname is not valid – but how will the
service find the correct hostname? The question extends to determining pulse
exchange names, task artifact hostnames, and so on. There are also security
issues to consider: misconfiguration of URLs might enable XSS and CSRF attacks
from untrusted content such as task artifacts.
The approach we are taking is to define a rootUrl
from which all other URLs
and service identities can be determined. Some are determined by simple
transformations encapsulated in a new
taskcluster-lib-urls
library. Others are fetched at runtime from other services: pulse exchanges
from the taskcluster-pulse service, artifact URLs from the taskcluster-queue
service, and so on.
The rootUrl
is a single domain, with all Taskcluster services available at
sub-paths such as /api/queue
. Users of the current Taskcluster installation
will note that this is a change: queue is currently at
https://queue.taskcluster.net
, not https://taskcluster.net/queue
. We have
solved this issue by special-casing the rootUrl https://taskcluster.net
to
generate the old-style URLs. Once we have migrated all users out of the current
installation, we will remove that special-case.
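A minimal sketch of that rootUrl-based URL generation, including the legacy special case, might look like this; the helper name and exact path layout are illustrative rather than taskcluster-lib-urls' actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// apiURL builds a service URL from a rootUrl, in the spirit of
// taskcluster-lib-urls. The special case for https://taskcluster.net mirrors
// the transition described above.
func apiURL(rootURL, service, version, path string) string {
	rootURL = strings.TrimRight(rootURL, "/")
	if rootURL == "https://taskcluster.net" {
		// Old-style URLs: one hostname per service.
		return fmt.Sprintf("https://%s.taskcluster.net/%s/%s", service, version, path)
	}
	// New-style URLs: one domain, services at sub-paths.
	return fmt.Sprintf("%s/api/%s/%s/%s", rootURL, service, version, path)
}

func main() {
	fmt.Println(apiURL("https://taskcluster.net", "queue", "v1", "task/abc123"))
	fmt.Println(apiURL("https://taskcluster.example.com", "queue", "v1", "task/abc123"))
}
```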
The single root domain is implemented using routing features supplied by
Kubernetes Ingress resources, based on an HTTP proxy. This has the
side-effect that when one microservice contacts another (for example,
taskcluster-hooks calling queue.createTask
), it does so via the same Ingress,
a more circuitous journey than is strictly required.
Data Migrations
The first few deployments of Taskcluster will not require great support for migrations. A staging environment, for example, can be completely destroyed and re-created without any adverse impact. But we will soon need to support users upgrading Taskcluster from earlier releases with no (or at least minimal) downtime.
Our Azure tables library (azure-entities) already has rudimentary support for schema updates, so modifying the structure of table rows is not difficult, although refactoring a single table into multiple tables would be difficult.
As we transition to using Postgres instead of Azure, we will need to adopt some of the common migration tools. Ideally we can support downtime-free upgrades like azure-entities does, instead of requiring downtime to run DB migrations synchronously. Bug 1431783 tracks this work.
Customization
As a former maintainer of Buildbot, I’ve had a lot of experience with CI applications as they are used in various organizations. The surprising observation is this: every organization thinks that their approach to CI is the obvious and only way to do things; and every organization does things in a radically different way. Developers gonna develop, and any CI framework will get modified to suit the needs of each user.
Lots of Buildbot installations are heavily customized to meet local needs. That has caused a lot of Buildbot users to get “stuck” at older versions, since upgrades would conflict with the customizations. Part of this difficulty is due to a failure of the Buildbot project to provide strong guidelines for customization. Recent versions of Buildbot have done better by providing clearly documented APIs and marking other interfaces as private and subject to change.
Taskcluster already has strong APIs, so we begin a step ahead. We might consider additional guidelines:
-
Users should not customize existing services, except to make experimental changes that will eventually be merged upstream. This frees the Taskcluster team to make changes to services without concern that those will conflict with users’ modifications.
-
Users are encouraged, instead, to develop their own services, either hosted within the Taskcluster deployment as a site-specific service, or hosted externally but following Taskcluster API conventions. A local example is the tc-coalesce service, developed by the release engineering team to support Mozilla-specific task-superseding needs and hosted outside of the Taskcluster installation. On the other hand, taskcluster-stats-collector is deployed within the Firefox Taskcluster deployment, but is Firefox-specific and not part of a public Taskcluster release.
-
While a Taskcluster release will likely encompass some pre-built worker images for various cloud platforms, sophisticated worker deployment is the responsibility of individual users. That may mean deploying workers to hardware where necessary, perhaps with modifications to the build configurations or even entirely custom-built worker implementations. We will provide cloud-provisioning tools that can be used to dynamically instantiate user-specified images.
Generated Client Libraries
The second point above raises an interesting quandary: Taskcluster uses code generation to create its API client libraries. Historically, we have just pushed the “latest” client to the package repository and carefully choreographed any incompatible changes. For users who have not customized their deployment, this is not too much trouble: any release of Taskcluster will have a client library in the package repository corresponding to it. We don’t have a great way to indicate which version that is, but perhaps we will invent something.
But when Taskcluster installations are customized by adding additional services, progress is no longer linear: each user has a distinct “fork” of the Taskcluster API surface containing the locally-defined services. Development of Taskcluster components poses a similar challenge: if I add a new API method to a service, how do I call that method from another service without pushing a new library to the package repository?
The question is further complicated by the use of compiled languages. While Python and JS clients can simply load a schema reference file at runtime (for example, a file generated at deploy time), the Go and Java clients “bake in” the references at compile time.
Despite much discussion, we have yet to settle on a good solution for this issue.
Everything is Public!
Mozilla is Open by Design, and so is Taskcluster: with the exception of data that must remain private (passwords, encryption keys, and material covered by other companies’ NDAs), everything is publicly accessible. While Taskcluster does have a sophisticated and battle-tested authorization system based on scopes, most read-only API calls do not require any scopes and thus can be made with a simple, un-authenticated HTTP request.
We take advantage of the public availability of most data by passing around
simple, authentication-free URLs. For example, the action
specification
describes downloading a decision task’s public/action.json
artifact. Nowhere
does it mention providing any credentials to fetch the decision task, nor to
fetch the artifact itself.
This is a rather fundamental design decision, and changing it would be difficult. We might embark on that process, but we might also declare Taskcluster an open-by-design system, and require non-OSS users to invent other methods of hiding their data, such as firewalls and VPNs.
Transitioning from taskcluster.net
Firefox build, test, and release processes run at massive scale on the existing Taskcluster instance at https://taskcluster.net, along with a number of smaller Mozilla-associated projects. As we work on this “redeployability” project, we must continue to deploy from master to that service as well – the rootUrl special-case mentioned above is a critical part of this compatibility. We will not be running either new or old instances from long-living Git branches.
Some day, we will need to move all of these projects to a newly redeployed
cluster and delete the old. That day is still in the distant future. It will
likely involve some running of tasks in parallel to expunge any leftover
references to taskcluster.net
, then a planned downtime to migrate everything
over (we will want to maintain task and artifact history, for example). We will
likely finish up by redeploying a bunch of permanent redirects from
taskcluster.net
domains.
Conclusion
That’s just a short list of some of the challenges we face in transmuting a hosted service into a shipped application.
All the while, of course, we must “keep the lights on” for the existing deployment, and continue to meet Firefox’s needs. At the moment that includes a project to deploy Taskcluster workers on arm64 hardware in https://packet.net, development of the docker-engine to replace the aging docker worker, using hooks for actions to reduce the scopes afforded to level-3 users, improving taskcluster-github to support defining decision tasks, and the usual assortment of contributed pull requests, issue debugging, and service requests.
May 01, 2018
Dustin Mitchell
Design of Task-Graph Generation
Almost two years ago, Bug 1258497 introduced a new system for generating the graph of tasks required for each push to a Firefox source-code repository. Work continues to modify the expected tasks and add features, but the core design is stable. Lots of Firefox developers have encountered this system as they add or modify a job or try to debug why a particular task is failing. So this is a good time to review the system design at a high level.
A quick note before beginning: the task-graph generation system is implemented entirely in the Firefox source tree, and is administered as a sub-module of the Build Config module. While it is designed to interface with Taskcluster, and some of the authors are members of the Taskcluster team, it is not a part of Taskcluster itself.
Requirements
A task is a unit of work in the aptly-named Taskcluster platform. This might be a Firefox build process, or a run of a chunk of a test suite. More esoteric tasks include builds of the toolchains and OS environments used by other tasks; cryptographic signing of Firefox installers; configuring Balrog, the service behind Firefox’s automatic updates; and pushing APKs to the Google Play Store.
A task-graph is a collection of tasks linked by their dependencies. For example, a test task cannot run until the build it is meant to test has finished, and that build cannot run until the compiler toolchain it requires has been built.
The task-graph generation system, then, is responsible for generating a task-graph containing the tasks required to test a try push, a landing on a production branch, a nightly build, and a full release of Firefox. That task graph must be minimal (for example, not rebuilding a toolchain if it has already been built) and specific to the purpose (some tasks only run on mozilla-central, for example).
Firefox has been using some CI system – Tinderbox, then Buildbot, and now Taskcluster – for decades, so the set of requirements is quite deep and shrouded in historical mystery.
While the resulting system may seem complex, it is a relatively simple expression of the intricate requirements it must meet. It is also designed with approachability in mind: many common tasks can be accomplished without fully understanding the design.
System Design
The task-graph generation process itself runs in a task, called the Decision Task. That task is typically created in response to a push to version control, and is typically the first task to appear in Treeherder, with a “D” symbol. The decision task begins by checking out the pushed revision, and then runs the task-graph generation implementation in that push. That means the system can be tested in try, and can ride the trains just like any other change to Firefox.
Task-Graph Generation Process
The decision task proceeds in a sequence of steps:
-
Generate a graph containing all possible tasks (the full task graph). As of this writing, the full task graph contains 10,972 tasks!
-
Filter the graph to select the required tasks for this situation. Each project (a.k.a. “branch” or “repo”) has different requirements. Try pushes are a very flexible kind of filtering, selecting only the tasks indicated by the (arcane!) try syntax or the (awesome!) try-select system (more on this below). The result is the target task graph.
-
“Optimize” the graph, by trimming unnecessary tasks. Some tasks, such as tests, can simply be dropped if they are not required. Others, such as toolchain builds, must be replaced by an existing task containing the required data. The result is the optimized task graph.
-
Create each of the tasks using the Taskcluster API.
The process is a bit more detailed but this level of detail will do for now.
Kinds and Loaders
We’ll now focus on the first step: generating the full task graph. In an effort
to segment the mental effort required, tasks are divided into kinds. There
are some obvious kinds – build, test, toolchain – and a lot of less obvious
kinds. Each kind has a directory in
taskcluster/ci
.
Each kind is responsible for generating a list of tasks and their dependencies.
The tasks for all kinds are combined to make the full task graph. Each kind
can generate its tasks in a different way; this is the job of the kind’s
loader. Each kind has a kind.yml
which points to a Python function that
acts as its loader.
Most loaders just load task definitions from YAML files in the kind directory.
There are a few more esoteric loaders – for example, the test loader creates
one copy of each test for each platform, allowing a single definition of, say
mochitest-chrome
to run on all supported platforms.
Transforms
A “raw” task is designed for execution by a Taskcluster worker. It has all sorts of details of the task baked into environment variables, the command to run, routes, and so on. We do not want to write expressions to generate that detail over and over for each task, so we design the inputs in the YAML files to be much more human-friendly. The system uses transforms to bridge the gap: each task output from the loader is passed through a series of transforms, in the form of Python generator functions, to produce the final, raw task.
To bring some order to the process, there are some specific forms defined, with schemas and sets of transforms to turn one into the next:
-
Test Description - how to perform a test, including suite and flavor, hardware features required, chunking configuration, and so on.
-
Job Description - how to perform a job; essentially “run Mozharness with these arguments” or “run the Debian package-building process with these inputs”
-
Task Description - how to build a task description; this contains all of the command arguments, environment variables, and so on but is not specific to a particular worker implementation.
There are several other “descriptions”, but these give the general idea.
The final effect is that a relatively concise, readable build description like this:
linux64/debug:
    description: "Linux64 Debug"
    index:
        product: firefox
        job-name: linux64-debug
    treeherder:
        platform: linux64/debug
        symbol: B
    worker-type: aws-provisioner-v1/gecko-{level}-b-linux
    worker:
        max-run-time: 36000
    run:
        using: mozharness
        actions: [get-secrets build check-test update]
        config:
            - builds/releng_base_firefox.py
            - builds/releng_base_linux_64_builds.py
        script: "mozharness/scripts/fx_desktop_build.py"
        secrets: true
        custom-build-variant-cfg: debug
        tooltool-downloads: public
        need-xvfb: true
    toolchains:
        - linux64-clang
        - linux64-gcc
        - linux64-sccache
        - linux64-rust
Can turn into a much larger task definition like this.
Cron
We ship “nightlies” of Firefox twice a day (making the name “nightly” a bit of a historical artifact). This, too, is controlled in-tree, and is general enough to support other time-based task-graphs such as Valgrind runs or Searchfox updates.
The approach is fairly simple: the hooks
service
creates a “cron task” for each project every 15 minutes. This task checks out
the latest revision of the project and runs a mach command that examines
.cron.yml
in the
root of the source tree. It then creates a decision task for each matching
entry, with a custom task-graph filter configuration to select only the desired
tasks.
Actions
For the most part, the task-graph for a push (or cron task) is defined in advance. But developers and sheriffs often need to modify a task-graph after it is created, for example to retrigger a task or run a test that was accidentally omitted from a try push. Taskcluster defines a generic notion of an “action” for just this purpose: acting on an existing task-graph.
Briefly, the decision task publishes a description of the actions that are available for the tasks in the task-graph. Services like Treeherder and the Taskcluster task inspector then use that description to connect user-interface elements to those actions. When an action is executed, the user interface creates a new task called an action task that performs the desired action.
Action tasks are similar to decision and cron tasks: they clone the desired revision of the source code, then run a mach command to do whatever the user has requested.
Multiple Projects
The task-graph generation code rides the trees, with occasional uplifts, just like the rest of the Firefox codebase. That means that the same code must work correctly for all branches; we do not have a different implementation for the mozilla-beta branch, for example.
While it might seem like, to run a new task on mozilla-central, you would just land a patch adding that task on mozilla-central, it’s not that simple: without adjusting the filtering, that task would eventually be merged to all other projects and execute everywhere.
This also makes testing tricky: since the task-graph generation is different
for every project, it’s possible to land code which works fine in try and
inbound, but fails on mozilla-central. It is easy to test task-graph generation
against specific situations (all inputs to the process are encapsulated in a
parameters.yml
file easily downloaded from a decision task). The artistry is
in figuring out which situations to test.
Try Pushes
Pushes to try trigger decision tasks just like any other project, but the filtering process is a little more complex.
If the push comes with legacy try syntax (-b do -p win64,linux64 -u
all[linux64-qr,windows10-64-qr] -t all[linux64-qr,windows10-64-qr]
- clear as
mud, right?), we do our best to emulate the behavior of the Buildbot try parser
in filtering out tasks that were not requested. The legacy syntax is deeply
dependent on some Buildbot-specific features, and does not cover new
functionality like optimization, so there are lots of edge cases where it
behaves strangely or does not work at all.
The better alternative is
try-select,
where the push contains a try_task_config.json
listing exactly which tasks to
run, along with desired modifications to those tasks. The command ./mach try
fuzz
creates such a file. In this case, creating the target task-graph is as
simple as filtering for tasks that match the supplied list.
Conclusion
This has been a long post! The quote “make everything as simple as possible and no simpler”, commonly attributed to Einstein, holds the reason – the task-graph generation system satisfies an incredibly complex set of requirements. In designing the system, we considered these requirements holistically and with a deep understanding of how they developed and why they exist, and then designed a system that was as simple as possible. The remaining complexity is inherent to the problem it solves.
The task-graph generation is covered in the Firefox
source-docs
and its source is in the
/taskcluster
directory in the Firefox source tree.
February 23, 2018
Dustin Mitchell
Internship Applications: Make the First Move
There’s an old story about Charles Proteus Steinmetz, a famous GE engineer in the early 20th century. He was called to one of Henry Ford’s factories, where a huge generator was having problems that the local engineers could not solve. After some investigation and calculation, Steinmetz made a mark on the shell of the generator and told the factory engineers to open that spot and replace the windings there. He later sent a bill for his services to Henry Ford: $10,000. Ford demanded an itemized bill – after all, Steinmetz had only made a single mark on the generator. The bill came back: “Making chalk mark on generator: $1. Knowing where to make mark: $9,999.”
Like electrical engineering, software development is more than just writing code. Sometimes it can take hours to write a 3-line patch. The hard part is knowing what patch to write.
It takes time to understand the system you’re developing and the systems it interacts with. Just understanding the problem you’re trying to solve can take some lengthy pondering. There are often new programming languages involved, or new libraries or tools. Once you start writing the code, new complications come up, and you must adjust course.
Experienced software engineers can make this look easy. They have an intuitive sense of what is important and what can be safely ignored, and for what problems might come up later. This is probably the most important skill for newcomers to the field to work on.
Make the First Move
Lately, I’ve gotten dozens of emails from Google Summer of Code and Outreachy applicants that go like this:
Dear Sir,
I am interested in the Outreachy project “…”. I have a background in JavaScript, HTML, CSS, and Java. Please connect me with a mentor for this project.
I’ve also seen dozens of bug comments like this:
I would like to work on this bug. Please guide me in what steps to take.
There is nothing inherently wrong with these messages. It’s always OK to ask for help.
What’s missing is evidence that the applicant has made any effort to get started. In the first case, the applicant did not even read the full project description, which indicates that the next step is to make a contribution and has links to tools for finding those contributions. In the second case, it seems that the applicant has not even taken the first steps toward solving the bug. In most cases, they have not even read the bug!
If my first instructions to an applicant are “start by reading the bug” or “taskcluster-lib-app is at https://github.com/taskcluster/taskcluster-lib-app” (something Google will happily tell you in 0.55 seconds), that suggests the applicant’s problem-solving skills need some serious work. While GSoC and Outreachy are meant to be learning experiences, we look for applicants who are able to make the most of the experience by learning and growing on their own. A participant who asks “what is the next step” at every step, without ever trying to figure out what steps to take, is not going to learn very much.
Advice
If you are applying to a program like Google Summer of Code or Outreachy, take the time to try to problem-solve before asking for help. There is nothing wrong with asking for help. But when you do, show what you have already figured out, and ask a specific question. For example:
I would like to work on this bug. It seems that this would require modifying the
taskcluster-lib-scopes
library to add a formatter function. I can see how this formatter would handle anyOf and allOf, but how should it format a for loop?
This comment shows that the applicant has done some thinking about the problem already, and I can see exactly where they have gotten stuck.
January 19, 2018
Dustin Mitchell
Taskcluster Redeployability
Taskcluster To Date
Taskcluster has always been open source: all of our code is on Github, and we get lots of contributions to the various repositories. Some of our libraries and other packages have seen some use outside of a Taskcluster context, too.
But today, Taskcluster is not a project that could practically be used outside of its single incarnation at Mozilla.
For example, we hard-code the name taskcluster.net
in a number of places, and we include our config in the source-code repositories.
There’s no legal or contractual reason someone else could not run their own Taskcluster, but it would be difficult and would almost certainly break the next time we made a change.
The Mozilla incarnation is open to use by any Mozilla project, although our focus is obviously Firefox and Firefox-related products like Fennec. This was a practical decision: our priority is to migrate Firefox to Taskcluster, and that is an enormous project. Maintaining an abstract ability to deploy additional instances while working on this project was just too much work for a small team.
The good news is, the focus is now shifting. The migration from Buildbot to Taskcluster is nearly complete, and the remaining pieces are related to hardware deployment, largely by other teams. We are returning to work on something we’ve wanted to do for a long time: support redeployability.
Redeployability
Redeployability means that Taskcluster can be deployed easily, multiple times, similar to OpenStack or Hadoop. If, when we finish this effort, there exist several distinct “instances” of Taskcluster in the world, then we have been successful. We will start by building a “staging” deployment of the Firefox instance, then move on to deploy instances that see production usage, first for other projects in Mozilla, and later for projects outside of Mozilla.
In deciding to pursue this approach, we considered three options:
- Taskcluster as a service (TCaaS) – we run the single global Taskcluster instance, providing that service to other projects just like Github or Travis-CI.
- Firefox CI – Taskcluster persists only as a one-off tool to support Firefox’s continuous integration system
- Redeployability (redeployability) – we provide means for other projects to run dedicated Taskcluster instances
TCaaS allows us to provide what we believe is a powerful platform for complex CI systems to a broad audience. While not quite as easy to get started with, Taskcluster’s flexibility extends far beyond what even a paid plan with CircleCI or Travis-CI can provide. However, this approach would represent a new and different business realm for Mozilla. While the organization has lots of public-facing services like MDN and Addons, other organizations do not depend on these services for production usage, nor do they pay us a fee for use of those services. Defining and meeting SLAs, billing, support staffing, abuse response – none of these are areas of expertise within Mozilla, much less the Taskcluster team. TCaaS would also require substantial changes to the platform itself to isolate paying customers from one another, hide confidential data, accurately meter usage, and so on.
Firefox CI is, in a sense, a scaled-back, internal version of TCaaS: we provide a service, but to only one customer (Firefox Engineering). It would mean transitioning the team to an operations focus, with little or no further development on the platform. It would also open the doors to Firefox-specific design within Taskcluster, such as checking out the Gecko source code in the workers or sorting queued tasks by Gecko branch. This would also shut the door to other projects such as Rust relying on Taskcluster.
Redeployability represents something of a compromise between the other two options. It allows us to make Taskcluster available outside of the narrow confines of Firefox CI without diving into a strange new business model. We’re Mozilla – shipping open source software is right in our wheelhouse.
It comes with some clear advantages, too:
-
Like any open-source project, users will contribute back, focusing on the parts of the system most related to their needs. Most Taskcluster users will be medium- to large-scale engineering organizations, and thus able to dedicate the resources to design and develop significant new features.
-
A well-designed deployment system will help us improve operations for Firefox CI (many of our outages today are caused by deployment errors) and enable deployment by teams focused on operations.
-
We can deploy an entire staging instance of Firefox’s Taskcluster, allowing thorough testing before deploying to production. The current approach to staging changes is ad-hoc and differs between services, workers, and libraries.
Challenges
Of course, the redeployability project is not going to be easy. The next few sections highlight some of the design challenges we are facing. We have begun solving all of these and more, but as none of the solutions are set in stone I will focus just on the challenges themselves.
Deployment Process
Deploying a set of microservices and backend services like databases is pretty easy: tools like Kubernetes are designed for the purpose. Taskcluster, however, is a little more complicated. The system uses a number of cloud providers (packet.net, AWS, and Azure), each of which needs to be configured properly before use.
Worker deployment is a complicated topic: workers must be built into images that can run in cloud services (such as AMIs), and those images must be capable of starting and contacting the Queue to fetch work without further manual input. We already support a wide array of worker deployments on the single instance of Taskcluster, and multiple deployments would probably see an even greater diversity, so any deployment system will need to be extremely flexible.
We want to use the deployment process for all deployments, so it must be fast and reliable. For example, to deploy a fix to the Secrets service, I would modify the configuration to point to the new version and initiate a full re-deploy of the Taskcluster instance. If the deployment process causes downtime by restarting every service, or takes hours to complete, we will find ourselves “cheating” and deploying things directly.
Client Libraries
The Taskcluster client libraries contain code that is generated from the API specification for the Taskcluster services.
That means that the latest taskcluster
package on PyPi corresponds to the APIs of the services as they are currently deployed.
If an instance of Taskcluster is running an older version of those services, then the newer client may not be able to call them correctly.
Likewise, an instance created for development purposes might have API methods that aren’t defined in any released version of the client libraries.
A related issue is service discovery: how does a client library find the right URL for a particular service? For platform services like the Queue and Auth, this is fairly simple, but grows more complex for services which might be deployed several times, such as the AWS provisioner.
Configuration and Customization
No two deployments of Taskcluster will be exactly alike – that would defeat the purpose. We must support a limited amount of flexibility: which services are enabled, what features of those services are enabled, and credentials for the various cloud services we use.
In some cases the configuration for a service relies on values derived from another service that must already be started.
For example, the Queue needs Taskcluster credentials generated by calling createClient on a running Auth service.
Upgrades
Many of the new features we have added in Taskcluster have been deployed through a carefully-choreographed, manual process. For example, to deploy parameterized roles support, which involved a change to the Auth service’s backend support, I disabled writes to the backend, carefully copied the data to the new backend, then landed a patch to begin using the new backend with the old frontend, and so on. We cannot expect users to follow hand-written instructions for such delicate dances.
Conclusion
The Taskcluster team has a lot of work to do. But this is a direction many of us have been itching to move for several years now, so we are eagerly jumping into it. Look for more updates on the redeployability project in the coming months!
July 19, 2017
Chinmay Kousik
Livelog Proxy(WebhookTunnel): Final Work Product
The project was initially named Livelog Proxy, but during the community bonding period it was renamed to Webhooktunnel, as that more accurately captured the full scope of the project. The Webhooktunnel repository can be found here.
Tasks Completed:
- [x] Main webhooktunnel project.
- [x] Taskcluster Auth integration
- [x] Deployment to docker cloud
- [x] taskcluster-worker integration
- [x] docker-worker integration [Stretch Goal]
- [ ] generic-worker integration [Stretch Goal]
- [ ] Routing DHT [Stretch Goal]
Webhooktunnel Details:
Webhooktunnel works by multiplexing HTTP requests over a WebSocket connection. This allows clients to connect to the proxy and serve webhooks over the websocket connection instead of exposing a port to the internet.
The connection process for clients (workers) is explained in the diagram below:
The client (worker) needs an ID and a JWT to connect to the proxy. These are supplied by tc-auth. The proxy (whtunnel) responds by upgrading the HTTP(S) connection to a websocket connection and supplies the client’s base URL in a response header.
An example of request forwarding works as follows:
Webhooktunnel can also function as a websocket proxy.
Webhooktunnel has already been integrated into taskcluster-worker and is used for serving livelogs from task builds.
The core of Webhooktunnel is the multiplexing library wsmux.
Wsmux allows creating client and server sessions over a WebSocket connection and creates multiplexed streams over the
connection. These streams are exposed using a net.Conn
interface.
Webhooktunnel also consists of a command line client, which can forward incoming connections from the proxy to a local port. This is useful as it can be used by servers which are behind a NAT/Firewall.
June 16, 2017
Chinmay Kousik
WebSocket Multiplexer Overview
General Idea
WebSocket multiplexer enables creation of multiple TCP-like streams over a WebSocket connection. Since each stream can be treated as a separate net.Conn
instance, it is used by other components to proxy HTTP requests. A new stream can be opened for each request, and they can be handled in a manner identical to TCP streams. Wsmux contains two components: Sessions and Streams. Sessions wrap WebSocket connections and allow management of streams over the connection. Session implements net.Listener
, and can be used by an http.Server
instance to serve requests. Streams are the interface which allow users to send and receive multiplexed data. Streams implement net.Conn
. Streams have internal mechanisms for buffering and congestion control.
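Because a Session satisfies net.Listener, serving HTTP over it looks just like serving over TCP. The sketch below uses a plain TCP listener so it runs anywhere; the commented-out wsmux call is an assumption about the package's API, shown only to indicate where the session would slot in:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

func main() {
	// In the worker, the listener would be a wsmux session over the proxy's
	// websocket connection, something like (assumed API):
	//   session := wsmux.Server(websocketConn, wsmux.Config{})
	//   http.Serve(session, handler)
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	fmt.Println("serving on", ln.Addr())

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "livelog would be streamed here")
	})
	// Each accepted stream is handled like an ordinary HTTP connection.
	if err := http.Serve(ln, handler); err != nil {
		panic(err)
	}
}
```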
Why WebSocket?
The decision to use WebSocket (github.com/gorilla/websocket
) instead of supporting a net.Conn
was made for the following reasons:
- WebSocket handshakes can be used for intitiating a connection instead of writing a custom handshake. Wsmux can be used as a subprotocol in the WebSocket handshake. This greatly simplifies the implementation of wsmux.
- WebSocket convenience methods (ReadMessage and WriteMessage) simplify sending and receiving of data and control frames.
- Control messages such as ping, pong, and close need not be implemented separately in wsmux. WebSocket control frames can be used for this purpose.
- Adding another layer of abstraction over WebSocket enables connections to be half-closed. WebSocket does not allow for half closed connections, but wsmux streams can be half closed, thus simplifying usage.
- Since WebSocket frames already contain the length of the message, the length field can be dropped from wsmux frames. This reduces the size of the wsmux header to 5 bytes.
Framing
WebSocket multiplexer implements a very simple framing technique. The total header size is 5 bytes.
[ message type - 8 bits ][ stream ID - 32 bits ]
Messages can have the following types:
- msgSYN: Used to initiate a connection.
- msgACK: Used to acknowledge bytes read on a connection.
- msgDAT: Signals that data is being sent.
- msgFIN: Signals stream closure.
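Packing and unpacking that 5-byte header is straightforward. Here is a hedged Go sketch; the byte order is an assumption, since the post does not specify it.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Message types as described above.
const (
	msgSYN byte = iota // initiate a new stream
	msgACK             // acknowledge bytes read
	msgDAT             // data frame
	msgFIN             // stream closure
)

// header packs the 5-byte wsmux header: 1 byte message type followed by a
// 32-bit stream ID. Big-endian is assumed here for illustration.
func header(msgType byte, streamID uint32) [5]byte {
	var h [5]byte
	h[0] = msgType
	binary.BigEndian.PutUint32(h[1:], streamID)
	return h
}

func main() {
	h := header(msgDAT, 7)
	fmt.Printf("% x\n", h) // prints: 02 00 00 00 07
}
```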
Sessions
A Session wraps a WebSocket connection and enables usage of the wsmux subprotocol. A Session is typically used to manage the various wsmux streams. Sessions are of two types: Server and Client. The only difference between a Server Session and a Client Session is that the ID of a stream created by a Server Session will be even numbered, while the ID of a stream created by a Client will be odd numbered. A connection must have only one Server and one Client. Sessions read messages from the WebSocket connection and forward the data to the appropriate stream. Streams are responsible for buffering and framing of data. Streams must send data by invoking the send method of their Session.
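The even/odd split lets both ends allocate stream IDs without coordinating. Roughly (illustrative only, not the wsmux source):

```go
package sketch

// idAllocator hands out stream IDs without coordination between the two
// ends of the connection: Server sessions use even IDs, Client sessions
// use odd IDs, so the two sides can never collide.
type idAllocator struct {
	next uint32
}

func newIDAllocator(isServer bool) *idAllocator {
	if isServer {
		return &idAllocator{next: 2}
	}
	return &idAllocator{next: 1}
}

func (a *idAllocator) allocate() uint32 {
	id := a.next
	a.next += 2
	return id
}
```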
Streams
Streams allow users to interface with data tagged with a particular ID, called the stream ID. Streams contain circular buffers for congestion control and are also responsible for generating and sending msgACK frames to the remote session whenever data is read. Streams handle framing of data when data is being sent, and also allow setting deadlines for Read and Write calls. Internally, streams are represented using a finite state machine, which has been described in a previous blog post. Performance metrics for streams have also been measured and are available here.
Conclusion
Wsmux is being used by the two major components of Webhooktunnel: Webhook Proxy, and Webhook Client. It has been demonstrated that wsmux can be used to multiplex HTTP requests reliably, and can also support WebSocket over wsmux streams.
The repository can be found here.
May 24, 2017
Chinmay Kousik
Stream Metrics
In the previous post, I gave a brief explanation of how the stream has been refactored to resemble a finite state machine. This post elaborates on the performance metrics of streams based on buffer size and number of concurrent streams.
Buffers
Each stream has an internal buffer which is used to store data sent to it from the remote side. The default buffer size is currently 1024 bytes. The buffer size is immutable and cannot be changed once the stream has been created. The buffer is internally implemented as a circular queue of bytes, and implements io.ReadWriter. When a stream is created, the stream assumes the remote buffer capacity to be zero. When the stream is accepted, the remote connection informs the stream of its buffer size and the remote buffer capacity is updated. Streams are set up to track remote capacity, unblocking bytes when an ACK frame arrives and reducing remote capacity when a certain number of bytes are written. A stream can only write as many bytes as the remote capacity allows, and will block writes until further bytes are unblocked. Thus, buffer size has a significant effect on performance.
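The send side of that flow control can be sketched as follows. This is a simplified model under assumed names, not the wsmux implementation; framing and error handling are left out.

```go
package sketch

import "sync"

// sendWindow models the sender's view of the remote buffer: writes may
// only proceed while the remote side has capacity, and capacity is
// returned when an ACK frame reports bytes read at the other end.
type sendWindow struct {
	mu       sync.Mutex
	capacity int // bytes the remote buffer can currently accept
	acked    *sync.Cond
}

func newSendWindow() *sendWindow {
	w := &sendWindow{} // remote capacity starts at zero, as described above
	w.acked = sync.NewCond(&w.mu)
	return w
}

// onACK is called when an ACK frame arrives, unblocking pending writes.
func (w *sendWindow) onACK(n int) {
	w.mu.Lock()
	w.capacity += n
	w.mu.Unlock()
	w.acked.Broadcast()
}

// reserve blocks until some remote capacity is available and returns how
// many of the requested bytes may be written now; the caller loops until
// everything has been sent.
func (w *sendWindow) reserve(want int) int {
	w.mu.Lock()
	defer w.mu.Unlock()
	for w.capacity == 0 {
		w.acked.Wait()
	}
	n := want
	if n > w.capacity {
		n = w.capacity
	}
	w.capacity -= n
	return n
}
```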
The following plot shows the time taken for a Write() call over 100 concurrent streams as a function of buffer size.
1500 bytes are sent and echoed back over each stream. It is clear that the time taken decreases exponentially with buffer size. This is because smaller buffers require more messages to be sent over the websocket connection. A stream with a 1024 byte buffer needs to exchange a minimum of 3 messages for the data to be completely sent to the remote connection: write 1024 bytes, receive an ACK for at least 476 bytes, write the remaining 476 bytes. A stream with a large enough buffer can write the data using a single message. The intended buffer size is 4k.
Concurrent Streams
Each session is capable of handling concurrent streams. This test keeps the buffer size constant at 1024 bytes and varies the number of concurrent streams.
The following plot describes the time taken to echo 1500 bytes over each stream with a buffer size of 1k as a function of number of concurrent streams:
A quadratic curve fits this data well. A reason for this could be a limit on the throughput of the websocket connection.
May 16, 2017
Chinmay Kousik
Stream States Part 1
Streams can be modelled as an FSM by determining the different states a stream can be in and all valid state transitions. Initially, a stream is in the created state.
This state signifies that the stream has been created. This is possible in two different ways: a) the stream was created by the session, added to the stream map, and a SYN packet with the stream’s ID was sent to the remote session, or b) a SYN packet was received from the remote session and a new stream was created and added to the stream map.
In case of (a) the stream waits for an ACK packet from the remote session, and as soon as the ACK packet arrives it transitions to the accepted state. In case of (b) the session sends an ACK packet, and the stream transitions to the accepted state.
Once in the accepted state, data can be read from and written to the stream. When a DAT packet arrives, the data is pushed to the stream’s buffer. When data is read out of the buffer using a Read() call, an ACK packet is sent to the remote stream with the number of bytes read. When an ACK packet is received in the accepted state, the number of bytes unblocked (the number of bytes the remote session is willing to accept) is updated. If the stream is closed by a call to Close(), then the stream transitions to the closed state and sends a FIN packet to the remote stream. When a FIN packet is received, the stream transitions to the remoteClosed state.
In the closed state, the stream cannot write any data to the remote connection. All Write() calls return an ErrBrokenPipe error. The stream can still receive data, and can read data from the buffer.
The remoteClosed state signifies that the remote stream will not send any more data to the stream. Read() calls can still read data from the buffer. If the buffer is empty then Read() calls return EOF. The stream can write data to the remote session.
If a FIN packet is received while in the closed state, or Close() is called in the remoteClosed state, the stream transitions to the dead state. All Write() calls fail in the dead state, but Read() can still retrieve data from the buffer. If the stream is in the dead state and the buffer is empty, the stream is removed by its Session.
The state transitions can be summed up in the following diagram:
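For readers who prefer code to diagrams, the same transitions can be summarised in a small Go sketch. The state names mirror the ones above; the transition function is illustrative, not the actual wsmux code.

```go
package sketch

// streamState enumerates the states described above.
type streamState int

const (
	created      streamState = iota // SYN sent or received, waiting on the ACK exchange
	accepted                        // open: data can be read and written
	closed                          // we sent FIN: no more writes, reads still allowed
	remoteClosed                    // remote sent FIN: no more incoming data, writes still allowed
	dead                            // both sides closed; removed once the buffer drains
)

// onEvent applies one transition from the description above. "ackReceived"
// covers case (a); in case (b) the session moves the stream straight to
// accepted after sending its own ACK. "close" is a local Close() call and
// "finReceived" is a FIN frame from the remote stream.
func onEvent(s streamState, event string) streamState {
	switch {
	case s == created && event == "ackReceived":
		return accepted
	case s == accepted && event == "close":
		return closed
	case s == accepted && event == "finReceived":
		return remoteClosed
	case s == closed && event == "finReceived":
		return dead
	case s == remoteClosed && event == "close":
		return dead
	default:
		return s // other events leave the state unchanged
	}
}
```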
May 08, 2017
Chinmay Kousik
GSOC Project: Webhook Tunnel
I got accepted to Google Summer of Code (GSoC) 2017. I will be working with Mozilla Taskcluster, and my project is Webhook Tunnel (we changed the name from livelog proxy). TaskCluster workers are hosted on services such as EC2 and currently expose ports to the internet and allow clients to call API endpoints. This may not be feasible in a data center setup. The webhook proxy aims to mitigate this problem by allowing workers to connect to a proxy (part of webhook tunnel) over an outgoing WebSocket connection, and the proxy in turn exposes API endpoints to the internet. This is implemented as a distributed system for handling high loads.
This is similar to ngrok, or localtunnel, but a key difference is that instead of providing a port that clients can connect to, webhook tunnel exposes APIs as “<worker-id>.taskcluster-proxy.net/<endpoint>”. This is a much more secure way of exposing endpoints.
The initial plan is to deploy this on Docker Cloud. Details will follow in further posts.
August 03, 2016
Selena Deckelmann
TaskCluster 2016Q2 Retrospective
The TaskCluster Platform team worked very hard in Q2 to support the migration off Buildbot, bring new projects into our CI system and look forward with experiments that might enable fully-automated VM deployment on hardware in the future.
We also brought on 5 interns. For a team of 8 engineers and one manager, this was a tremendous team accomplishment. We are also working closely with interns on the Engineering Productivity and Release Engineering teams, resulting in a much higher communication volume than in months past.
We continued our work with RelOps to land Windows builds, and those are available in pushes to Try. This means people can use “one click loaners” for Windows builds as well as Linux (through the Inspect Task link for jobs)! Work on Windows tests is proceeding.
We also created try pushes for Mac OS X tests, and integrated them with the Mac OS X cross-compiled builds. This also meant deep diving into the cross-compiled builds to green them up in Q3 after some compiler changes.
A big part of the work for our team and for RelEng was preparing to implement a new kind of signing process. Aki and Jonas spent a good deal of time on this, as did many other people across PlatformOps. What came out of that work was a detailed specification for TaskCluster changes and for a new service from RelEng. We expect to see prototypes of these ideas by the end of August, and the major blocking changes to the workers and provisioner to be complete then too.
This all leads to being able to ship Linux Nightlies directly from TaskCluster by the end of Q3. We’re optimistic that this is possible, with the knowledge that there are still a few unknowns and a lot has to come together at the right time.
Much of the work on TaskCluster is like building a 747 in-flight. The microservices architecture enables us to ship small changes quickly and without much pre-arranged coordination. As time has gone on, we have consolidated some services (the scheduler is deprecated in favor of the “big graph” scheduling done directly in the queue), separated others (we’ve moved Treeherder-specific services into their own component, and are working to deprecate mozilla-taskcluster in favor of a taskcluster-hg component), and refactored key parts of our systems (intree scheduling last quarter was an important change for usability going forward). This kind of change is starting to slow down as the software and the team adapt and mature.
I can’t wait to see what this team accomplishes in Q3!
Below is the team’s partial list of accomplishments and changes. Please drop by #taskcluster or drop an email to our tools-taskcluster lists.mozilla.org mailing list with questions or comments!
Things we did this quarter:
- initial investigation and timing data around using sccache for linux builds
- released update for sccache to allow working in a more modern python environment
- created taskcluster managed s3 buckets with appropriate policies
- tested linux builds with patched version of sccache
- tested docker-worker on packet.net for on hardware testing
- worked with jmaher on talos testing with docker-worker on releng hardware
- created livelog plugin for taskcluster-worker (just requires tests now)
- added reclaim logic to taskcluster-worker
- converted gecko and gaia in-tree tasks to use new v2 treeherder routes
- Updated gaia-taskcluster to allow github repos to use new taskcluster-treeherder reporting
- move docs, schemas, references to https
- refactor documentation site into tutorial / manual / reference
- add READMEs to reference docs
- switch from a * certificate to a SAN certificate for taskcluster.net
- increase accessibility of AWS provisioner by separating bar-graph stuff from workerType configuration
- use roles for workerTypes in the AWS provisioner, instead of directly specifying scopes
- allow non-employees to login with Okta, improve authentication experience
- named temporary credentials
- use npm shrinkwrap everywhere
- enable coalescing
- reduce the artifact retention time for try jobs (to reduce S3 usage)
- support retriggering via the treeherder API
- document azure-entities
- start using queue dependencies (big-graph-scheduler)
- worked with NSS team to have tasks scheduled and displayed within treeherder
- Improve information within docker-worker live logs to include environment information (ip address, instance type, etc)
- added hg fingerprint verification to decision task
- Responded and deployed patches to security incidents discovered in q2
- taskcluster-stats-collector running with signalfx
- most major services using signalfx and sentry via new monitoring library taskcluster-lib-monitor
- Experimented with QEMU/KVM and libvirt for powering a taskcluster-worker engine
- QEMU/KVM engine for taskcluster-worker
- Implemented Task Group Inspector
- Organized efforts around front-end tooling
- Re-wrote and generalized the build process for taskcluster-tools and future front-end sites
- Created the Migration Dashboard
- Organized efforts with contractors to redesign and improve the UX of the taskcluster-tools site
- First Windows tasks in production – NSS builds running on Windows 2012 R2
- Windows Firefox desktop builds running in production (currently shown on staging treeherder)
- new features in generic worker (worker type metadata, retaining task users/directories, managing secrets in secrets store, custom drive for user directories, installing as a startup item rather than service, improved syscall integration for logins and executing processes as different users)
- many firefox desktop build fixes including fixes to python build scripts, mozconfigs, mozharness scripts and configs
- CI cleanup https://travis-ci.org/taskcluster
- support for relative definitions in jsonschema2go
- schema/references cleanup
Paying down technical debt
- Fixed numerous issues/requests within mozilla-taskcluster
- properly schedule and retrigger tasks using new task dependency system
- add more supported repositories
- Align job state between treeherder and taskcluster better (i.e cancels)
- Add support for additional platform collection labels (pgo/asan/etc)
- fixed retriggering of github tasks in treeherder
- Reduced space usage on workers using docker-worker by removing temporary images
- fixed issues with gaia decision task that prevented it from running since March 30th.
- Improved robustness of image creation
- Fixed all linter issues for taskcluster-queue
- finished rolling out shrinkwrap to all of our services
- began trial of having travis publish our libraries (rolled out to 2 libraries now. talking to npm to fix a bug for a 3rd)
- turned on greenkeeper everywhere then turned it off again for the most part (it doesn’t work with shrinkwrap, etc)
- “modernized” (newer node, lib-loader, newest config, directory structure, etc) most of our major services
- fix a lot of subtle background bugs in tc-gh and improve logging
- shared eslint and babel configs created and used in most services/libraries
- instrumented taskcluster-queue with statistics and error reporting
- fixed issue where task dependency resolver would hang
- Improved error message rendering on taskcluster-tools
- Web notifications for one-click-loaner UI on taskcluster-tools
- Migrated stateless-dns server from tutum.co to docker cloud
- Moved provisioner off azure storage development account
- Moved our npm package to a single npm organization
June 27, 2016
Wander Lairson Costa
The taskcluster-worker Mac OSX engine
In this quarter, I worked on implementing the taskcluster-worker Mac OSX engine. Before talking about this specific implementation, let me explain what a worker is and how taskcluster-worker differs from docker-worker, currently the main worker in Taskcluster.
The role of a Taskcluster worker
When a user submits a task graph to Taskcluster, contrary to what you might expect (at least if you are used to how OS schedulers usually work), these tasks are submitted to the scheduler first, which is responsible for processing dependencies and enqueueing them. The Taskcluster manual page has a clear picture illustrating this concept.
The provisioner is responsible for looking at the queue, determining how many pending tasks exist and, based on that, launching worker instances to run those tasks.
Then comes the figure of the worker. The worker is responsible for actually executing the task. It claims a task from the queue, runs it, uploads the generated artifacts and submits the status of the finished task, using the Taskcluster APIs.
docker-worker is a worker that runs task commands inside a docker container. The task payload specifies a docker image as well as a command line to run, among other environment parameters. docker-worker pulls the specified docker image and runs the task commands inside it.
taskcluster-worker and the OSX engine
taskcluster-worker
is a generic and modularized worker under active
development by the Taskcluster team. The worker delegates the task execution
to one of the available
engines.
An engine is a component of taskcluster-worker responsible for running a task
under a specific system environment. Other features, like environment variable
setting, live logging, artifact uploading, etc., are handled by
worker plugins.
I am implementing the Mac OSX engine, which will mainly be used to run
Firefox automated tests in the Mac OSX environment. There is a
macosx
branch in
my personal Github taskcluster-worker fork in which I push my commits.
One specific aspect of the engine implementation is the ability to run more than one task at the same time. For this, we need to implement some kind of task isolation. For docker-worker, each task ran in its own docker container, so tasks were isolated by definition. But there is no such thing as a container for the OSX engine. Our earlier tries with chroot failed miserably, due to incompatibilities with the OSX graphics system. Our final solution was to create a new user on the fly and run the task with this user’s credentials. This not only provides some task isolation, but also prevents privilege escalation attacks by running tasks with a different user than the worker.
Instead of dealing with the poorly documented Open Directory Framework, we chose to spawn the dscl command to create and configure users. Tasks usually take a long time to execute, spawning loads of subprocesses, so a few spawns of the dscl command won’t have any practical performance impact.
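For illustration, shelling out to dscl from Go might look something like the sketch below. The attribute values, UID choice and naming scheme here are assumptions made up for the example, not the engine's actual settings.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// createTaskUser shells out to dscl to create a throwaway user for a task,
// in the spirit of what the engine does. The caller must have sufficient
// privileges for dscl to succeed.
func createTaskUser(name string, uid int) error {
	record := "/Users/" + name
	steps := [][]string{
		{"dscl", ".", "-create", record},
		{"dscl", ".", "-create", record, "UserShell", "/bin/bash"},
		{"dscl", ".", "-create", record, "UniqueID", fmt.Sprint(uid)},
		{"dscl", ".", "-create", record, "PrimaryGroupID", "20"},
		{"dscl", ".", "-create", record, "NFSHomeDirectory", record},
	}
	for _, args := range steps {
		if out, err := exec.Command(args[0], args[1:]...).CombinedOutput(); err != nil {
			return fmt.Errorf("%v failed: %v (%s)", args, err, out)
		}
	}
	return nil
}

func main() {
	if err := createTaskUser("task-user-1234", 600); err != nil {
		log.Fatal(err)
	}
}
```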
One final aspect is how we bootstrap task execution. A task boils down to a script that executes the task’s duties. But where does this script come from? It doesn’t live on the machine that runs the worker. The OSX engine provides a link field in the task payload through which a task can specify an executable to download and execute.
Running the worker
The OSX engine will primarily be used to execute Firefox tests on Mac OSX, and the environment is expected to have a very specific set of tools and configurations. Because of that, I am testing the code on a loaner machine. To start the worker, it is just a matter of opening a terminal and typing:
$ ./taskcluster-worker work macosx --logging-level debug
The worker connects to the Taskcluster queue, then claims and executes the available tasks. At the time of writing, all tests but “Firefox UI functional tests” were green, running on optimized Firefox OSX builds. We intend to land Firefox tests in taskcluster-worker as Tier-2 next quarter, running them in parallel with Buildbot.
May 02, 2016
Maja Frydrychowicz
Not Testing a Firefox Build (Generic Tasks in TaskCluster)
A few months ago I wrote about my tentative setup of a TaskCluster task that was neither a build nor a test. Since then, gps has implemented “generic” in-tree tasks so I adapted my initial work to take advantage of that.
Triggered by file changes
All along I wanted to run some in-tree tests without having them wait around for a Firefox build or any other dependencies they don’t need. So I originally implemented this task as a “build” so that it would get scheduled for every incoming changeset in Mozilla’s repositories.
But forget “builds”, forget “tests” — now there’s a third category of tasks that we’ll call “generic” and it’s exactly what I need.
In base_jobs.yml I say, “hey, here’s a new task called marionette-harness
— run it whenever there’s a change under (branch)/testing/marionette/harness”. Of course, I can also just trigger the task with try syntax like try: -p linux64_tc -j marionette-harness -u none -t none
.
When the task is triggered, a chain of events follows:
- marionette-harness is defined by harness_marionette.yml, which depends on harness_test.yml
- harness_test.yml says to run build.sh with the appropriate mozilla branch and revision.
- harness_marionette.yml sets more environment variables and parameters for build.sh to use (JOB_SCRIPT, MOZHARNESS_SCRIPT, etc.)
- So build.sh checks out the source tree and executes harness-test-linux.sh (JOB_SCRIPT)…
- …which in turn executes marionette_harness_tests.py (MOZHARNESS_SCRIPT) with the parameters passed on by build.sh
For Tasks that Make Sense in a gecko Source Checkout
As you can see, I made the build.sh
script in the desktop-build
docker image execute an arbitrary in-tree JOB_SCRIPT
, and I created harness-test-linux.sh
to run mozharness within a gecko source checkout.
Why not the desktop-test image?
But we can also run arbitrary mozharness scripts thanks to the configuration in the desktop-test docker image! Yes, and all of that configuration is geared toward testing a Firefox binary, which implies downloading tools that my task either doesn’t need or already has access to in the source tree. Now we have a lighter-weight option for executing tests that don’t exercise Firefox.
Why not mach?
In my lazy work-in-progress, I had originally executed the Marionette harness tests via a simple call to mach, yet now I have this crazy chain of shell scripts that leads all the way to mozharness. The mach command didn’t disappear — you can still run Marionette harness tests with ./mach python-test .... However, mozharness provides clearer control of Python dependencies, appropriate handling of return codes to report test results to Treeherder, and I can write a job-specific script and configuration.
April 01, 2016
Wander Lairson Costa
Overcoming browser same origin policy
One of my goals for 2016 Q1 was to write a monitoring dashboard for Taskcluster. It basically pings Taskcluster services to check if they are alive and also acts as a feed aggregator for services Taskcluster depends on. One problem with this approach is the same origin policy, in which web pages are only allowed to make requests to their own domain. Web servers for which such cross-domain requests are safe can implement either jsonp or CORS. CORS is the preferred way, so we will focus on it for this post.
Cross-origin resource sharing
CORS is a mechanism that allows the web server to tell the browser that it is safe to
accomplish a cross domain request. It consists of a set of HTTP headers with details
of the conditions for accomplishing the request. The main response header is
Access-Control-Allow-Origin
, which contains either a list of allowed domains or
a *
, indicating any domain can make a cross request to this server. In a CORS
request, only a small set of headers is exposed to the response object. The server
can tell the browser to expose additional headers through the
Access-Control-Expose-Headers
response header.
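As a quick illustration of those headers (written in Go here just for the sake of an example; this is not taken from cors-proxy or any Taskcluster service), a server that wants to allow cross-origin reads simply sets them on its responses:

```go
package main

import (
	"fmt"
	"net/http"
)

// ping answers a health-check request and opts in to cross-origin reads by
// setting the CORS response headers discussed above.
func ping(w http.ResponseWriter, r *http.Request) {
	// Any origin may read this response...
	w.Header().Set("Access-Control-Allow-Origin", "*")
	// ...and may also read this extra header, beyond the small default set.
	w.Header().Set("Access-Control-Expose-Headers", "X-Request-Id")
	w.Header().Set("X-Request-Id", "example-123")
	fmt.Fprintln(w, "pong")
}

func main() {
	http.HandleFunc("/v1/ping", ping)
	http.ListenAndServe(":8080", nil)
}
```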
But what if the web server doesn’t implement CORS? The only solution is to provide a proxy that will make the actual request and add the CORS headers.
cors-proxy
To allow the monitoring dashboard to make status requests to remote services
that do not implement CORS, we created the
cors-proxy. It exports a /request
endpoint that allows you to make requests to any remote host. cors-proxy redirects
it to the remote URL and sends the responses back, with appropriate CORS headers set.
Let’s see an example:
$.ajax({
  url: 'https://cors-proxy.taskcluster.net/request',
  method: 'POST',
  contentType: 'application/json',
  data: JSON.stringify({
    url: 'https://queue.taskcluster.net/v1/ping',
  }),
}).done(function(res) {
  console.log(res);
});
The information about the remote request is sent in the proxy request body. All parameter fields are shown on the project page.
Before you think of using the hosted server to proxy your own requests, note that cors-proxy only honors requests from a whitelist. So, only some subdomains under the Taskcluster domain can use cors-proxy.
March 30, 2016
Pete Moore
Walkthrough installing Cygwin SSH Daemon on AWS EC2 instances
One of the challenges we face at Mozilla is supporting Windows in an organisational environment which is predominantly *nix oriented. Furthermore, historically our build and test infrastructure has only provided a very limited ssh daemon, with an antiquated shell, and outdated unix tools.
With the move to hosting Windows environments in AWS EC2, the opportunity arose to review our current SSH daemon, and see if we couldn’t do something a little bit better.
When creating Windows environments in EC2, it is possible to launch a “vanilla” Windows instance, from an AMI created by Amazon. This instance is based on a standard installation of a given version of Windows, with a couple of AWS EC2 tools preinstalled.
One of the features of the preinstalled tools, is that they allow you to specify powershell and/or batch script snippets inside the instance User Data, that will be executed upon launch.
This makes it quite trivial to customise a Windows environment, by providing all of the customisation steps as a PowerShell snippet in the instance User Data.
In this walkthrough, we will set up a Windows Server 2012 R2 machine with the cygwin ssh daemon preinstalled. In order to follow this walkthrough, you will need an AWS account and the ability to spawn an instance.
Install AWS CLI
Although all of these steps can be performed via the web console, typically we would want to automate them. Therefore in this walkthrough, I’m using the AWS CLI to perform all of the actions, to make it easier should you want to script any of the setup.
Windows installation
Download and run the 64-bit or 32-bit Windows installer.
Mac and Linux installation
Requires Python 2.6.5 or higher.
Install using pip.
Further help
See the AWS CLI guide if you get stuck.
Configuring AWS credentials
If this is your first time running the AWS CLI tool, configure your credentials with:
See the AWS credentials configuration guide if you need more help.
Locate latest Windows Server 2012 R2 AMI (64bit)
The following command line will find you the latest Windows 2012 R2 stock image, provided by AWS, in your default region.
Now we can see what the current AMI is, in our default region, with:
Note, the actual AMI provided by AWS changes from week to week, and from region to region, so don’t be surprised if you get a different result to the one above.
Create a Security Group
We need our instance to be in a security group that allows us to SSH onto it.
First create a security group:
And then update it to only allow inbound SSH traffic:
Create a unique Client Token
We should create a unique client token that will allow us to make idempotent requests, should there be any failures. We will also use this as our “name” for the instance until we get the real instance name back.
Create a dedicated Key Pair
We’ll need to specify a key pair in order to retrieve the Windows Password. Let’s create a dedicated one just for this instance.
Create custom post-installation script
Typically, you’ll want to customise the cygwin environment, for example:
- Changing the bash prompt
- Setting vim options
- Adding ssh authorized keys
- ….
Let’s do this in a post installation bash script, which we can download as part of the installation.
In order to be able to authenticate with our new key, we’ll need to get the public part. Note, we could generate separate keys for ssh’ing to our machine, but we might as well reuse the key we just created.
Create User Data
The AWS Windows Guide advises us that Windows PowerShell commands can be executed if supplied as part of the EC2 User Data. We’ll use this userdata to install cygwin and the ssh daemon from scratch.
Create a file userdata
to store the User Data:
Fix SSH key
We need to replace the SSH public key placeholder we just referenced in userdata with the actual public key.
Launch new instance
We’re now finally ready to launch the instance. We can do this with the following commands:
You should get some output similar to this:
March 11, 2016
Selena Deckelmann
[workweek] tc-worker workweek recap
Sprint recap
We spent this week sprinting on the tc-worker, engines and plugins. We merged 19 pull requests and had many productive discussions!
tc-worker core
We implemented the task loop! This basic loop should start when the worker is invoked. It spins up a task claimer and manager responsible for claiming as many tasks as its available capacity allows and running them to completion. You can find details in this commit. We’re still working on some high level documentation.
We did some cleanups to make it easier to download and get started with builds. We fixed up packages related to generating go types from json schemas, and the types now conform to the linting rules.
We also implemented the webhookserver. The package provides implementations of the WebHookServer interface which allows attachment and detachment of web-hooks to an internet exposed server. This will support both the livelog and interactive features. Work is detailed in PR 37.
engine: hello, world
Greg created a proof of concept and pushed a successful task to emit a hello, world artifact. Greg will be writing up something to describe this process next week.
plugin: environment variables
Wander landed this plugin this week to support environment variable setting. The work is described in PR 39.
plugin: artifact uploads
This plugin will support artifact uploads for all engines to S3 and is based on generic-worker code. This work is started in PR 55.
TaskCluster design principles
We discussed as a team the ideas behind the design of TaskCluster. The umbrella principle we try to stick to is: Getting Things Built. We felt it was important to say that first because it helps us remember that we’re here to provide features to users, not just design systems. The four key design principles were distilled to:
- Self-service
- Robustness
- Enable rapid change
- Community friendliness
One surprising connection (to me) we made was that our privacy and security features are driven by community friendliness.
We plan to add our ideas about this to a TaskCluster “about” page.
TaskCluster code review
We discussed our process for code review, and how we’d like to do them in the future. We covered issues around when to do architecture reviews and how to get “pre-reviews” for ideas done with colleagues who will be doing our reviews. We made an outline of ideas and will be giving them a permanent home on our docs site.
Q2 Planning
We made a first pass at our 2016q2 goals. The main theme is to add OS X engine support to taskcluster-worker, continue work on refactoring intree config and build out our monitoring system beyond InfluxDB. Further refinements to our plan will come in a couple weeks, as we close out Q1 and get a better understanding of work related to the Buildbot to TaskCluster migration.
March 08, 2016
Selena Deckelmann
Tier-1 status for Linux 64 Debug build jobs on March 14, 2016
I sent this to dev-planning, dev-platform, sheriffs and tools-taskcluster today. I added a little more context for a non-Mozilla audience.
The time has come! We are planning to switch to Tier-1 on Treeherder for TaskCluster Linux 64 Debug build jobs on March 14. At the same time, we will hide the Buildbot build jobs, but continue running them. This means that these jobs will become what Sheriffs use to determine the health of patches and our trees.
On March 21, we plan to switch the Linux 64 Debug tests to Tier-1 and hide the related Buildbot test jobs.
After about 30 days, we plan to disable and remove all Buildbot jobs related to Linux Debug.
Background:
We’ve been running Linux 64 Debug builds and tests using TaskCluster side-by-side with Buildbot jobs since February 18th. Some of the project work that was done to green up the tests is documented here.
The new tests are running in Docker-ized environments, and the Docker images we use are defined in-tree and publicly accessible.
This work was the culmination of many months of effort, with Joel Maher, Dustin Mitchell and Armen Zambrano primarily focused on test migration this quarter. Thank you to everyone who responded to NEEDINFOs, emails and pings on IRC to help with untangling busted test runs.
On performance, we’re taking a 14% hit across all the new test jobs vs. the old jobs in Buildbot. We ran two large-scale tests to help determine where slowness might still be lurking, and were able to find and fix many issues. There are a handful of jobs remaining that seem significantly slower, while others are significantly faster. We decided that it was more important to deprecate the old jobs and start exclusively maintaining the new jobs now, rather than wait to resolve the remaining performance issues. Over time we hope to address issues with the owners of the affected test suites.
March 07, 2016
Selena Deckelmann
[portland] taskcluster-worker Hello, World
The TaskCluster Platform team is in Portland this week, hacking on the taskcluster-worker.
Today, we all sync’d up on the current state of our worker, and what we’re going to hack on this week. We started with the current docs.
The reason we’re investing so much time in the worker is twofold:
- The worker code previously lived in two code bases – docker-worker and generic-worker. We need to unify these code bases so that multiple engineers can work on it, and to help us maintain feature parity.
- We need to get a worker that supports Windows into production. For now, we’re using the generic-worker, but we’d like to switch over to taskcluster-worker in late Q2 or early Q3. This timeline lines up with when we expect the Windows migration from Buildbot to happen.
One of the things I asked this team to do was come up with some demos of the new worker. The first demo today, from Greg Arndt, was to simply output a log and upload it.
The rest of the team is getting their Go environments set up to run tests and get hacking on crucial plugins, like our environment variable handling and additional artifact uploading logic we need for our production workers.
We’re also taking the opportunity to sync up with our Windows environment guru. Our goal for Buildbot to TaskCluster migration this quarter is focused on Linux builds and tests. Next quarter, we’ll be finishing Linux and, I hope, landing Windows builds in TaskCluster. To do that, we have a lot of details to sort out with how we’ll build Windows AMIs and deploy them. It’s a very different model because we don’t have the same options with Docker as we have on Linux.
March 01, 2016
Jonas Finnemann Jensen
One-Click Loaners with TaskCluster
Last summer Edgar Chen (air.mozilla.org) built an interactive shell for TaskCluster Linux workers, so developers can get an SSH-like session into a task container from their browser. We’ve slowly been improving this, and prior to Mozlando I added support for opening a VNC-like session connecting to an X-session inside a task container. I’ll admit I was mostly motivated by the prospect of giving an impressive demo, and the implementation details are likely to change as we improve it further. Consequently, we haven’t got many guides on how to use these features in their current state.
However, with people asking for TaskCluster “loaners” on IRC, I figure now is a good time to explain how these interactive features can be used to provide a loaner-on-demand flow for TaskCluster workers. At least on Linux, but hopefully we can do a similar thing on other platforms too. Before we dive in, I want to note that all of our Linux tasks run under docker with one container per task. Hence, you can pull down the docker image and play with it locally; the process and caveats, such as setting up loopback video and audio devices, are beyond the scope of this post. But feel free to ask on IRC (#taskcluster); I’m sure Greg Arndt has all the details, and some of them are already present in the “Run Locally” script displayed in the task-inspector.
Quick Start
If you can’t wait to play, here are the bullet points:
- You’ll need commit level 1 access (and an LDAP login)
- Go to treeherder.mozilla.org pick a task that runs on TaskCluster (I tried “[TC] Linux64 reftest-3”, build tasks don’t have X.org)
- Under “Job details” click the “Inspect Task” (this will open the task-inspector)
- In the top right corner in the task-inspector click “Login” (this opens login.taskcluster.net on a new tab)
- “Sign-in with LDAP” or “Sign-in with Okta” (Okta only works for employees)
- Click the “Grant Access” button (to grant tools.taskcluster.net access)
- In the task-inspector under the “Task” tab, scroll down and click the “One-Click Loaner” button
- Click again to confirm and create a one-click loaner task (this takes you to a “Waiting for Loaner” page)
- Just wait… 30s to 5 min (you can open the task-inspector for your loaner task to see the live log, if you are impatient)
- Eventually you should see two big buttons to open an interactive shell or display
- You should now have an interactive terminal (and display) into a running task container.
Warning: These loaners run on EC2 spot nodes and may disappear at any time. Use them for quickly trying something, not for writing patches.
Given all these steps, in particular the “Click again” in step (6), I recognize that it might take more than one click to get a “One-Click Loaner”. But we are just getting started, and all of this should be considered a moving target. The instructions above can also be found on MDN, where we will try to keep them up to date.
Implementation Details
To support interactive shell sessions the worker has an end-point that accepts websocket connections. For each new websocket the worker spawns a sh or bash inside the task container and pipes stdin, stdout and stderr over the websocket. In the browser we then have the websocket reading from and writing to hterm (from the chromium project), giving us a nice terminal emulator in the browser. There are still a few issues with the TTY emulation in docker, but it works reasonably for small things.
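Stripped of all the real-world details, the worker-side pattern looks roughly like the sketch below, using gorilla/websocket. This is a simplification for illustration, not the docker-worker code; stderr and most error handling are omitted.

```go
package main

import (
	"net/http"
	"os/exec"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{}

// shellHandler upgrades the request to a websocket, spawns a shell, and
// pipes the shell's stdin and stdout over the socket.
func shellHandler(w http.ResponseWriter, r *http.Request) {
	ws, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer ws.Close()

	cmd := exec.Command("bash")
	stdin, _ := cmd.StdinPipe()
	stdout, _ := cmd.StdoutPipe()
	if err := cmd.Start(); err != nil {
		return
	}

	// websocket -> shell stdin
	go func() {
		for {
			_, data, err := ws.ReadMessage()
			if err != nil {
				stdin.Close()
				return
			}
			stdin.Write(data)
		}
	}()

	// shell stdout -> websocket
	buf := make([]byte, 4096)
	for {
		n, err := stdout.Read(buf)
		if n > 0 {
			ws.WriteMessage(websocket.BinaryMessage, buf[:n])
		}
		if err != nil {
			break
		}
	}
	cmd.Wait()
}

func main() {
	http.HandleFunc("/shell", shellHandler)
	http.ListenAndServe(":8080", nil)
}
```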
For interactive display sessions (VNC-like sessions in the browser) the worker has an end-point which accepts both websocket connections and ordinary GET
requests for listing displays. For each GET
request the worker will run a small statically linked binary that lists all the X-sessions inside the task container, the result is then transformed to JSON and returned in the request. Once the user has picked a display, a websocket connection is opened with the display identifier in query-string. On the worker the websocket is piped to a statically linked instance of x11vnc running inside the task container. In the browser we then use noVNC to give the user an interactive remote display right in the browser.
As with the shell, there are also a few quirks to the interactive display: some graphical artifacts and other “interesting” issues. When streaming a TCP connection over a websocket we might not be handling buffering all too well, which I suspect introduces additional latency and possible bugs. I hope these things will get better in future iterations of the worker, which is currently undergoing an experimental rewrite from node to go.
Future Work
As mentioned in the “Quick Start” section, all of this is still a bit of a moving target. Access to any loaner is effectively granted to anyone with commit level 1 and to any employee. So your friends can technically hijack the interactive task you created. Obviously, we have to make that more fine-grained. At the moment, the “one-click loaner” button is also very specific to our Linux worker. As we add more platforms, we will have to extend support and find a way to abstract the platform dependent aspects. So it’s very likely that this will break on occasion.
We also recently introduced a hack defining the environment variable TASKCLUSTER_INTERACTIVE
when a loaner task is created. A quick hack that we might refactor later, but for now it’s enabling Armen Zambrano to customize how the docker image used for tests runs in loaner-mode. In bug 1250904 there is on-going work to ensure that a loaner will setup the test environment, but not start running tests until a user connects and types the right command. I’m sure there are many other things we can do to make the task environment more useful in loaner-mode, but this is certainly a good start.
Anyways, much of this is still quick hacks, with rough edges that needs to be resolved. So don’t be surprised if it breaks while we improve stability and attempt to add support for multiple platforms. With a bit of time and resources I’m fairly confident that the “one-click loaner” flow could become the preferred method for debugging issues specific to the test environment.
February 24, 2016
John Ford
cloud-mirror – Platform Engineering Operations Project of the Month
The cloud-mirror is something that we've written to reduce the costs and time of inter-region S3 transfers. Cloud-mirror was designed for use in the Taskcluster system, but it is possible to run it independently. Taskcluster, which is the new automation environment for Mozilla, can support passing artifacts between dependent tasks. An example of this is that when we do a build, we want to make the binaries available to the test machines. We originally hosted all of our artifacts in a single AWS region. This meant that every time a test was done in a region outside of the main region, we would incur an inter-region transfer for each test run. This is expensive and slow compared to in-region transfers.
We decided that a better idea would be to transfer the data from the main region to the other regions the first time it was requested in that region and then have all subsequent requests be inside of the region. This means that for the small overhead of an extra in-region copy of the file, we lose the cost and time overhead of doing inter-region transfers every single time.
Here's an example. We use us-west-2 as our main region for storing artifacts. A test machine in eu-central-1 requires "firefox-50.tar.bz2" for use in a test. The test machine in eu-central-1 will ask cloud mirror for this file. Since this is the first test to request this artifact in eu-central-1, cloud mirror will first copy "firefox-50.tar.bz2" into eu-central-1 then redirect to the copy of that file in eu-central-1. The second test machine in eu-central-1 will then ask for a copy of "firefox-50.tar.bz2" and because it's already in the region, the cloud mirror will immediately redirect to the eu-central-1 copy.
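Cloud-mirror is written in Node, but the core decision it makes per request is simple enough to sketch in a few lines of Go. Everything here (names, the URL format, the in-memory map standing in for Redis, the stubbed copy) is an assumption made up for illustration; it is not the actual implementation.

```go
package main

import (
	"fmt"
	"net/http"
)

// copied tracks which artifacts already exist in which destination region.
// The real service keeps this state in Redis; a map keeps the sketch short.
var copied = map[string]bool{}

// mirror redirects the caller to an in-region copy of an artifact, kicking
// off an inter-region copy the first time the artifact is requested there.
func mirror(w http.ResponseWriter, r *http.Request) {
	region := r.URL.Query().Get("region") // e.g. "eu-central-1"
	key := r.URL.Query().Get("key")       // e.g. "firefox-50.tar.bz2"

	cacheKey := region + "/" + key
	if !copied[cacheKey] {
		copyFromMainRegion(region, key) // S3 inter-region copy (stubbed out)
		copied[cacheKey] = true
	}

	// Hypothetical per-region bucket URL; the real layout differs.
	dest := fmt.Sprintf("https://artifacts-%s.s3.amazonaws.com/%s", region, key)
	http.Redirect(w, r, dest, http.StatusFound)
}

// copyFromMainRegion stands in for the S3 copy from us-west-2 into the
// destination region's bucket.
func copyFromMainRegion(region, key string) {}

func main() {
	http.HandleFunc("/redirect", mirror)
	http.ListenAndServe(":8080", nil)
}
```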
We expire artifacts from the destination regions so that we don't incur too high storage costs. We also use a redis cache configured to expire keys which have been used least recently first. Cloud mirror is written with Node 5 and uses Redis for storage. We use the upstream aws-sdk library for doing our S3 operations.
We're in the process of deploying this system to replace our original implementation called 's3-copy-proxy'. This earlier version was a much simpler version of this idea which we've been using in production. One of the main reasons for the rewrite was to be able to abstract the core concepts to allow anyone to write a backend for their storage type as well as being able to support more aws regions and move towards a completely HTTPS based chain.
If this is a project that's interesting to you, we have lots of ways that you could contribute! Here are some:
- switch polling for pending copy operations to use redis's pub/sub features
- write an Azure or GCE storage backend
- Modify the API to determine which cloud storage pool a request should be redirected to instead of having to encode that into the route
- Write a localhost storage backend for testing that serves content on 127.0.0.1
If you're interested in contributing, please ping me (jhford) in #taskcluster on irc.mozilla.org.
For more information about all Platform Ops projects, visit our wiki. If you're interested in helping out, http://ateam-bootcamp.readthedocs.org/en/latest/guide/index.html has resources for getting started.
February 16, 2016
Maja Frydrychowicz
First Experiment with TaskCluster
TaskCluster is a new-ish continuous integration system made at Mozilla. It manages the scheduling and execution of tasks based on a graph of their dependencies. It’s a general CI tool, and could be used for any kind of job, not just Mozilla things.
However, the example I describe here refers to a Mozilla-centric use case of TaskCluster1: tasks are run per check-in on the branches of Mozilla’s Mercurial repository and then results are posted to Treeherder. For now, the tasks can be configured to run in Docker images (Linux), but other platforms are in the works2.
So, I want to schedule a task! I need to add a new task to the task graph that’s created for each revision submitted to hg.mozilla.org. (This is part of my work on deploying a suite of tests for the Marionette Python test runner, i.e. testing the test harness itself.)
The rest of this post describes what I learned while making this work-in-progress.
There are builds and there are tests
mozilla-taskcluster operates based on the info under testing/taskcluster/tasks
in Mozilla’s source tree, where there are yaml files that describe tasks. Specific tasks can inherit common configuration options from base yaml files.
The yaml files are organized into two main categories of tasks: builds and tests. This is just a convention in mozilla-taskcluster about how to group task configurations; TC itself doesn’t actually know or care whether a task is a build or a test.
The task I’m creating doesn’t quite fit into either category: it runs harness tests that just exercise the Python runner code in marionette_client, so I only need a source checkout, not a Firefox build. I’d like these tests to run quickly without having to wait around for a build. Another example of such a task is the recently-created ESLint task.
Scheduling a task
Just adding a yaml file that describes your new task under testing/taskcluster/tasks
isn’t enough to get it scheduled: you must also add it to the list of tasks in base_jobs.yml
, and define an identifier for your task in base_job_flags.yml
. This identifier is used in base_jobs.yml
, and also by people who want to run your task when pushing to try.
How does scheduling work? First a decision task generates a task graph, which describes all the tasks and their relationships. More precisely, it looks at base_jobs.yml
and other yaml files in testing/taskcluster/tasks
and spits out a json artifact, graph.json
3. Then, graph.json
gets sent to TC’s createTask
endpoint, which takes care of the actual scheduling.
In the excerpt below, you can see a task definition with a requires
field and you can recognize a lot of fields that are in common with the ‘task’ section of the yaml files under testing/taskcluster/tasks/
.
{
  "tasks": [
    {
      "requires": [
        // id of a build task that this task depends on
        "fZ42HVdDQ-KFFycr9PxptA"
      ],
      "task": {
        "taskId": "c2VD_eCgQyeUDVOjsmQZSg",
        "extra": {
          "treeherder": {
            "groupName": "Reftest",
            "groupSymbol": "tc-R"
          }
        },
        "metadata": {
          "description": "Reftest test run 1",
          "name": "[TC] Reftest"
          // ...
        }
      }
    }
  ]
}
For now at least, a major assumption in the task-graph creation process seems to be that test tasks can depend on build tasks and build tasks don’t really4 depend on anything. So:
- If you want your tasks to run for every push to a Mozilla hg branch, add it to the list of builds in base_jobs.yml.
- If you want your task to run after certain build tasks succeed, add it to the list of tests in base_jobs.yml and specify which build tasks it depends on.
- Other than the above, I don’t see any way to specify a dependency between task A and task B in testing/taskcluster/tasks.
So, I added marionette-harness
under builds
. Recall, my task isn’t a build task, but it doesn’t depend on a build, so it’s not a test, so I’ll treat it like a build.
# in base_job_flags.yml
builds:
  # ...
  - marionette-harness

# in base_jobs.yml
builds:
  # ...
  marionette-harness:
    platforms:
      - Linux64
    types:
      opt:
        task: tasks/tests/harness_marionette.yml
This will allow me to trigger my task with the following try syntax: try: -b o -p marionette-harness
. Cool.
Make your task do stuff
Now I have to add some stuff to tasks/tests/harness_marionette.yml
. Many of my choices here are based on the work done for the ESLint task. I created a base task called harness_test.yml
by mostly copying bits and pieces from the basic build task, build.yml
and making a few small changes. The actual task, harness_marionette.yml
inherits from harness_test.yml
and defines specifics like Treeherder symbols and the command to run.
The command
The heart of the task is in task.payload.command
. You could chain a bunch of shell commands together directly in this field of the yaml file, but it’s better not to. Instead, it’s common to call a TaskCluster-friendly shell script that’s available in your task’s environment. For example, the desktop-test
docker image has a script called test.sh
through which you can call the mozharness script for your tests. There’s a similar build.sh
script on desktop-build
. Both of these scripts depend on environment variables set elsewhere in your task definition, or in the Docker image used by your task. The environment might also provide utilities like tc-vcs, which is used for checking out source code.
# in harness_marionette.yml
payload:
  command:
    + bash
    + -cx
    + >
        tc-vcs checkout ./gecko {{base_repository}} {{head_repository}} {{head_rev}} {{head_ref}} &&
        cd gecko &&
        ./mach marionette-harness-test
My task’s payload.command
should be moved into a custom shell script, but for now it just chains together the source checkout and a call to mach. It’s not terrible of me to use mach in this case because I expect my task to work in a build environment, but most tests would likely call mozharness.
Configuring the task’s environment
Where should the task run? What resources should it have access to? This was probably the hardest piece for me to figure out.
docker-worker
My task will run in a docker image using a docker-worker5. The image, called desktop-build
, is defined in-tree under testing/docker
. There are many other images defined there, but I only considered desktop-build
versus desktop-test
. I opted for desktop-build
because desktop-test
seems to contain mozharness-related stuff that I don’t need for now.
# harness_test.yml
image:
  type: 'task-image'
  path: 'public/image.tar'
  taskId: '{{#task_id_for_image}}desktop-build{{/task_id_for_image}}'
The image is stored as an artifact of another TC task, which makes it a ‘task-image’. Which artifact? The default is public/image.tar
. Which task do I find the image in? The magic incantation '{{#task_id_for_image}}desktop-build{{/task_id_for_image}}'
somehow6 obtains the correct ID, and if I look at a particular run of my task, the above snippet does indeed get populated with an actual taskId
.
"image": {
"path": "public/image.tar",
// Mystery task that makes a desktop-build image for us. Thanks, mystery task!
"taskId": "aqt_YdmkTvugYB5b-OvvJw",
"type": "task-image"
}
Snooping around in the handy Task Inspector, I found that the magical mystery task is defined in image.yml and runs build_image.sh
. Fun. It’s also quite convenient to define and test your own custom image.
Other details that I mostly ignored
# in harness_test.yml
scopes:
  # Nearly all of our build tasks use tc-vcs
  - 'docker-worker:cache:level-{{level}}-{{project}}-tc-vcs'
cache:
  # The taskcluster-vcs tooling stores the large clone caches in this
  # directory and will reuse them for new requests this saves about 20s~
  # and is the most generic cache possible.
  level-{{level}}-{{project}}-tc-vcs: '/home/worker/.tc-vcs'
- Routes allow your task to be looked up in the task index. This isn’t necessary in my case so I just omitted routes altogether.
- Scopes are permissions for your tasks, and I just copied the scope that is used for checking out source code.
- workerType is a configuration for managing the workers that run tasks. To me, this was a choice between b2gtest and b2gbuild, which aren’t specific to b2g anyway. b2gtest is more lightweight, I hear, which suits my harness-test task fine.
- I had to include a few dummy values under extra in harness_test.yml, like build_name, just because they are expected in build tasks. I don’t use these values for anything, but my task fails to run if I don’t include them.
Yay for trial and error
- If you have syntax errors in your yaml, the Decision task will fail. If this happens during a try push, look under Job Details > Inspect Task to find useful error messages.
- Iterating on your task is pretty easy. Aside from pushing to try, you can run tasks locally using vagrant and you can build a task graph locally as well with
mach taskcluster-graph
.
Resources
Blog posts from other TaskCluster users at Mozilla:
- https://ehsanakhgari.org/blog/2015-09-29/my-experience-adding-new-build-type-taskcluster
- https://elvis314.wordpress.com/2015/11/09/adventures-in-task-cluster-running-tests-locally/
- https://elvis314.wordpress.com/2015/11/11/adventures-in-task-cluster-running-a-custom-docker-image/
There is lots of great documentation at docs.taskcluster.net, but these sections were especially useful to me:
Acknowledgements
Thanks to dustin, pmoore and others for corrections and feedback.
- This is accomplished in part thanks to mozilla-taskcluster, a service that links Mozilla’s hg repo to TaskCluster and creates each decision task. More at TaskCluster at Mozilla ↩
- Run tasks on any platform thanks to generic worker ↩
- To look at a graph.json artifact, go to Treeherder, click a green ‘D’ job, then Job details > Inspect Task, where you should find a list of artifacts. ↩
- It’s not really true that build tasks don’t depend on anything. Any task that uses a task-image depends on the task that creates the image. I’m sorry for saying ‘task’ five times in every sentence, by the way. ↩
- …as opposed to a generic worker. ↩
- {{#task_id_for_image}} is an example of a predefined variable that we can use in our TC yaml files. Where do they come from? How do they get populated? I don’t know. ↩
October 12, 2015
John Ford
Splitting out taskcluster-base into component libraries
Taskcluster serverside components are currently built using the suite of libraries in the taskcluster-base npm package. This package is many things: config parsing, data persistence, statistics, json schema validators, pulse publishers, a rest api framework and some other useful tools. Having these all in one single package means that each time a contributor wants to hack on one part of our platform, she'll have to figure out how to install and run all of our dependencies. This is annoying when it's waiting for a libxml.so library build, but just about impossible for contributors who aren't on the Taskcluster platform team. You need Azure, Influx and AWS accounts to be able to run the full test suite. You also might experience confusing errors in a part of the library you're not even touching.
Additionally, we are starting to get to the point where some services must upgrade one part of taskcluster-base without using other parts. This is generally frowned upon, but sometimes we just need to put a bandaid on a broken system that's being turned off soon. We deal with this currently by exporting base.Entity and base.LegacyEntity. I'd much rather we just export a single base.Entity and have people who need to keep using the old Entity library use taskcluster-lib-legacyentity directly.
We're working on fixing this! The structure of taskcluster-base is really primed and ready to be split up since it's already a bunch of independent libraries that just so happen to be collocated. The new component loader that landed was the first library to be included in taskcluster-base this way and I converted our configs and stats libraries last week.
The naming convention that we've settled on is that taskcluster libraries will be prefixed with taskcluster-lib-X. This means we have taskcluster-lib-config and taskcluster-lib-stats. We'll continue to name services as taskcluster-Y, like taskcluster-auth or taskcluster-confabulator. The best way to get the current supported set of taskcluster libraries is still going to be to install the taskcluster-base npm module.
Some of our libraries are quite large and have a lot of history in them. I didn't really want to just create a new repository and copy in the files we care about and destroy the history. Instead, I wrote a simple and ugly tool (https://github.com/jhford/taskcluster-base-split) which does the pedestrian tasks involved in this split up by filtering out irrelevant history for each project, moving files around and doing some preliminary cleanup work on the new library.
This tooling gets us 90% of the way to a split out repository, but as always, a human is required to take it the last step of the way. Imports need to be fixed, dependencies must be verified and tests need to be fixed. I'm also taking this opportunity to implement babel-transpiling support in as many libraries as I can. We use babel everywhere in our application code, so it'll be nice to have it available in our platform libraries as well. I'm using the babel-runtime package instead of requiring the direct use of babel. The code produced by our babel setup is tested in tested using the node 0.12 binary without any wrappers at all.
Having different libraries will introduce the risk of our projects having version number hell. We're still going to have a taskcluster-base npm package. This package will simply be a package.json file which specifies the supported versions of the taskcluster-lib-* packages we ship as a release and an index.js file which imports and re-exports the libraries that we provide. If we have two libraries that have codependent changes, we can land new versions in those repositories and use taskcluster-base as the synchronizing mechanism.
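To make that concrete, here is a minimal sketch of what the meta-package's index.js could look like. The exact set of re-exported libraries is illustrative; only taskcluster-lib-config and taskcluster-lib-stats are names mentioned above:
// index.js for the taskcluster-base meta-package (illustrative sketch only).
// Each library lives in its own repository and npm package; this file just
// re-exports the versions pinned in package.json under the old names.
module.exports = {
  config: require('taskcluster-lib-config'),
  stats: require('taskcluster-lib-stats')
  // ...and so on for the other taskcluster-lib-* packages we release together
};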
A couple of open questions that I'd love to get input on are how we should share package.json snippets and babel configurations. We mostly have a solution for eslint, but we'd love to be able to share as much as possible in our .babelrc configuration files. If you have a good idea for how we can do that, please get in touch!
One of the goals in doing this is to make taskcluster components easier to write. We'd love to see components written by other teams use our framework since we know it's tested to work well with Taskcluster. It also makes it easier for the taskcluster team to advise on design and maintenance concerns.
Once a few key changes have landed, I will write a series of blog posts explaining how core taskcluster services are structured.
October 09, 2015
Wander Lairson Costa
In tree tasks configuration
This post is about our plans for representing Taskcluster tasks inside the gecko tree. Jonas, Dustin and I had a discussion in Berlin about this; here I summarize what we have so far. We currently store tasks in a yaml file, and they are translated to json format by the mach command. The syntax we have now is not the most flexible one: it is hard to parameterize the task and very difficult to represent task relationships.
Let us illustrate the shortcomings with two problems we currently have. Both apply to B2G.
B2G (as in Android) has three different build variants: user, userdebug and eng. Each one has slightly different task configurations. As there is no flexible way to parameterize tasks, we end up with one different task file for each build variant.
When doing nightly builds, we must send update data to the OTA server. We have plans to run a build task, then run the test tasks on this build, and if all tests pass, we run a task responsible to update the OTA server. The point is that today we have no way to represent this relationship inside the task files.
For the first problem Jonas has a prototype for json parameterization. There were discussions at the Berlin work week about whether we should stick with yaml files or use Python files for task configuration. We do want to keep the syntax declarative, which favors yaml; storing configurations in Python files brings much more expressiveness and flexibility, but it can result in the same configuration hell we have with Buildbot.
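To give a feel for what json parameterization buys us, here is a rough JavaScript sketch of the idea. The template, parameter names and values below are invented for illustration and are not Jonas's actual prototype:
// Hypothetical illustration: expand one task template per build variant
// instead of keeping three nearly identical task files.
function parameterize(template, params) {
  return template.replace(/\{\{(\w+)\}\}/g, function(match, name) {
    if (!(name in params)) {
      throw new Error('Missing template parameter: ' + name);
    }
    return params[name];
  });
}

var template = 'b2g build --variant={{variant}} --target={{target}}';
['user', 'userdebug', 'eng'].forEach(function(variant) {
  console.log(parameterize(template, {variant: variant, target: 'flame-kk'}));
});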
The second problem is more complex, and we still haven’t reached a final design. The first question is how we describe task dependencies: top-down, i.e., we specify which task(s) should run after a completed task, or bottom-up, where a task specifies which tasks it depends on. In general, we all agreed to go with a top-down syntax, since most scenarios beg for a top-down approach. Another question is whether we should put the description of task relationships inside the task files or in a separate configuration file. We would like to represent task dependencies inside the task file; the problem is then how to determine the root task for the task graph. One suggestion is having a task file called root.yml which only contains root tasks.
October 05, 2015
Selena Deckelmann
[berlin] TaskCluster Platform: A Year of Development
Back in September, the TaskCluster Platform team held a workweek in Berlin to discuss upcoming feature development, focus on platform stability and monitoring and plan for the coming quarter’s work related to Release Engineering and supporting Firefox Release. These posts are documenting the many discussions we had there.
Jonas kicked off our workweek with a brief look back on the previous year of development.
Prototype to Production
In the last year, TaskCluster went from an idea with a few tasks running to running all of FirefoxOS aka B2G continuous integration, which is about 40 tasks per minute in the current environment.
Architecture-wise, not a lot of major changes were made. We went from CloudAMQP to Pulse (in-house RabbitMQ). And shortly, Pulse itself will be moving its backend to CloudAMQP! We introduced task statuses, and then simplified them.
On the implementation side, however, a lot changed. We added many features and addressed a ton of docker worker bugs. We killed Postgres and added Azure Table Storage. We rewrote the provisioner almost entirely, and moved to ES6. We learned a lot about babel-node.
We introduced the first alternative to the Docker worker, the Generic worker. For the first time, we had Release Engineering create a worker: the Buildbot Bridge.
We have several new users of TaskCluster! Brian Anderson from Rust created a system for testing all Cargo packages for breakage against release versions. We’ve had a number of external contributors create builds for FirefoxOS devices. We’ve had a few Github-based projects jump on taskcluster-github.
Features that go beyond BuildBot
One of the goals of creating TaskCluster was to not just get feature parity, but go beyond and support exciting, transformative features to make developer use of the CI system easier and fun.
Some of the features include:
- Interactive sessions
- Live logging (mentioned in our createArtifact() docs and visible in the task-inspector for a task)
- Public-first task statuses
- Easy Indexing
- Storage in S3 (see createArtifact() documentation)
- Public first, reference-style APIs
- Support for remote device lab workers
Features coming in the near future to support Release
Release is a special use case that we need to support in order to take on the Firefox production workload. The focus of development work in Q4 and beyond includes:
- Secrets handling to support Release and ops workflows. In Q4, we should see secrets.taskcluster.net go into production and UI for roles-based management.
- Scheduling support for coalescing, SETA and cache locality. In Q4, we’re focusing on an external data solution to support coalescing and SETA.
- Private data hosting. In Q4, we’ll be using a roles-based solution to support these.
TaskCluster Platform: 2015Q3 Retrospective
Welcome to TaskCluster Platform’s 2015Q3 Retrospective! I’ve been managing this team this quarter and thought it would be nice to look back on what we’ve done. This report covers what we did for our quarterly goals. I’ve linked to “Publications” at the bottom of this page, and we have a TaskCluster Mozilla Wiki page that’s worth checking out.
High level accomplishments
- Dramatically improved stability of TaskCluster Platform for Sheriffs by fixing TreeHerder ingestion logic and regexes, adding better logging and fixing bugs in our taskcluster-vcs and mozilla-taskcluster components
- Created and Deployed CI builds on three major platforms:
- Added Linux64 (CentOS), Mac OS X cross-compiled builds as Tier2 CI builds
- Completed and documented prototype Windows 2012 builds in AWS and their task configuration
- Deployed auth.taskcluster.net, enabling better security, better support for self-service authorization and easier contributions from outside our team
- Added region biasing based on cost and availability of spot instances to our AWS provisioner
- Managed the workload of two interns, and significantly mentored a third
- Onboarded Selena as a new manager
- Held a workweek to focus attention on bringing our environment into production support of Release Engineering
Goals, Bugs and Collaborators
We laid out our Q3 goals in this etherpad. Our chosen themes this quarter were:
- Improve operational excellence — focus on sheriff concerns, data collection,
- Facilitate self-serve consumption — refactoring auth and supporting roles for scopes, and
- Exploit opportunities to differentiate from other platforms — support for interactive sessions, docker images as artifacts, github integration and more blogging/docs.
We had 139 Resolved FIXED bugs in TaskCluster product.
We also resolved 7 bugs in FirefoxOS, TreeHerder and RelEng products/components.
We received significant contributions from other teams: Morgan (mrrrgn) designed, created and deployed taskcluster-github; Ted deployed Mac OS X cross compiled builds; Dustin reworked the Linux TC builds to use CentOS, and resolved 11 bugs related to TaskCluster and Linux builds.
An additional 9 people contributed code to core TaskCluster, in-tree build scripts and task definitions: aus, rwood, rail, mshal, gerard-majax, mihneadb@gmail.com, htsai, cmanchester, and echen.
The Big Picture: TaskCluster integration into Platform Operations
Moving from B2G to Platform was a big shift. The team had already made a goal of enabling Firefox Release builds, but it wasn’t entirely clear how to accomplish that. We spent a lot of this quarter learning things from RelEng and prioritizing. The whole team spent the majority of our time supporting others’ use of TaskCluster through training and support, developing task configurations and resolving infrastructure problems. At the same time, we shipped docker-worker features, provisioner biasing and a new authorization system. One tricky infra issue that John and Jonas worked on early in the quarter was a strange AWS Provisioner failure that came down to an obscure missing dependency. We had a few git-related tree closures that Greg worked closely on and ultimately committed fixes to taskcluster-vcs to help resolve. Everyone spent a lot of time responding to bugs filed by the sheriffs and requests for help on IRC.
It’s hard to overstate how important the Sheriff relationship and TreeHerder work was. A couple teams had the impression that TaskCluster itself was unstable. Fixing this was a joint effort across TreeHerder, Sheriffs and TaskCluster teams.
When we finished, useful errors were finally being reported by tasks and starring became much more specific and actionable. We may have received a partial compliment on this from philor. The extent of artifact upload retries, for example, was made much clearer and we’ve prioritized fixing this in early Q4.
Both Greg and Jonas spent many weeks meeting with Ed and Cam, designing systems, fixing issues in TaskCluster components and contributing code back to TreeHerder. These meetings also led to Jonas and Cam collaborating more on API and data design, and this work is ongoing.
We had our own “intern” who was hired on as a contractor for the summer, Edgar Chen. He did some work with the docker-worker, implementing Interactive Sessions, and did analysis on our provisioner/worker efficiency. We made him give a short, sweet presentation on the interactive sessions. Edgar is now at CMU for his sophomore year and has referred at least one friend back to Mozilla to apply for an internship next summer.
Pete completed a Windows 2012 prototype build of Firefox that’s available from Try, with documentation and a completely automated process for creating AMIs. He hasn’t created a narrated video with dueling, British-English accented robot voices for this build yet.
We also invested a great deal of time in the RelEng interns. Jonas and Greg worked with Anhad on getting him productive with TaskCluster. When Anthony arrived, we also onboarded him. Jonas worked closely to get him working on a new project, hooks.taskcluster.net. To take these two bits of work from RelEng on, I pushed TaskCluster’s roadmap for generic-worker features back a quarter and Jonas pushed his stretch goal of getting the big graph scheduler into production to Q4.
We worked a great deal with other teams this quarter on taskcluster-github, supporting new Firefox and B2G builds, RRAs for the workers and generally telling Mozilla about TaskCluster.
Finally, we spent a significant amount of time interviewing, and then creating a more formal interview process that includes a coding challenge and structured-interview type questions. This is still in flux, but the first two portions are being used and refined currently. Jonas, Greg and Pete spent many hours interviewing candidates.
Berlin Work Week
Toward the end of the quarter, we held a workweek in Berlin to focus our next round of work on critical RelEng and Release-specific features as well as production monitoring planning. Dustin surprised us with delightful laser cut acrylic versions of the TaskCluster logo for the team! All team members reported that they benefited from being in one room to discuss key designs, get immediate code review, and demonstrate work in progress.
We came out of this with 20+ detailed documents from our conversations, greater alignment on the priorities for Platform Operations and a plan for trainings and tutorials to give at Orlando. Dustin followed this up with a series of ‘TC Topics’ Vidyo sessions targeted mostly at RelEng.
Our Q4 roadmap is focused on key RelEng features to support Release.
Publications
Our team published a few blog posts and videos this quarter:
- TaskCluster YouTube channel with two generic worker videos
- On Planet Taskcluster:
- Monitoring TaskCluster Infrastructure (garndt)
- Building Firefox for Windows 2012 on Try (pmoore)
- TaskCluster Component Loader (jhford)
- Getting started with TaskCluster APIs (jonasfj)
- De-mystifying TaskCluster intree scheduling (garndt)
- Running phone builds on TaskCluster (wcosta)
- On Air Mozilla
- Interactive Sessions (Edgar Chen)
- TaskCluster GitHub, Continuous integration for Mozillians by Mozillians (mrrrgn)
Wander Lairson Costa
Running phone builds on Taskcluster
In this post I am going to talk about my work on phone builds inside the Taskcluster infrastructure. Mozilla is gradually moving from Buildbot to Taskcluster. Here I am going to give a survival guide for Firefox OS phone builds.
Submitting tasks
A task is nothing more than a json file containing the description of the job to execute. But you don’t need to handle the json directly: all tasks are written in YAML, which is then processed by the mach command. The in-tree tasks are located at testing/taskcluster/tasks and the build tasks are inside the builds/ directory.
My favorite command to try out a task is mach taskcluster-build. It allows you to process a single task and output the json formatted task, ready for Taskcluster submission.
$ ./mach taskcluster-build \
--head-repository=https://hg.mozilla.org/mozilla-central \
--head-rev=tip \
--owner=foobar@mozilla.com \
tasks/builds/b2g_desktop_opt.yml
Although we specify a Mercurial repository, Taskcluster also accepts git repositories interchangeably.
This command will print out the task to the console output. To run the task, you can copy the generated task and paste it in the task creator tool. Then just click on Create Task to schedule it to run. Remember that you need Taskcluster Credentials to run Taskcluster tasks. If you have taskcluster-cli installed, you can then pipe the mach output to taskcluster run-task.
The tasks are effectively executed inside a docker image.
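For orientation, the json that mach prints has roughly the shape sketched below. This is a heavily trimmed, illustrative example; the provisioner, worker type, image and command values are made up rather than copied from a real in-tree task:
// Rough shape of a docker-worker task definition (illustrative values only).
var task = {
  provisionerId: 'aws-provisioner',        // assumed provisioner name
  workerType: 'b2gbuild',                  // assumed worker type
  created: new Date().toJSON(),
  deadline: new Date(Date.now() + 24 * 60 * 60 * 1000).toJSON(),
  payload: {
    image: 'quay.io/mozilla/builder:0.0.14',   // docker image the task runs in
    command: ['/bin/bash', '-c', './build.sh'],
    maxRunTime: 3600
  },
  metadata: {
    name: 'B2G desktop opt build',
    description: 'Example of the task shape only',
    owner: 'foobar@mozilla.com',
    source: 'https://hg.mozilla.org/mozilla-central'
  }
};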
Mozharness
Mozharness is what we use to actually build stuff. The Mozharness architecture, despite its code size, is quite simple. Under the scripts directory you find the harness scripts. We are specifically interested in the b2g_build.py script. As the script name says, it is responsible for B2G builds. The B2G harness configuration files are located in the b2g/config directory. Not surprisingly, all files starting with “taskcluster” are for Taskcluster related builds.
Here are the most common configurations:
- default_vcs: This is the default vcs used to clone repositories when no other is given. [tc_vcs](https://tc-vcs.readthedocs.org/en/latest/) allows mozharness to clone either git or mercurial repositories transparently, with repository caching support.
- default_actions: The actions to execute. They must be present and in the same order as in the build class `all_actions` attribute.
- balrog_credentials_file: The credentials to send update data to the OTA server.
- nightly_build: `True` if this is a nightly build.
- upload: Upload info. Not used for Taskcluster.
- repo_remote_mappings: Maps external repositories to the [mozilla domain](https://git.mozilla.org).
- env: Environment variables for commands executed inside mozharness.
The listed actions map to Python methods inside the build class, with - replaced by _. For example, the action checkout-sources maps to the method checkout_sources. That’s where the mozharness simplicity comes from: everything boils down to a sequence of method calls, nothing more, no secrets.
For example, here is how you run mozharness to build a flame image:
python <gecko-dir>/testing/mozharness/scripts/b2g_build.py \
--config b2g/taskcluster-phone.py \
--disable-mock \
--variant=user \
--work-dir=B2G \
--gaia-languages-file locales/languages_all.json \
--log-level=debug \
--target=flame-kk \
--b2g-config-dir=flame-kk \
--repo=https://hg.mozilla.org/mozilla-central \
Remember you need your flame connected to the machine so the build system can extract the blobs.
In general you don’t need to worry about mozharness command line because it is wrapped by the build scripts.
Hacking Taskcluster B2G builds
All Taskcluster tasks run inside a docker container. Desktop and emulator B2G builds run inside the builder docker image. Phone builds are more complex, because:
1. Mozilla is not allowed to publicly redistribute phone binaries.
2. Phone build tasks need to access the Balrog server to send OTA update data.
3. Phone build tasks need to upload symbols to the crash reporter.
Due to (1), only users authenticated with a @mozilla account are allowed to download phone binaries (this works the same way as private builds). And because of (1), (2) and (3), the phone-builder docker image is secret, so only authorized users can submit tasks to it.
If you need to create a build task for a new phone, most of the time you will start from an existing task (Flame and Aries tasks are preferred) and then make your customizations. You might need to add new features to the build scripts, which currently are not the most flexible scripts around.
If you need to customize mozharness, make sure your changes are Python 2.6 compatible, because mozharness is used to run Buildbot builds too, and the Buildbot machines run Python 2.6. The best way to minimize risk of breaking stuff is to submit your patches to try with “-p all -b do” flags.
Need help? Ask at the #taskcluster channel.
September 30, 2015
Pete Moore
Building Firefox for Windows™ on Try using TaskCluster
Try them out for yourself!
Here are the try builds we have created. They were built from the official in-tree mozconfigs that we use for the builds running in Buildbot.
Set up your own Windows™ Try tasks
We are porting over all of Mozilla’s CI tasks to TaskCluster, including Windows™ builds and tests.
Currently Windows™ and OS X tasks still run on our legacy Buildbot infrastructure. This is about to change.
In this post, I am going to talk you through how I set up Firefox Desktop builds in TaskCluster on Try. In future, the TaskCluster builds should replace the existing Buildbot builds, even for releases. Getting them running on Try was the first in a long line of many steps.
Spoiler alert: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fc4b30cc56fb
Using the right Worker
In TaskCluster, Linux tasks run in a docker container. This doesn’t work on Windows, so we needed a different strategy.
TaskCluster defines the role of a Worker as a component that is able to claim tasks from the Queue, execute them, publish artifacts, and report back status to the Queue.
For Linux, we have the Docker Worker. This is the component that takes care of executing Linux tasks inside a docker container. Since everything takes place in a container, consecutive tasks cannot interfere with each other, and you are guaranteed a clean environment.
This year I have been working on the Generic Worker. This takes care of running TaskCluster tasks on other platforms.
For Windows, we have a different isolation strategy: since we cannot yet easily run inside a container, the Generic Worker will create a new Windows user for each task it runs.
This user will have its own home directory, and will not have privileged access to the host OS. This means it should not be able to make any persistent changes to the host OS that will outlive the lifetime of the task. The user is only able to affect HKEY_CURRENT_USER registry settings and write to its home folder, both of which are purged after task completion.
In other words, although not running in a container, the Generic Worker offers isolation to TaskCluster tasks by virtue of running each task as a different, custom created OS user with limited privileges.
Creating a Worker Type
TaskCluster considers a Worker Type as an entity which belongs to a Provisioner, and represents a host environment and hardware context for running one or more Workers. This is the Worker Type that I set up:
Not everybody has permission to create worker types - but then again, you only really need to do this if you are:
- using Windows (or anything else non-linux)
- not able to use an existing worker type
If you would like to create a new Worker Type, please contact the taskcluster team on irc.mozilla.org in the #taskcluster channel.
The Worker Type above boils down to some AWS hardware specs, and an ImageId ami-db657feb. But where did this come from?
Generating the AMI for the Worker Type
It is a Windows 2012 R2 AMI, and it was generated with this code checked in to the try branch. This is not automatically run, but is checked in for reference purposes.
Here is the code. The first is a script that creates the AMI:
This script works by exploiting the fact that when you spawn a Windows instance in AWS, using one of the AMIs that Amazon provides, you can include a Powershell snippet for additional setup. This gets executed automatically when you spawn the instance.
So we simply spawn an instance, passing through this powershell snippet, and then wait. A LONG time (an hour). And then we snapshot the image, and we have our new AMI. Simple!
Here is the Powershell snippet that it uses:
Hopefully this Powershell script is quite self-explanatory. It installs the required build tool chains for building Firefox Desktop, and then installs the parts it needs for running the Generic Worker on this instance. It sets up some additional config that is needed by the build process, and then takes an initial clone of mozilla-central, as an optimisation, so that future jobs only need to pull changes since the image was created.
The caching strategy is to have a clone of mozilla-central live under C:\gecko
, which is updated with an hg pull
from mozilla central each time a job runs. Then when a task needs to pull from try, it is only ever a few commits behind, and should pull updates very quickly.
Defining Tasks
Once we have our AMI created, and we’ve published our Worker Type, we need to submit tasks to get the Provisioner to spawn instances in AWS, and execute our tasks.
The next piece of the puzzle is working out how to get these jobs added to Try. Again, luckily for us, this is just a matter of in-tree config.
For this, most of the magic exists in testing/taskcluster/tasks/builds/firefox_windows_base.yml:
Reading through this, you see that, with the exception of knowing the value of a few parameters ({{object_dir}}, {{platform}}, {{arch}}, {{build_type}}, {{mozconfig}}), it spells out the full set of steps that a Windows build of Firefox Desktop requires on the Worker Type we created above. In other words, you see the full system setup in the Worker Type definition, and the full set of task steps in this Task Definition - so now you know as much as I do about how to build Firefox Desktop on Windows. It all exists in-tree, and is transparent to developers.
So where do these parameters come from? Well, this is just the base config - we define opt and debug builds for win32 and win64 architectures. These live [here]:
Here I will illustrate just one of them, the win32 debug build config:
This file above has defined those parameters, and provided some more task specific config too, which overlays the base config we saw before.
But wait a minute… how do these tasks know to use the win2012r2 worker type we created? The answer to that is that testing/taskcluster/tasks/builds/firefox_windows_base.yml inherits from testing/taskcluster/tasks/windows_build.yml:
Incidentally, this then inherits in turn from the root yaml file for all gecko builds (across all gecko platforms):
So the complete inheritance chain looks like this:
tasks/build.yml
tasks/windows_build.yml
tasks/builds/firefox_windows_base.yml
tasks/builds/firefox_win32_opt.yml
tasks/builds/firefox_win64_debug.yml
Getting the new tasks added to Try pushes
This involved adding win32 and win64 as build platforms in testing/taskcluster/tasks/branches/base_job_flags.yml (previously taskcluster was not running any tasks for these platforms):
And then associating these new task definitions we just created with these new build platforms. This is done in testing/taskcluster/tasks/branches/try/job_flags.yml:
Summary
The above hopefully has given you a taste for what you can do yourself in TaskCluster, and specifically in Gecko, regarding setting up new jobs. By following this guide, you too should be able to schedule Windows jobs in Taskcluster, including try jobs for Gecko projects.
For more information about TaskCluster, see docs.taskcluster.net.
John Ford
Taskcluster Component Loader
Since we're building our services with the same base libraries we end up having a lot of duplicated glue code. During a set of meetings in Berlin, Jonas and I were lamenting about how much copied, pasted and modified boilerplate was in our projects.
Between the API definition file and the command line to launch a program invariably sits a bin/server.js file for each service. This script basically loads up our config system, loads our Azure Entity library, loads a Pulse publisher, a JSON Schema validator and a Taskcluster-base App. Each background worker has its own bin/something.js which basically has a very similar loop. Services with unit tests have a test/helper.js file which initializes the various components for testing. Furthermore, we might have things initialize inside of a given before() or beforeEach().
The problem with having so much boilerplate is twofold. First, each time we modify one service's boilerplate, we are now adding maintenance complexity and risk because of that subtle difference to the other services. We'd eventually end up with hundreds of glue files which do roughly the same thing, but accomplish it completely differently depending on which service it's in. The second problem is that within a single project, we might load the same component ten ways in ten places, including in tests. Having a single codepath that we can test ensures that we're always initializing the components properly.
During a little downtime between sessions, Jonas and I came up with the idea to have a standard component loading system for taskcluster services. Being able to rapidly iterate and discuss in person made the design go very smoothly and in the end, we were able to design something we were both happy with in about an hour or so.
The design we took is to have two 'directories' of components. One is the project wide set of components which has all the logic about how to build the complex things like validators and entities. These components can optionally have dependencies. In order to support different values for different environments, we force the main directory to declare which 'virtual dependencies' it requires. They are declared as a list of strings. The second level of component directory is where these 'virtual dependencies' have their value.
Both virtual and concrete dependencies can either be 'flat' values or objects. If a dependency is a string, number, function, Promise or an object without a setup property, we just give that exact value back as a resolved Promise. If the component is an object with a setup property, we initialize the dependencies specified by the 'requires' list property, then pass those values as properties on an object to the function at the 'setup' property. The value that function returns is stored as a resolved promise. Components can only depend on other components' non-flat dependencies.
Using code is a good way to show how this loader works:
// lib/components.js
let loader = require('taskcluster-base').loader;
let fakeEntityLibrary = require('fake');
module.exports = loader({
fakeEntity: {
requires: ['connectionString'],
setup: async deps => {
let conStr = await deps.connectionString;
return fakeEntityLibrary.create(conStr);
},
},
},
['connectionString'],
);
In this file, we're building a really simple component directory which only contains a contrived 'fakeEntity'. This component depends on having a connection string to fully configure. Since we want to use this code in production, development and testing, we don't want to bake configuration into this file, so we force the thing using this to itself give us a way to configure what the connection string.
// bin/server.js
let config = require('taskcluster-base').config('development');
let loader = require('../lib/components.js');
let load = loader({
connectionString: config.entity.connectionString,
});
async function main() {
  let configuredFakeEntity = await load('fakeEntity');
  // ... use configuredFakeEntity to start the service
}
main();
In this file, we're providing a simple directory that satisfies the 'virtual' dependencies we know need to be fulfilled before initializing can happen.
Since we're creating a dependency tree, we want to avoid having cyclic dependencies. I've implemented a cycle checker which ensures that you cannot configure a cyclical dependency. It doesn't rely on the call stack being exceeded from infinite recursion either!
This is far from being the only thing that we figured out improvements for during this chat. Two other problems that we were able to talk through were splitting out taskcluster-base and having a background worker framework.
Currently, taskcluster-base is a monolithic library. If you want our Entities at version 0.8.4, you must take our config at 0.8.4 and our rest system at 0.8.4. This is great because it forces services to move all together. This is also awful because sometimes we might need a new stats library but can't afford the time to upgrade a bunch of Entities. It also means that if someone wants to hack on our stats module, they'll need to learn how to get our Entities unit tests to work to get a passing test run on their stats change.
Our plan here is to make taskcluster-base a 'meta-package' which depends on a set of taskcluster components that we support working together. Each of the libraries (entities, stats, config, api) will be split out into their own packages using git filter-branch to maintain history. This is just a bit of simple leg work to ensure that the splitting out goes smoothly.
The other thing we decided on was a standardized background looping framework. A lot of background workers follow the pattern "do this thing, wait one minute, do this thing again". Instead of each service implementing this in its own special way for each background worker, what we'd really like is to have a library which does all the looping magic itself. We can even have nice things like a watch dog timer to ensure that the loop doesn't get stuck.
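As a rough illustration of the shape such a library could take — this is a hypothetical sketch, not the framework we ended up writing, and the option names are invented:
// Hypothetical looping helper: run `task` every `interval` ms and complain
// if a single iteration exceeds the `watchDog` timeout.
function loop(options) {
  let running = false;
  setInterval(() => {
    if (running) {
      return;  // previous iteration still in flight; skip this tick
    }
    running = true;
    let watchDog = setTimeout(() => {
      console.error('iteration exceeded the watch dog timeout');
    }, options.watchDog);
    Promise.resolve().then(options.task).catch(err => {
      console.error('iteration failed:', err);
    }).then(() => {
      clearTimeout(watchDog);
      running = false;
    });
  }, options.interval);
}

// Example: poll something once a minute, alert if an iteration takes over 5 minutes.
loop({interval: 60 * 1000, watchDog: 5 * 60 * 1000, task: () => {
  return Promise.resolve(/* do the periodic work here */);
}});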
Once the PR has landed for the loader, I'm going to be converting the provisioner to use this new loader. This is a part of a new effort to make Taskcluster components easy to implement. Once a bunch of these improvements have landed, I intend to write up a couple blog posts on how you can write your own Taskcluster service.
August 13, 2015
Jonas Finnemann Jensen
Getting Started with TaskCluster APIs (Interactive Tutorials)
When we started building TaskCluster about a year and a half ago one of the primary goals was to provide a self-serve experience, so people could experiment and automate things without waiting for someone else to deploy new configuration. Greg Arndt (:garndt) recently wrote a blog post demystifying in-tree TaskCluster scheduling. The in-tree configuration allows developers to write new CI tasks to run on TaskCluster, and test these new tasks on try before landing them like any other patch.
This way of developing test and build tasks by adding in-tree configuration in a patch is very powerful, and it allows anyone with try access to experiment with configuration for much of our CI pipeline in a self-serve manner. However, not all tools are best triggered from a post-commit-hook; instead it might be preferable to have direct API access when:
- Locating existing builds in our task index,
- Debugging for intermittent issues by running a specific task repeatedly, and
- Running tools for bisecting commits.
To facilitate tools like this TaskCluster offers a series of well-documented REST APIs that can be accessed with either permanent or temporary TaskCluster credentials. We also provide client libraries for Javascript (node/browser), Python, Go and Java. However, because TaskCluster is a loosely coupled set of distributed components, it is not always trivial to figure out how to piece together the different APIs and features. To make these things more approachable I’ve started a series of interactive tutorials:
- Tutorial 1: Modern asynchronous Javascript,
- Tutorial 2: Authentication against TaskCluster, and,
- Tutorial 3: Creating Your First Task
All these tutorials are interactive, featuring a runtime that will transpile your code with babel.js before running it in the browser. The runtime environment also exposes the require function from a browserify bundle containing some of my favorite npm modules, making the example editors a great place to test code snippets using taskcluster or related services.
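To give a flavour of the kind of snippet those editors are good for, here is a small sketch using the taskcluster-client package to look up a task in the index and ask the queue for its status; the index namespace below is a placeholder you would swap for a real route:
// Sketch only: resolve an indexed namespace to a taskId, then fetch its status.
var taskcluster = require('taskcluster-client');

var index = new taskcluster.Index();   // index API, public read-only calls
var queue = new taskcluster.Queue();   // queue API, public read-only calls

index.findTask('garbage.example.latest').then(function(result) {
  // findTask returns the taskId that the namespace currently points at
  return queue.status(result.taskId);
}).then(function(result) {
  console.log('task state:', result.status.state);
}).catch(function(err) {
  console.error('lookup failed:', err);
});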
Happy hacking, and feel free to submit PRs for all my spelling errors at github.com/taskcluster/taskcluster-docs.
June 03, 2015
Selena Deckelmann
TaskCluster migration: about the Buildbot Bridge
Back on May 7, Ben Hearsum gave a short talk about an important piece of technology supporting our transition to TaskCluster, the Buildbot Bridge. A recording is available.
I took some detailed notes to spread the word about how this work is enabling a great deal of important Q3 work like the Release Promotion project. Basically, the bridge allows us to separate out work that Buildbot currently runs in a somewhat monolithic way into TaskGraphs and Tasks that can be scheduled separately and independently. This decoupling is a powerful enabler for future work.
Of course, you might argue that we could perform this decoupling in Buildbot.
However, moving to TaskCluster means adopting a modern, distributed queue-based approach to managing incoming jobs. We will be freed of the performance tradeoffs and careful attention required when using relational databases for queue management (Buildbot uses MySQL for its queues, TaskCluster uses RabbitMQ and Azure). We also will be moving “decision tasks” in-tree, meaning that they will be closer to developer environments and likely easier to manage keeping developer and build system environments in sync.
Here are my notes:
Why have the bridge?
- Allows a graceful transition
- We’re in an annoying state where we can’t have dependencies between buildbot builds and taskcluster tasks. For example: we can’t move firefox linux builds into taskcluster without moving everything downstream of those also into taskcluster
- It’s not practical and sometimes just not possible to move everything at the same time. This lets us reimplement buildbot schedulers as task graphs. Buildbot builds are tasks on the task graphs, enabling us to change each task to be implemented by a Docker worker, a generic worker or anything we want or need at that point.
- One of the driving forces is the build promotion project – the funsize and anti-virus scanning and binary moving – this is going to be implemented in taskcluster tasks but the rest will be in Buildbot. We need to be able to bounce between the two.
What is the Buildbot Bridge (BBB)
BBB acts as a TC worker and provisioner and delegates all those things to BuildBot. As far as TC is concerned, BBB is doing all this work, not Buildbot itself. TC knows nothing about Buildbot.
There are three services:
- TC Listener: responds to things happening in TC
- BuildBot Listener: responds to BB events
- Reflector: takes care of things that can’t be done in response to events — it reclaims tasks periodically, for example. TC expects Tasks to reclaim tasks. If a Task stops reclaiming, TC considers that Task dead.
BBB has a small database that associates build requests with TC taskids and runids.
BBB is designed to be multihomed. It is currently deployed but not running on three Buildbot masters. We can lose an AWS region and the bridge will still function. It consumes from Pulse.
The system is dependent on Pulse, SchedulerDB and Self-serve (in addition to a Buildbot master and Taskcluster).
Taskcluster Listener
Reacts to events coming from TC Pulse exchanges.
Creates build requests in response to tasks becoming “pending”. When someone pushes to mozilla-central, BBB inserts BuildRequests into BB SchedulerDB. Pending jobs appear in BB. BBB cancels BuildRequests as well — can happen from timeouts, someone explicitly cancelling in TC.
Buildbot Listener
Responds to events coming from the BB Pulse exchanges.
Claims a Task when builds start. Attaches BuildBot Properties to Tasks as artifacts. Has a buildslave name, information/metadata. It resolves those Tasks.
Buildbot and TC don’t have a 1:1 mapping of BB statuses and TC resolution. Also needs to coordinate with Treeherder color. A short discussion happened about implementing these colors in an artifact rather than inferring them from return codes or statuses inherent to BB or TC.
Reflector
- Runs on a timer – every 60 seconds
- Reclaims tasks: need to do this every 30-60 minutes
- Cancels Tasks when a BuildRequest is cancelled on the BB side (have to troll through BB DB to detect this state if it is cancelled on the buildbot side)
Scenarios
- A successful build!
Task is created. Task in TC is pending, nothing in BB. TCListener picks up the event and creates a BuildRequest (pending).
BB creates a Build. BBListener receives buildstarted event, claims the Task.
Reflector reclaims the Task while the Build is running.
Build completes successfully. BBListener receives log uploaded event (build finished), reports success in TaskCluster.
- Build fails initially, succeeds upon retry
(500 from hg – common reason to retry)
Same through Reflector.
BB fails, marked as RETRY. BBListener receives log uploaded event, reports exception to Taskcluster and calls rerun Task.
BB has already started a new Build. TCListener receives task-pending event, updates runid, does not create a new BuildRequest.
Build completes successfully. Buildbot Listener receives log uploaded event, reports success to TaskCluster.
- Task exceeds deadline before Build starts
Task created. TCListener receives task-pending event, creates BuildRequest. Nothing happens until the Task goes past its deadline and TaskCluster cancels it. TCListener receives task-exception event, cancels BuildRequest through Self-serve.
QUESTIONS:
- TC deadline, what is it? Queue: a task past a deadline is marked as timeout/deadline exceeded
- On TH, if someone requests a rebuild twice, what happens? There is no retry/rerun; we duplicate the subgraph — wherever we retrigger, you get everything below it, so you’d end up with duplicates. Retries and rebuilds are separate: rebuilds are triggered by humans, retries are internal to BB. TC doesn’t have a concept of retries.
- How do we avoid duplicate reporting? TC will be considered source of truth in the future. Unsure about interim. Maybe TH can ignore duplicates since the builder names will be the same.
- Replacing the scheduler — what does that mean exactly?
- Mostly moving decision tasks in-tree — practical impact: YAML files get moved into the tree
- Remove all scheduling from BuildBot and Hg polling
Roll-out plan
- Connected to the Alder branch currently
- Replacing some of the Alder schedulers with TaskGraphs
- All the BB Alder schedulers are disabled, and we were able to get a push to generate a TaskGraph!
Next steps might be release scheduling tasks, rather than merging into central. Someone else might be able to work on other CI tasks in parallel.
June 02, 2015
Selena Deckelmann
TaskCluster migration: a “hello, world” for worker task creator
On June 1, 2015, Morgan and Dustin presented an introduction to configuring and testing TaskCluster worker tasks. The session was recorded. Their notes are also available in an etherpad.
The key tutorial information centered on how to set up jobs, test/run them locally and selecting appropriate worker types for jobs.
This past quarter Morgan has been working on Linux Docker images and TaskCluster workers for Firefox builds. Using that work as an example, Morgan showed how to set up new jobs with Docker images. She also touched on a couple issues that remain, like sharing sensitive or encrypted information on publicly available infrastructure.
A couple really nice things:
- You can run the whole configuration locally by copy and pasting a shell script that’s output by the TaskCluster tools
- There are a number of predefined workers you can use, so that you’re not creating everything from scratch
Dustin gave an overview of task graphs using a specific example. Looking through the docs, I think the best source of documentation other than this video is probably the API documentation. The docs could use a little more narrative for context, as Dustin’s short talk about it demonstrated.
The talk closed with an invitation to help write new tasks, with pointers to the Android work Dustin’s been doing.
February 23, 2015
James Lal
Taskcluster Release Part 1 : Gecko
It's been awhile since my last blog post about taskcluster and I wanted to give an update...
Taskcluster + Gecko
Taskcluster is running by default on
In Treeherder you will see jobs run by both buildbot and taskcluster. The "TC" jobs are prefixed accordingly so you can tell the difference.
This is the last big step to enabling TC as the default CI for many mozilla projects. Adding new and existing branches is easily achieved with basic config changes.
Why is this a great thing? Just about everything is in the tree.
This means you can easily add new builds/tests and immediately push them to try for testing (see the configs for try).
Adding new tests and builds is easier than ever but the improvements don't stop there. Other key benefits on linux include:
We use docker
Docker enables easy cloning of CI environments.
# Pull tester image
docker pull quay.io/mozilla/tester:0.0.14
# Run tester image shell
docker run -it quay.io/mozilla/tester:0.0.14 /bin/bash
# <copy/paste stuff from task definitions into this>
Tests and builds are faster
Through this entire process we have been optimizing away overhead and using faster machines which means both build (and particularly test) times are faster.
(Wins look big but more in future blog post)
What's missing ?
Some tests fail due to differences in machines. When we move tests, things fail largely due to timing issues (there are a few cases left here).
Retrigger/cancel does not work (yet!): as of the time of writing, it has not yet hit production, but it will be deployed soon.
Results currently show up only on staging treeherder. We will incrementally report these to production treeherder.
May 27, 2014
James Lal
Gaia + Taskcluster + Treeherder
What is this stuff?
(originally posted on dev-gaia)
For some time now Gaia developers have wanted the ability to scale their tests infinitely, while reporting to a dashboard that both sheriffs and devs can monitor, and yet still maintain control over the test configurations themselves.
Taskcluster & Treeherder let us do this: http://treeherder-dev.allizom.org/ui/#/jobs?repo=gaia-master Taskcluster http://docs.taskcluster.net/ drives the tests and, with a small github hook, allows us to configure the jobs from a json file in the tree (this will likely be a yaml file in the end) https://github.com/mozilla-b2g/gaia/blob/master/taskgraph.json
Treeherder is the next generation "TBPL" which allows us to report results to sheriffs from external resources (meaning we can control the tests) for both a "try" interface (like pull requests) and branch landings.
Currently, we are very close to having green runs in treeherder, with only one intermittent and the rest green ...
How is this different from gaia-try?
Taskcluster will eventually replace all buildbot run jobs (starting with linux)... we are currently in the process of moving tests over and getting treeherder ready for production.
Gaia-try is run on top of buildbot and hooks into our github pull requests. Gaia-try gives us a single set of suites that the sheriffs can look at and help keep our tree green. This should be considered "production".
Treeherder/taskcluster are designed to solve the issues with the current buildbot/tbpl implementations:
- in tree configuration
- complete control over the test environment with docker (meaning you can have the exact same setup locally as on TBPL!)
- artifacts for pull requests (think screenshots for failed tests, gaia profiles, etc...)
- in tree graph capabilities (for example "smoketests" builds by running smaller test suites or how tests depend on builds).
How is this different from travis-ci?
- we can scale on demand on any AWS hardware we like (at very low cost thanks to spot)
- docker is used to provide a consistent test environment that may be run locally
- artifacts for pull requests (think screenshots for failed tests, gaia profiles, etc...)
- logs can be any size (but still mostly "live")
- reports to TBPL2 (treeherder)
When is this production ready?
taskcluster + treeherder is not ready for production yet... while the tests are running this is not in a state where sheriffs can manage it (yet!). Our plan is to continue to add taskcluster test suites (and builds!) for all trees (yes gecko) and have them run in parallel with the buildbot jobs this month...
I will be posting weekly updates on my blog about taskcluster/treeherder http://lightsofapollo.github.io/ and how it affects gaia (and hopefully your overall happiness)
Where are the docs??
- http://docs.taskcluster.net/
- (More coming to gaia-taskcluster and gaia readme as we get closer to production)
WHERE IS THE CODE?
- https://github.com/taskcluster (overall project)
- https://github.com/lightsofapollo/gaia-taskcluster (my current gaia integration)
- https://github.com/mozilla/treeherder-service (treeherder backend)
- https://github.com/mozilla/treeherder-ui (treeherder frontend)
March 04, 2014
James Lal
Taskcluster - Mozilla's new test infrastructure project
Taskcluster is not one singular entity that runs a script with output in a pretty interface or a github hook listener, but rather a set of decoupled interfaces that enables us to build various test infrastructures while optimizing for cost, performance and reliability. The focus of this post is Linux. I will have more information on how this works for OSX/Windows soon.
Some History
Mozilla has quite a few different code bases, most depend on gecko (the heart of Firefox and FirefoxOS). Getting your project hooked up to our current CI infrastructure usually requires a multi-team process that takes days or more. Historically, simply merging projects into gecko was easier than having external repositories that depend on gecko, which our current CI cannot easily support.
It is critical to be able to see in one place (TBPL) that all the projects that depend on gecko are working. Today this process is tightly coupled to TBPL and our buildbot infrastructure (which together make up our current CI). If you really care about your project not breaking when a change lands in gecko, you really only have one option: hosting your testing infrastructure under buildbot (which feeds TBPL).
Where Taskcluster comes in
Treeherder resolves the tight coupling problem by separating the reporting from the test running process. This enables us to re-imagine our workflow and how it's optimized. We can run tests anywhere using any kind of utility/library assuming it gives us the proper hooks (really just logs and some revision information) to plug results into our development workflow.
A high level workflow with taskcluster looks like this:
You submit some code (this can be a patch or a pull request, etc...) to a "scheduler" (I have started on one for gaia) which submits a set of tasks. Each task is run inside a docker container, and the container's image is specified as part of your task. This means anything you can imagine running on linux you can directly specify in your container (no more waiting for vm reimaging, etc...). It also means we directly control the resources that container uses (less variance in tests), AND if something goes wrong you can download the entire environment that test ran on locally to debug it.
As tasks are completed the taskcluster queue emits events over AMQP (think pulse) so anyone interested in the status of tests, etc. can hook directly into this... This enables us to post results as they happen directly to treeherder.
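For instance, with the taskcluster-client npm library you can bind to the queue's exchanges through pulse and react as tasks finish. This is just a sketch; the pulse credentials and the task id below are placeholders:
// Sketch: listen for a task-completed event over pulse using taskcluster-client.
var taskcluster = require('taskcluster-client');

var listener = new taskcluster.PulseListener({
  credentials: {username: 'pulse-username', password: 'pulse-password'}
});
var queueEvents = new taskcluster.QueueEvents();

// Bind to completed messages for a specific task (placeholder taskId).
listener.bind(queueEvents.taskCompleted({taskId: 'enter-a-task-id-here'}));

listener.on('message', function(message) {
  console.log('task completed:', message.payload.status.taskId);
});

listener.resume().then(function() {
  console.log('listening for task events...');
});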
The initial taskcluster provisioner provisions AWS spot nodes on demand (we have it capped to a fixed number right now), so during peaks we can burst to an almost unlimited number of nodes. During idle times workers shut themselves down to reduce costs. We have additional plans for different clouds (and physical hardware on open stack).
Each component can be easily replaced (and multiple types of workers and provisioners can be added on demand). Jonas Finnemann Jensen has done an awesome job documenting how taskcluster works in the docs at the API level.
What the future looks like
My initial plan is to hook everything up for gaia, the FirefoxOS frontend. This will replace our current travis CI setup.
As pull requests come in we will run tests on taskcluster and report status to both treeherder and github (the beloved github status api). The ability to hook up new types of tests from the tree itself (and test new types from the tree itself) will continue on in the form of a task template (another blog post coming). Developers can see the status of their tests from treeherder.
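As a sketch of what reporting to the github status api involves — this is not the actual integration code; the owner, repo, sha and token are placeholders, and the description/context strings are invented:
// Sketch: post a commit status to GitHub for a pull request's head commit.
var https = require('https');

function reportStatus(owner, repo, sha, state, token) {
  var body = JSON.stringify({
    state: state,                                    // 'pending', 'success', 'failure' or 'error'
    target_url: 'https://treeherder.mozilla.org/',   // where the details live
    description: 'TaskCluster tasks ' + state,
    context: 'taskcluster'
  });
  var req = https.request({
    hostname: 'api.github.com',
    path: '/repos/' + owner + '/' + repo + '/statuses/' + sha,
    method: 'POST',
    headers: {
      'Authorization': 'token ' + token,
      'User-Agent': 'taskcluster-status-example',    // GitHub requires a User-Agent
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(body)
    }
  }, function(res) {
    console.log('GitHub responded with', res.statusCode);
  });
  req.end(body);
}

reportStatus('mozilla-b2g', 'gaia', '<commit sha>', 'success', process.env.GITHUB_TOKEN);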
Code landing in master follows the same practice and results will report into a gaia specific treeherder view.
Most importantly, immediately after treeherder is launched we can run all gaia testing on the same exact infrastructure for both gaia and gecko commits. Jonas Sicking (b2g overload) has some great ideas about locking gecko <-> gaia versions to reduce another kind of failure which occurs when developing against the ever changing landscape of gecko / gaia commits.
When is the future? We have implemented the "core" of taskcluster already and have the ability to run tests. By the end of the month (March) we will have the capability to replace the entire gaia workflow with taskcluster.
Why not X CI solution
Building a brand new CI solution is non-trivial, so why are we doing this?
- To leverage LXC containers (docker): One of the big problems we hit when trying to debug test failures is the variance between testing locally and remotely. With LXC containers you can download the entire container (the entire environment which your test runs in) and run it with the same cpu/memory/swap/filesystem as it would run remotely.
- On demand scaling. We have (somewhat predictable) bursts throughout the day and the ability to spin up (and down) on demand is required to keep up with our changing needs throughout the day.
- Make in tree configuration easy. Pull requests + in tree configuration enable developers to quickly iterate on tests and testing infrastructure.
- Modular extensible components with public facing APIs. Want to run tasks to do things other than test/build, or report to something other than treeherder? We have or will build an api for that.
- Hackability is important... The parts you don't want to solve (running aws nodes, keeping them up, pricing them, etc...) are solved for you so you can focus on building the next great mozilla related thing (better bisection tools, etc...).
- More flexibility to test/deploy optimizations... We have something like a compute year of tests, and 10-30+ minute chunks of testing are normal. We need to iterate on our test infrastructure quickly to try to reduce this where possible with CI changes.
Here are a few potential alternatives below... I list out the pros & cons of each from my perspective (and a short description of each).
Travis [hosted]
TravisCI is an awesome [free] open source testing service that we use for many of our smaller projects.
Travis works really well for the 90% webdev use case. Gaia does not fit well into that use case, and gecko even less so.
Pros:
- Dead simple setup.
- Iterate on test frameworks, etc... on every pull request without any issue.
- Nice simple UI which reports live logging.
- Adding tests and configuring tests is trivial.
Cons:
- Difficult to debug failures locally.
- No public facing API for creating jobs.
- No build artifacts on pull requests.
- Cannot store arbitrarily long logs (this is only an issue for open source IIRC).
- On demand scaling.
Buildbot [build on top of it]
We currently use buildbot at scale (thousands of machines) for all gecko testing on multiple platforms. If you are using firefox, it was built by our buildbot setup.
(NOTE: This is a critique of how we currently use buildbot, not the entire project). If I am missing something or you think a CI solution could fit the bill, contact me!
Pros:
- We have it working at a large scale already.
Cons:
- Adding tests and configuring tests is fairly difficult and involves long lead times.
- Difficult to debug failures locally.
- Configuration files live outside of the tree.
- Persistent connection master/slave model.
- It's one monolithic project which is difficult to replace components of.
- Slow rollout of new machine requirements & configurations.
Jenkins
We are using Jenkins for our on device testing.
Pros:
- Easy to configure jobs from the UI (decent ability to do configuration yourself).
- Configuration (by default) does not live in the tree.
- Tons of plugins (with varying quality).
Cons:
- By default difficult to debug failures locally.
- Persistent connection master/slave model.
- Configuration files live outside of the tree.
Drone.io [hosted/not hosted]
Drone.io recently open sourced... It's docker based and shows promise. Out of all the options above it looks the closest to what we want for linux testing.
I am going to omit the Pros/Cons here; the basics look good for drone, but it requires some more investigation. Some missing things are:
- A long term plan for supporting multiple operating systems.
- A public api for scheduling tasks/jobs.
- On demand scaling.
January 31, 2014
James Lal
Using docker volumes for rapid development of containers
It's fairly obvious how to use docker for shipping an immutable image that is great for deployment. It was less obvious (to me) how to use docker to iterate on the image, run tests in it, etc...
Let's say you have a node project and you're writing some web service thing:
// server.js
var http = require('http');
...
// server_test.js
suite('my tests', function() {
});
# Dockerfile
FROM lightsofapollo/node:0.10.24
ADD . /service
WORKDIR /service
CMD node server.js
Before Volumes
Without using volumes your workflow is like this:
docker build -t image_name .
docker run image_name ./node_modules/.bin/mocha server_test.js
# .. make some changes and repeat...
While this is certainly not awful, it's a lot of extra steps you probably don't want to do...
After Volumes
While iterating, ideally we could just "shell in" to the container and make changes on the fly, then run some tests (like, let's say, vagrant).
You can do this with volumes:
# It's important that you only use the -v flag during development: it
# will override the contents of whatever you specify, and you should also
# keep in mind you want to run the final tests on the image without this
# volume at the end of your development, to make sure you didn't forget to
# build or something.
# Mount the current directory in your service folder (override the add
# above) then open an interactive shell
docker run -v $PWD:/service -i -t image_name /bin/bash
From here you can hack like normal making changes and running tests on the fly like you would with vagrant or on your host.
When you're done!
I usually have a makefile... I would set up the "make test" target something like this to ensure your tests are running on the contents of your image rather than using the volume:
.PHONY: test
test:
	docker build -t my_image .
	docker run my_image npm test

.PHONY: push
push: test
	docker push my_image