Configuration
Configuration considerations
To be able to support a number of use-cases, the module has quite a lot of configuration options. We tried to choose reasonable defaults. Several examples also show the main cases of how to configure the runners.
- Org vs Repo level. You can configure the module to connect the runners in GitHub on an org level and share the runners in your org, or set the runners on repo level and the module will install the runner to the repo. There can be multiple repos but runners are not shared between repos.
- Multi-Runner module. This modules allows you to create multiple runner configurations with a single webhook and single GitHub App to simplify deployment of different types of runners. Check the detailed module documentation for more information or checkout the multi-runner example.
- Workflow job event. You can configure the webhook in GitHub to send workflow job events to the webhook. Workflow job events were introduced by GitHub in September 2021 and are designed to support scalable runners. We advise using the workflow job event when possible.
- Linux vs Windows. You can configure the OS types linux and win. Linux will be used by default.
- Re-use vs Ephemeral. By default runners are re-used, until detected idle. Once idle they will be removed from the pool. To improve security we are introducing ephemeral runners. Those runners are only used for one job. Ephemeral runners only work in combination with the workflow job event. For ephemeral runners the lambda requests a JIT (just in time) configuration via the GitHub API to register the runner. JIT configuration is limited to ephemeral runners (and currently not supported by GHES). For non-ephemeral runners, a registration token is always requested. In both cases the configuration is made available to the instance via the same SSM parameter. To disable JIT configuration for ephemeral runners set
enable_jit_config
tofalse
. We also suggest using a pre-build AMI to improve the start time of jobs for ephemeral runners. - GitHub Cloud vs GitHub Enterprise Server (GHES). The runners support GitHub Cloud as well GitHub Enterprise Server. For GHES, we rely on our community for support and testing. We at Philips have no capability to test GHES ourselves.
- Spot vs on-demand. The runners use either the EC2 spot or on-demand life cycle. Runners will be created via the AWS CreateFleet API. The module (scale up lambda) will request via the CreateFleet API to create instances in one of the subnets and of the specified instance types.
- ARM64 support via Graviton/Graviton2 instance-types. When using the default example or top-level module, specifying
instance_types
that match a Graviton/Graviton 2 (ARM64) architecture (e.g. a1, t4g or any 6th-geng
orgd
type), you must also specifyrunner_architecture = "arm64"
and the sub-modules will be automatically configured to provision with ARM64 AMIs and leverage GitHub's ARM64 action runner. See below for more details.
AWS SSM Parameters
The module uses the AWS System Manager Parameter Store to store configuration for the runners, as well as registration tokens and secrets for the Lambdas. Paths for the parameters can be configured via the variable ssm_paths
. The location of the configuration parameters is retrieved by the runners via the instance tag ghr:ssm_config_path
. The following default paths will be used. Tokens or JIT config stored in the token path will be deleted after retrieval by instance, data not deleted after a day will be deleted by a SSM housekeeper lambda.
Path | Description |
---|---|
ssm_paths.root/var.prefix?/app/ |
App secrets used by Lambda's |
ssm_paths.root/var.prefix?/runners/config/<name> |
Configuration parameters used by runner start script |
ssm_paths.root/var.prefix?/runners/tokens/<ec2-instance-id> |
Either JIT configuration (ephemeral runners) or registration tokens (non ephemeral runners) generated by the control plane (scale-up lambda), and consumed by the start script on the runner to activate / register the runner. |
ssm_paths.root/var.prefix?/webhook/runner-matcher-config |
Runner matcher config used by webhook to decide the target for the webhook event. |
Available configuration parameters: |
Parameter name | Description |
---|---|
agent_mode |
Indicates if the agent is running in ephemeral mode or not. |
enable_cloudwatch |
Configuration for the cloudwatch agent to stream logging. |
run_as |
The user used for running the GitHub action runner agent. |
token_path |
The path where tokens are stored. |
Encryption
The module supports two scenarios to manage environment secrets and private keys of the Lambda functions.
Managed KMS key (default)
This is the default, no additional configuration is required.
Provided KMS key
You have to create and configure you KMS key. The module will use the context with key: Environment
and value var.environment
as encryption context.
resource "aws_kms_key" "github" {
is_enabled = true
}
module "runners" {
...
kms_key_arn = aws_kms_key.github.arn
...
Pool
The module supports two options for keeping a pool of runners. One is via a pool which only supports org-level runners, the second option is keeping runners idle.
The pool is introduced in combination with the ephemeral runners and is primarily meant to ensure if any event is unexpectedly dropped and no runner was created, the pool can pick up the job. The pool is maintained by a lambda. Each time the lambda is triggered a check is performed to ensure the number of idle runners managed by the module matches the expected pool size. If not, the pool will be adjusted. Keep in mind that the scale down function is still active and will terminate instances that are detected as idle.
pool_runner_owner = "my-org" # Org to which the runners are added
pool_config = [{
size = 20 # size of the pool
schedule_expression = "cron(* * * * ? *)" # cron expression to trigger the adjustment of the pool
}]
The pool is NOT enabled by default and can be enabled by setting at least one object of the pool config list. The ephemeral example contains configuration options (commented out).
Idle runners
The module will scale down to zero runners by default. By specifying a idle_config
config, idle runners can be kept active. The scale down lambda checks if any of the cron expressions matches the current time with a margin of 5 seconds. When there is a match, the number of runners specified in the idle config will be kept active. In case multiple cron expressions match, the first one will be used. Below is an idle configuration for keeping runners active from 9:00am to 5:59pm on working days. The cron expression generator by Cronhub is a great resource to set up your idle config.
By default, the oldest instances are evicted. This helps keep your environment up-to-date and reduce problems like running out of disk space or RAM. Alternatively, if your older instances have a long-living cache, you can override the evictionStrategy
to newest_first
to evict the newest instances first instead.
idle_config = [{
cron = "* * 9-17 * * 1-5"
timeZone = "Europe/Amsterdam"
idleCount = 2
# Defaults to 'oldest_first'
evictionStrategy = "oldest_first"
}]
Note: When using Windows runners, we recommend keeping a few runners warmed up due to the minutes-long cold start time.
Supported config
Cron expressions are parsed by cron-parser. The supported syntax.
* * * * * *
┬ ┬ ┬ ┬ ┬ ┬
│ │ │ │ │ |
│ │ │ │ │ └ day of week (0 - 7) (0 or 7 is Sun)
│ │ │ │ └───── month (1 - 12)
│ │ │ └────────── day of month (1 - 31)
│ │ └─────────────── hour (0 - 23)
│ └──────────────────── minute (0 - 59)
└───────────────────────── second (0 - 59, optional)
For time zones please check TZ database name column for the supported values.
Ephemeral runners
You can configure runners to be ephemeral, in which case runners will be used only for one job. The feature should be used in conjunction with listening for the workflow job event. Please consider the following:
- The scale down lambda is still active, and should only remove orphan instances. But there is no strict check in place. So ensure you configure the
minimum_running_time_in_minutes
to a value that is high enough to get your runner booted and connected to avoid it being terminated before executing a job. - The messages sent from the webhook lambda to the scale-up lambda are by default delayed by SQS, to give available runners a chance to start the job before the decision is made to scale more runners. For ephemeral runners there is no need to wait. Set
delay_webhook_event
to0
. - All events in the queue will lead to a new runner created by the lambda. By setting
enable_job_queued_check
totrue
you can enforce a rule of only creating a runner if the event has a correlated queued job. Setting this can avoid creating useless runners. For example, a job getting cancelled before a runner was created or if the job was already picked up by another runner. We suggest using this in combination with a pool. - To ensure runners are created in the same order GitHub sends the events, by default we use a FIFO queue. This is mainly relevant for repo level runners. For ephemeral runners you can set
enable_fifo_build_queue
tofalse
. - Errors related to scaling should be retried via SQS. You can configure
job_queue_retention_in_seconds
andredrive_build_queue
to tune the behavior. We have no mechanism to avoid events never being processed, which means potentially no runner gets created and the job in GitHub times out in 6 hours.
The example for ephemeral runners is based on the default example. Have look at the diff to see the major configuration differences.
Prebuilt Images
This module also allows you to run agents from a prebuilt AMI to gain faster startup times. The module provides several examples to build your own custom AMI. To remove old images, an AMI housekeeper module can be used. See the AMI examples for more details.
Logging
The module uses AWS Lambda Powertools for logging. By default the log level is set to info
, by setting the log level to debug
the incoming events of the Lambda are logged as well.
Log messages contains at least the following keys:
messages
: The logged messagesenvironment
: The environment prefix provided via Terraformservice
: The lambdamodule
: The TypeScript module writing the log messagefunction-name
: The name of the lambda function (prefix + function name)github
: Depending on the lambda, contains GitHub contextrunner
: Depending on the lambda, specific context related to the runner
An example log message of the scale-up function:
{
"level": "INFO",
"message": "Received event",
"service": "runners-scale-up",
"timestamp": "2023-03-20T08:15:27.448Z",
"xray_trace_id": "1-6418161e-08825c2f575213ef760531bf",
"module": "scale-up",
"region": "eu-west-1",
"environment": "my-linux-x64",
"aws-request-id": "eef1efb7-4c07-555f-9a67-b3255448ee60",
"function-name": "my-linux-x64-scale-up",
"runner": {
"type": "Repo",
"owner": "test-runners/multi-runner"
},
"github": {
"event": "workflow_job",
"workflow_job_id": "1234"
}
}
Tracing
The distributed architecture of this application can make it difficult to troubleshoot. We support the option to enable tracing for all the lambda functions created by this application. To enable tracing, you can provide the tracing_config
option inside the root module or inner modules.
This tracing config generates timelines for following events:
- Basic lifecycle of lambda function
- Traces for Github API calls (can be configured by capture_http_requests).
- Traces for all AWS SDK calls
This feature has been disabled by default.
Multiple runner module in your AWS account
The watcher will act on all spot termination notificatins and log all onses relevant to the runner module. Therefor we suggest to only deploy the watcher once. You can either deploy the watcher by enabling in one of your deployments or deploy the watcher as a stand alone module.
Debugging
In case the setup does not work as intended, trace the events through this sequence:
- In the GitHub App configuration, the Advanced page displays all webhook events that were sent.
- In AWS CloudWatch, every lambda has a log group. Look at the logs of the
webhook
andscale-up
lambdas. - In AWS SQS you can see messages available or in flight.
- Once an EC2 instance is running, you can connect to it in the EC2 user interface using Session Manager (use
enable_ssm_on_runners = true
). Check the user data script usingcat /var/log/user-data.log
. By default several log files of the instances are streamed to AWS CloudWatch, look for a log group named<environment>/runners
. In the log group you should see at least the log streams for the user data installation and runner agent. - Registered instances should show up in the Settings - Actions page of the repository or organization (depending on the installation mode).
Experimental features
Termination watcher
This feature is in early stage and therefore disabled by default.
The termination watcher is currently watching for spot termination notifications. The module is only taken events into account for instances tagged with ghr:environment
by default when deployment the module as part of one of the main modules (root or multi-runner). The module can also be deployed stand-alone, in that case the tag filter needs to be tunned.
- Logs: The module will log all termination notifications. For each warning it will look up instance details and log the environment, instance type and time the instance is running. As well some other details.
- Metrics: Metrics are disabled by default, this to avoid costs. Once enabled a metric will be created for each warning with at least dimensions for the environment and instance type. THe metric name space can be configured via the variables. The metric name used is
SpotInterruptionWarning
.
Log example
Below an example of the the log messages created.
{
"level": "INFO",
"message": "Received spot notification warning:",
"environment": "default",
"instanceId": "i-0039b8826b3dcea55",
"instanceType": "c5.large",
"instanceLaunchTime": "2024-03-15T08:10:34.000Z",
"instanceRunningTimeInSeconds": 68,
"tags": [
{
"Key": "ghr:environment",
"Value": "default"
}
... all tags ...
]
}
Queue to publish workflow job events
This queue is an experimental feature to allow you to receive a copy of the wokflow_jobs events sent by the GitHub App. This can be used to calculate a matrix or monitor the system.
To enable the feature set enable_workflow_job_events_queue = true
. Be aware though, this feature is experimental!
Messages received on the queue are using the same format as published by GitHub wrapped in a property workflowJobEvent
.
export interface GithubWorkflowEvent {
workflowJobEvent: WorkflowJobEvent;
}
This extensible format allows more fields to be added if needed.
You can configure the queue by setting properties to workflow_job_events_queue_config
NOTE: By default, a runner AMI update requires a re-apply of this terraform config (the runner AMI ID is looked up by a terraform data source). To avoid this, you can use ami_id_ssm_parameter_name
to have the scale-up lambda dynamically lookup the runner AMI ID from an SSM parameter at instance launch time. Said SSM parameter is managed outside of this module (e.g. by a runner AMI build workflow).