Accelerating InnerSource at enterprise scale

with GitHub action runners in AWS cloud at Philips







Niek Palm

How do you picture Philips?

Probably this?

Maybe this?

Not this

Philips is a health technology company improving people's health and well-being through meaningful innovation

Our purpose is to improve people’s health and well-being. We aim to improve 2.5 billion lives per year by 2030

Software in Philips

  • Global Organisation

  • Cloud | Web | Mobile | Embedded

  • 6500+ Software Professionals

  • 100s of millions of lines of code

  • Regulated Medical Software

InnerSource Journey

InnerSource is a development methodology where engineers build proprietary software using best practices from large-scale open source projects.

How did we start?

🏛️ March 2020

👨🏽‍💻 InnerSource as default

✨ GitHub for source code

☁️ Preferred cloud AWS

🔌 Empower everyone with CI/CD

101 - GitHub Actions

  • Actions == GitHub CI/CD ++
  • Actions == CI/CD Lego bricks
  • Jobs are triggered by an event
  • Jobs require a runner to run
on: [push]
jobs:
  check-bats-version:
    runs-on: [ubuntu-latest]
    container: node:16
    steps:
      - uses: actions/checkout@v3
      - run: npx bats -v
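The rule that a job requires a runner can be sketched as label matching: a job is picked up by a runner that carries every label listed in `runs-on`. The snippet below is a minimal illustration under that assumption; the `Runner` shape and the function names are hypothetical and not part of any GitHub API.

```typescript
// Hypothetical runner description; field names are illustrative only.
interface Runner {
  name: string;
  labels: string[];
}

// A runner qualifies when it has all labels the job asks for in `runs-on`.
export function canRun(runsOn: string[], runner: Runner): boolean {
  const available = new Set(runner.labels);
  return runsOn.every((label) => available.has(label));
}

// Returns the first qualifying runner, or undefined when the job must wait.
export function findRunner(runsOn: string[], runners: Runner[]): Runner | undefined {
  return runners.find((runner) => canRun(runsOn, runner));
}
```

A job with `runs-on: [self-hosted, arm64]` would therefore stay queued until a runner with both labels is online, which is the gap the auto-scaling solution fills.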

🔌 Connectivity

⚙️ Hardware

💰 Costs

🔐 Control

Self-hosted runners give control

but how to get the same experience?

Manual?

💡 Idea

  • Cloud
  • Run on standard minimal VMs
  • Tailor OS / Arch
  • Scale up / down / zero
  • Connection to the enterprise network
  • Only pay for what's used

Event based

Scale based on workflow jobs

Serverless

low cost / low maintenance control plane

Treat as Cattle

Secure and no fire fighting

Networking

Bring your own connection

Cloud Solution

Serverless control plane that receives events from GitHub and scales new self-hosted runners on AWS EC2 Spot Instances

Terraform module with a working out-of-the-box configuration that can be tailored to specific use cases. AWS Lambdas built in TypeScript.

Scale up

  • GitHub App sends events via webhook
  • AWS API Gateway to receive events
  • AWS Lambda verifies event
  • AWS SQS for decoupling / delay
  • AWS Lambda to create EC2 runner
  • GitHub App for API access
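The "Lambda verifies event" step can be sketched as follows. GitHub signs each webhook body with HMAC SHA-256 and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex>`. The function names below are hypothetical, and the assumption that only `workflow_job` events with action `queued` lead to a scale-up message is an illustration, not a description of the module's actual code.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: computes the header value GitHub would send for a body.
export function signPayload(secret: string, body: string): string {
  return "sha256=" + createHmac("sha256", secret).update(body, "utf8").digest("hex");
}

// Compares the received header with the expected signature in constant time.
export function verifySignature(
  secret: string,
  body: string,
  header: string | undefined,
): boolean {
  if (!header) return false;
  const expected = Buffer.from(signPayload(secret, body), "utf8");
  const received = Buffer.from(header, "utf8");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Assumption: only queued workflow jobs should result in a message on the queue.
export function shouldScaleUp(eventName: string, payload: { action?: string }): boolean {
  return eventName === "workflow_job" && payload.action === "queued";
}
```

Events that pass both checks would be put on the SQS queue; everything else can be acknowledged and dropped without starting an instance.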

The runner

  • Support Spot and On-Demand
  • Create instance by CreateFleet API with type Instant.
  • Limit permission to the instance
  • Optional ephemeral
  • Optional bring your own AMI and custom cloud-init
  • Cached GitHub agent to improve boot time.
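As a rough sketch of the CreateFleet call with type `instant`, the function below builds the request parameters as a plain object. The field names follow the EC2 CreateFleet API; the function name, option names, and the choice to combine every instance type with every subnet are assumptions made for illustration.

```typescript
type CapacityType = "spot" | "on-demand";

// Subset of the EC2 CreateFleet request shape used in this sketch.
interface FleetRequest {
  Type: "instant";
  LaunchTemplateConfigs: {
    LaunchTemplateSpecification: { LaunchTemplateName: string; Version: string };
    Overrides: { InstanceType: string; SubnetId: string }[];
  }[];
  TargetCapacitySpecification: {
    TotalTargetCapacity: number;
    DefaultTargetCapacityType: CapacityType;
  };
}

// Hypothetical options object; names are illustrative only.
export function buildFleetRequest(opts: {
  launchTemplateName: string;
  instanceTypes: string[];
  subnetIds: string[];
  capacityType: CapacityType;
  count: number;
}): FleetRequest {
  // Offer every instance type in every subnet so EC2 can pick available capacity.
  const overrides = opts.instanceTypes.flatMap((InstanceType) =>
    opts.subnetIds.map((SubnetId) => ({ InstanceType, SubnetId })),
  );
  return {
    Type: "instant", // synchronous request: instances are created or the call reports errors
    LaunchTemplateConfigs: [
      {
        LaunchTemplateSpecification: {
          LaunchTemplateName: opts.launchTemplateName,
          Version: "$Default",
        },
        Overrides: overrides,
      },
    ],
    TargetCapacitySpecification: {
      TotalTargetCapacity: opts.count,
      DefaultTargetCapacityType: opts.capacityType,
    },
  };
}
```

The resulting object could be passed to the AWS SDK's CreateFleet command; switching between Spot and On-Demand is then a single option.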

Scale Down

  • No event
  • Self terminating ephemeral runners
  • EventBridge rule triggers regular scale-down checks
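Because no event signals that a runner is no longer needed, a periodic check has to decide which instances to remove. The sketch below shows one possible selection rule, assuming an idle runner becomes eligible after a minimum running time and that a configurable number of idle runners is kept warm. All names and the rule itself are hypothetical.

```typescript
// Hypothetical view of a runner as seen by the scale-down check.
interface RunnerState {
  instanceId: string;
  launchedAt: Date;
  busy: boolean;
}

// Returns the instance IDs to terminate, oldest eligible runners first.
export function selectRunnersToTerminate(
  runners: RunnerState[],
  now: Date,
  minRunningMinutes: number,
  keepIdle: number,
): string[] {
  const minAgeMs = minRunningMinutes * 60_000;
  const eligible = runners
    .filter((r) => !r.busy) // never terminate a runner that is executing a job
    .filter((r) => now.getTime() - r.launchedAt.getTime() >= minAgeMs)
    .sort((a, b) => a.launchedAt.getTime() - b.launchedAt.getTime());
  const surplus = Math.max(0, eligible.length - keepIdle);
  return eligible.slice(0, surplus).map((r) => r.instanceId);
}
```

Ephemeral runners terminate themselves after one job, so a check like this mainly cleans up runners that never received work.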

📢 DEMO

  • Create cloud resources
  • Connect cloud with GitHub
  • Run 40 jobs

Open Source

⭐ 1K+ stars

✨ 90+ contributors

❤️ 400+ Pull requests

🏆 Recommended by GitHub

Contribution

  • Support Windows
  • Support ARM
  • Support GHES
  • Better docs
  • Security improvements
  • Upgrades

Running at Scale

in Philips

Deployment

  • Deploy runners with the runners
  • Terragrunt to keep our Terraform dry
  • Connect to Philips with AWS Direct Connect
  • Work together with security to change firewall rules
  • Limit AWS access by permission boundaries

Deployment

Now, can we avoid using keys in CI?

  • Define OIDC provider for GitHub in AWS
  • Create role with trust based on claim
  • Define policies for role
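Trust based on a claim comes down to matching the token's `sub` claim against a pattern. For a branch build GitHub issues a subject of the form `repo:<owner>/<repo>:ref:refs/heads/<branch>`, and an IAM `StringLike` condition accepts `*` and `?` wildcards. The sketch below mimics that match locally; it is a simplified illustration, not how AWS evaluates policies internally, and the function names are hypothetical.

```typescript
// Simplified StringLike: `*` matches any run of characters, `?` exactly one.
export function stringLike(pattern: string, value: string): boolean {
  // Escape regex metacharacters except the two wildcards we translate below.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const regex = new RegExp("^" + escaped.replace(/\*/g, ".*").replace(/\?/g, ".") + "$");
  return regex.test(value);
}

// Builds the subject claim GitHub issues for a workflow run on a git ref.
export function subjectFor(repo: string, ref: string): string {
  return `repo:${repo}:ref:${ref}`;
}
```

A pattern such as `repo:philips-labs/*:ref:refs/heads/main` would then trust the main branch of every repository in one organisation, while pull request runs, whose subject ends in `:pull_request`, would not match.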

Deployment

Trust

{
    "Sid": "",
    "Effect": "Allow",
    "Principal": {
        "Federated": "arn:aws:iam::<id>:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
        "StringLike": {
            "token.actions.githubusercontent.com:sub": "<claim>"
        }
    }
}

Action

permissions:
  id-token: write # required to request the OIDC token

jobs:
  deploy:
    steps:
      - uses: aws-actions/configure-aws-credentials@v1
        with:
          role-to-assume: ${{ inputs.aws_role_to_assume }}
          aws-region: ${{ inputs.aws_region }}

Limit access by Permissions Boundaries

Define identity permission

{
  "Effect": "Allow",
  "Action": ["iam:*"],
  "Resource": "*"
}

Limit by boundary

{
  "Effect": "Allow",
  "Action": [
    "iam:DeleteRole"
  ],
  "Resource": "arn:...:role/github-runners/*"
}
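The effect of a permissions boundary is an intersection: a request succeeds only when both the identity policy and the boundary allow it. The sketch below illustrates this with the two statements above. It is a deliberately simplified model that ignores explicit denies, conditions, and other policy types; the account ID in the usage note is a placeholder, and all function names are hypothetical.

```typescript
// Minimal statement shape; only Allow statements are modelled in this sketch.
interface Statement {
  Effect: "Allow";
  Action: string[];
  Resource: string;
}

// Wildcard match where `*` stands for any run of characters.
function glob(pattern: string, value: string): boolean {
  const escaped = pattern.replace(/[.+^${}()|[\]\\?]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$").test(value);
}

function allows(statements: Statement[], action: string, resource: string): boolean {
  return statements.some(
    (s) => s.Action.some((a) => glob(a, action)) && glob(s.Resource, resource),
  );
}

// Effective permission = identity policy AND permissions boundary.
export function isPermitted(
  identity: Statement[],
  boundary: Statement[],
  action: string,
  resource: string,
): boolean {
  return allows(identity, action, resource) && allows(boundary, action, resource);
}
```

With an identity policy of `iam:*` on `*` and a boundary of `iam:DeleteRole` on a `role/github-runners/*` path, deleting `arn:aws:iam::123456789012:role/github-runners/ci` would be permitted, while `iam:CreateUser` or deleting a role outside that path would not.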

Scaling in and out

15K instances on an average day

Lessons learned

Speed

  • Caching GitHub runner binary
  • Pre-built AMI

CI DOS

Rate Limits

Network

Costs

Questions

# Resources

resource "website" "github_runners" {
  url = "github.com/philips-labs/terraform-aws-github-runner"
}

resource "website" "github_oidc" {
  url = "github.com/philips-labs/terraform-aws-github-oidc"
}

resource "website" "slides" {
  url = "github.com/philips-labs/2022-10-03_scaling-github-runners"
}

resource "contact" "niek" {
  github   = "@npalm"
  linkedin = "in/niekpalm/"
  twitter  = "@niekos77"
}






We're writing code

to change health technology


What are we doing here? Philips is a globally recognised brand; almost everyone in the world has heard of Philips. But you probably don't think of software. How do you picture Philips?

We build a lot of software at Philips. We have many different business units that historically have had little alignment.

InnerSource is key to our software strategy. At Philips we combine world-class tools to enable teams to focus on meaningful innovation that improves people's lives.

With our open source module we auto-scale GitHub self-hosted runners in the cloud, and bring the same experience to our developers as using standard hosted runners.

Ideally we would use the public runners, but we can't because

Connectivity is abstracted away from the solution. You bring the solution to your own network and take advantage of it.

  • GitHub App for events
  • AWS API Gateway to get events
  • AWS Lambda for event handling
  • AWS SQS for decoupling
  • AWS Lambda to scale up
  • GitHub App for API access
  • AWS EC2 (Spot) to run jobs
  • AWS Direct Connect for networking
  • AWS Lambda for scaling down

Topics we could cover:

  • PR checks automated
  • Automated release
  • Slack
  • Build a community

Runners per day over the last 3 months