A lesson in failure: Upgrade to Cluster Rebuild

Published: Mar 21, 2024 by Isaac Johnson

This is a different kind of blog post. I started it one day after realizing my cluster was simply falling down on resolving DNS to GitHub. All queries to api.github.com were going nowhere.

I initially thought it was GitHub’s fault, but their status page was all green. I then thought it might be my GitHub runner. In this writeup, we watch as I try more and more things.

As of this writing, the cluster is dead and I’ve had to rebuild it all. I wish I had backed things up before getting in too deep. That said, I hope you enjoy watching as I desperately try to improve things, yet fail and fail again.

Halfway through, you’ll see me pivot to a rebuild, and we cover trying to get data out of a crashing cluster.

The start of Issues

I’ve been running the SummerWind GitHub Actions controller for nearly two years now. Recently, after a cluster crash, my runners would not reconnect.

I saw errors about DNS

{"githubConfigUrl": "https://github.com/idjohnson/jekyll-blog"}
2024-02-28T22:12:24Z    INFO    listener-app    getting runner registration token       {"registrationTokenURL": "https://api.github.com/repos/idjohnson/jekyll-blog/actions/runners/registration-token"}
2024-02-28T22:12:54Z    ERROR   listener-app    Retryable client error  {"error": "Post \"https://api.github.com/repos/idjohnson/jekyll-blog/actions/runners/registration-token\": dial tcp: lookup api.github.com: i/o timeout", "method": "POST", "url": "https://api.github.com/repos/idjohnson/jekyll-blog/actions/runners/registration-token", "error": "request failed"}

I tried moving to my latest custom runner image:

$ kubectl edit RunnerDeployment new-jekyllrunner-deployment -o yaml
error: runnerdeployments.actions.summerwind.dev "new-jekyllrunner-deployment" could not be patched: Internal error occurred: failed calling webhook "mutate.runnerdeployment.actions.summerwind.dev": failed to call webhook: Post "https://actions-runner-controller-webhook.actions-runner-system.svc:443/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment?timeout=10s": no endpoints available for service "actions-runner-controller-webhook"
You can run `kubectl replace -f /tmp/kubectl-edit-295327076.yaml` to try this update again.
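
The "no endpoints available" part is the real tell: the webhook Service has no healthy pods behind it. A quick check I could have run (a sketch; the Service and namespace names come straight from the error message above):

$ kubectl get endpoints actions-runner-controller-webhook -n actions-runner-system
$ kubectl get pods -n actions-runner-system
$ kubectl get deploy -n actions-runner-system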

Just looking at the failing webhook configuration shows it is 582 days old:

$ kubectl get mutatingwebhookconfiguration
NAME                                                       WEBHOOKS   AGE
dapr-sidecar-injector                                      1          413d
rancher.cattle.io                                          2          353d
datadog-webhook                                            3          582d
cert-manager-webhook                                       1          582d
mutating-webhook-configuration                             8          353d
actions-runner-controller-mutating-webhook-configuration   4          582d
vault-agent-injector-cfg                                   1          372d

The last known good runner, while now disconnected, at least shows it had been there:

$ kubectl logs new-jekyllrunner-deployment-6t44l-ncpp2
Defaulted container "runner" out of: runner, docker
2024-02-27 11:12:35.38  NOTICE --- Runner init started with pid 7
2024-02-27 11:12:35.55  DEBUG --- Github endpoint URL https://github.com/
2024-02-27 11:12:42.353  DEBUG --- Passing --ephemeral to config.sh to enable the ephemeral runner.
2024-02-27 11:12:42.361  DEBUG --- Configuring the runner.

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication


√ Connected to GitHub

# Runner Registration




√ Runner successfully added
√ Runner connection is good

# Runner settings


√ Settings Saved.

2024-02-27 11:12:47.929  DEBUG --- Runner successfully configured.
{
  "agentId": 1983,
  "agentName": "new-jekyllrunner-deployment-6t44l-ncpp2",
  "poolId": 1,
  "poolName": "Default",
  "ephemeral": true,
  "serverUrl": "https://pipelinesghubeus2.actions.githubusercontent.com/qCliCWldvO6BBBswMSDoRLbsUjSq35ZPPBfSYwyE4OOX7bEFxU/",
  "gitHubUrl": "https://github.com/idjohnson/jekyll-blog",
  "workFolder": "/runner/_work"
2024-02-27 11:12:47.950  DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2024-02-27 11:12:47.952  DEBUG --- Waiting until Docker is available or the timeout of 120 seconds is reached
}CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
2024-02-27 11:12:54.655  NOTICE --- WARNING LATEST TAG HAS BEEN DEPRECATED. SEE GITHUB ISSUE FOR DETAILS:
2024-02-27 11:12:54.657  NOTICE --- https://github.com/actions/actions-runner-controller/issues/2056

√ Connected to GitHub

Current runner version: '2.309.0'
2024-02-27 11:12:57Z: Listening for Jobs
Runner update in progress, do not shutdown runner.
Downloading 2.313.0 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should be back online within 10 seconds.
Runner update process finished.
Runner listener exit because of updating, re-launch runner after successful update
Update finished successfully.
Restarting runner...

√ Connected to GitHub

Current runner version: '2.313.0'
2024-02-27 11:13:33Z: Listening for Jobs
2024-02-27 13:04:40Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.

Looking into the docs, I found they moved on some time ago to “ARC”, the “Actions Runner Controller”.

Custom image

I depend on a lot of things I would rather have baked into my image. This includes Pulumi binaries, OpenTofu and some older Ruby libraries. Ruby, in particular, is a beast to have to install over and over.

My old image was based on the now-deprecated ‘summerwind’ image, so I would need to build out a new GitHub runner image for ARC.

I’ll spare you the endless back and forth as I discovered differences in the base image, but here is the before:

FROM summerwind/actions-runner:latest

RUN sudo apt update -y \
  && umask 0002 \
  && sudo apt install -y ca-certificates curl apt-transport-https lsb-release gnupg

# Install MS Key
RUN curl -sL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/microsoft.gpg > /dev/null

# Add MS Apt repo
RUN umask 0002 && echo "deb [arch=amd64] https://packages.microsoft.com/repos/azure-cli/ focal main" | sudo tee /etc/apt/sources.list.d/azure-cli.list

# Install Azure CLI
RUN sudo apt update -y \
  && umask 0002 \
  && sudo apt install -y azure-cli awscli ruby-full

# Install Pulumi
RUN curl -fsSL https://get.pulumi.com | sh

# Install Homebrew
RUN /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# OpenTF

# Install Golang 1.19
RUN eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" \
  && brew install go@1.19
#echo 'export PATH="/home/linuxbrew/.linuxbrew/opt/go@1.19/bin:$PATH"'

RUN eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" \
  && brew install opentofu

RUN sudo cp /home/linuxbrew/.linuxbrew/bin/tofu /usr/local/bin/

RUN sudo chown runner /usr/local/bin

RUN sudo chmod 777 /var/lib/gems/2.7.0

RUN sudo chown runner /var/lib/gems/2.7.0

# Install Expect and SSHPass

RUN sudo apt update -y \
  && umask 0002 \
  && sudo apt install -y sshpass expect

# save time per build
RUN umask 0002 \
  && gem install bundler -v 2.4.22

# Limitations in newer jekyll
RUN umask 0002 \
  && gem install jekyll --version="~> 4.2.0"
  
RUN sudo rm -rf /var/lib/apt/lists/*

#harbor.freshbrewed.science/freshbrewedprivate/myghrunner:1.1.16

and the new

$ cat Dockerfile
FROM ghcr.io/actions/actions-runner:latest

RUN sudo apt update -y \
  && umask 0002 \
  && sudo apt install -y ca-certificates curl apt-transport-https lsb-release gnupg git build-essential

# Install MS Key
RUN curl -sLS https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor | sudo tee /etc/apt/keyrings/microsoft.gpg > /dev/null

# Add MS Apt repo
RUN umask 0002 && export AZ_DIST=$(lsb_release -cs) && echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/microsoft.gpg] https://packages.microsoft.com/repos/azure-cli/ $AZ_DIST main" | sudo tee /etc/apt/sources.list.d/azure-cli.list

# Install Azure CLI
RUN sudo apt update -y \
  && umask 0002 \
  && sudo apt install -y azure-cli awscli ruby-full

# Install Pulumi
RUN curl -fsSL https://get.pulumi.com | sh

# Install Homebrew
RUN /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# OpenTF

# Install Golang 1.19
RUN eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" \
  && brew install go@1.19
#echo 'export PATH="/home/linuxbrew/.linuxbrew/opt/go@1.19/bin:$PATH"'

RUN eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)" \
  && brew install opentofu

RUN sudo cp /home/linuxbrew/.linuxbrew/bin/tofu /usr/local/bin/

RUN sudo chown runner /usr/local/bin

# Ruby RVM

RUN sudo chmod 777 /var/lib/gems/*

RUN sudo chown runner /var/lib/gems/*

# Install Expect and SSHPass

RUN sudo apt update -y \
  && umask 0002 \
  && sudo apt install -y sshpass expect

# save time per build
RUN umask 0002 \
  && gem install bundler -v 2.4.22

# Limitations in newer jekyll
RUN umask 0002 \
  && gem install jekyll --version="~> 4.2.0"

RUN sudo rm -rf /var/lib/apt/lists/*

#harbor.freshbrewed.science/freshbrewedprivate/myghrunner:2.0.0

This is one of the rare cases where I push right to main, since this change is for the runner image itself.

builder@DESKTOP-QADGF36:~/Workspaces/jekyll-blog/ghRunnerImage$ git add Dockerfile
builder@DESKTOP-QADGF36:~/Workspaces/jekyll-blog/ghRunnerImage$ git branch --show-current
main
builder@DESKTOP-QADGF36:~/Workspaces/jekyll-blog/ghRunnerImage$ git commit -m "GH Action Runner for 2.0.0"
[main f55bd73] GH Action Runner for 2.0.0
 1 file changed, 9 insertions(+), 7 deletions(-)
builder@DESKTOP-QADGF36:~/Workspaces/jekyll-blog/ghRunnerImage$ git push
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 16 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 644 bytes | 644.00 KiB/s, done.
Total 4 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/idjohnson/jekyll-blog
   4172b0f..f55bd73  main -> main

It honestly took the better part of a morning (while still doing my day job and checking occasionally) to get the image pushed. In fact, the only way to actually resurrect Harbor and keep it going was to do a helm upgrade (see Harbor section later).

That said, at least the part that pushes to Harbor completed

/content/images/2024/03/ghrfixes-03.png

I don’t need to dig into the NAS step too much, as I know I have restricted which IPs can upload files and the Docker host (T100) is not one of them.

Testing on Dev Cluster

As always, let’s first test on the dev cluster.

ARC is based on a controller which we can install with Helm. Luckily, in our more modern world of OCI charts, we don’t need to add a Helm repository; we can just reference the OCI chart directly.

$ helm install arc --namespace "arc-systems" --create-namespace oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
Pulled: ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller:0.8.3
Digest: sha256:cc604073ea3e64896a86e02ce5c72113a84b81e5a5515758849a660ecfa49eea
NAME: arc
LAST DEPLOYED: Wed Feb 28 15:52:02 2024
NAMESPACE: arc-systems
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing gha-runner-scale-set-controller.

Your release is named arc.

That sets up the provider, just like Summerwind.dev used to do.

The next part is to create a “Runner Scale Set”.

It is here that I care about the runner image.

First, I set the main variables I’ll need:

$ INSTALLATION_NAME="arc-runner-set"
$ NAMESPACE="arc-runners"
$ GITHUB_CONFIG_URL="https://github.com/idjohnson/jekyll-blog"
$ GITHUB_PAT="ghp_ASDFASDFASDFASDFASDAFASDAFASDAAF"

I’m a bit stuck on this next part, as their Helm chart doesn’t actually let me set an ImagePullSecret for the runner container (see here).

I’ll install with the default image and circle back on this (though I wish I could have set --set template.spec.containers[].image=harbor... --set template.spec.containers[].imagePullSecret=myharborreg).
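
For reference, the shape I was after looks roughly like this - a sketch only, assuming the chart passes template.spec through as a full Pod spec (which was exactly the sticking point); the secret name and image tag are my own placeholders:

# Sketch: values file for gha-runner-scale-set with a custom runner image.
# "myharborreg" would be a pre-created docker-registry secret in arc-runners.
cat <<'EOF' > runner-set-values.yaml
githubConfigUrl: https://github.com/idjohnson/jekyll-blog
githubConfigSecret:
  github_token: ghp_xxxxxxxxxxxxxxxx
template:
  spec:
    imagePullSecrets:
      - name: myharborreg
    containers:
      - name: runner
        image: harbor.freshbrewed.science/freshbrewedprivate/myghrunner:2.0.0
        command: ["/home/runner/run.sh"]
EOF

helm install arc-runner-set --namespace arc-runners --create-namespace \
  -f runner-set-values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set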

I ran the helm install

$ helm install "${INSTALLATION_NAME}" --namespace "${NAMESPACE}" --create-namespace --set githubConfigUrl="${GITHUB_CONFIG_URL}" --set githubConfigSecret.github_token="${GITHUB_PAT}" oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
Pulled: ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set:0.8.3
Digest: sha256:83f2f36a07038f120340012352268fcb2a06bbf00b0c2c740500a5383db5f91a
NAME: arc-runner-set
LAST DEPLOYED: Wed Feb 28 16:01:53 2024
NAMESPACE: arc-runners
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing gha-runner-scale-set.

Your release is named arc-runner-set.

Unlike SummerWind, we won’t see runner pods in our namespace; rather, there is a “listener” in the arc-systems namespace:

$ kubectl get pods -n arc-systems
NAME                                     READY   STATUS    RESTARTS   AGE
arc-gha-rs-controller-7d5f8cbd9b-pcdv6   1/1     Running   0          11m
arc-runner-set-754b578d-listener         1/1     Running   0          102s

Which now shows up in a new section under GitHub runners:

/content/images/2024/03/ghrfixes-05.png

I’ll need to use a new “runs-on” so let’s create a test workflow in .github/workflows/testing-ghrunnerset.yml:

name: Actions Runner Controller Demo
on:
  workflow_dispatch:

jobs:
  Explore-GitHub-Actions:
    # You need to use the INSTALLATION_NAME from the previous step
    runs-on: arc-runner-set
    steps:
    - run: echo "🎉 This job uses runner scale set runners!"
    - run: |
         set -x
         which az
         which ruby
         which tofu

Once pushed to main, I could fire it off, but then it just sat there.

The logs on my runner scale set listener showed YET AGAIN that we could not reach ‘api.github.com’:

$ kubectl logs arc-runner-set-754b578d-listener -n arc-systems
2024-02-28T22:12:24Z    INFO    listener-app    app initialized
2024-02-28T22:12:24Z    INFO    listener-app    Starting listener
2024-02-28T22:12:24Z    INFO    listener-app    refreshing token        {"githubConfigUrl": "https://github.com/idjohnson/jekyll-blog"}
2024-02-28T22:12:24Z    INFO    listener-app    getting runner registration token       {"registrationTokenURL": "https://api.github.com/repos/idjohnson/jekyll-blog/actions/runners/registration-token"}
2024-02-28T22:12:54Z    ERROR   listener-app    Retryable client error  {"error": "Post \"https://api.github.com/repos/idjohnson/jekyll-blog/actions/runners/registration-token\": dial tcp: lookup api.github.com: i/o timeout", "method": "POST", "url": "https://api.github.com/repos/idjohnson/jekyll-blog/actions/runners/registration-token", "error": "request failed"}

/content/images/2024/03/ghrfixes-06.png

I tried to test with a dnsutils pod

$ kubectl run -it --rm --restart=Never --image=infoblox/dnstools:latest dnstools
If you don't see a command prompt, try pressing enter.
dnstools# host api.github.com
;; connection timed out; no servers could be reached
dnstools# host foxnews.com
;; connection timed out; no servers could be reached
dnstools#
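
Both lookups failing suggests the problem sits between CoreDNS and the upstream resolver rather than anything GitHub-specific. A couple of checks from the same dnstools pod would isolate it (a sketch; 10.43.0.10 is the usual k3s cluster-DNS service IP and is an assumption here):

dnstools# cat /etc/resolv.conf                  # which nameserver the pod is actually using
dnstools# host api.github.com 10.43.0.10        # ask cluster DNS explicitly
dnstools# host api.github.com 8.8.8.8           # bypass CoreDNS and ask a public resolver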

Harbor

Harbor kept crashing. At one point it was my own doing, having dropped a NIC on the master node, but I could not get core and jobservice to stay up:

Every 2.0s: kubectl get pods -l app=harbor                                                                                              DESKTOP-QADGF36: Wed Feb 28 13:40:52 2024

NAME                                           READY   STATUS    RESTARTS        AGE
harbor-registry2-exporter-59755fb475-trkqv     1/1     Running   0               129d
harbor-registry2-trivy-0                       1/1     Running   0               129d
harbor-registry2-portal-5c45d99f69-nt6sj       1/1     Running   0               129d
harbor-registry2-registry-6fcf6fdf49-glbg2     2/2     Running   0               129d
harbor-registry2-redis-0                       1/1     Running   1 (5h26m ago)   5h35m
harbor-registry2-core-5886799cd6-7npv2         1/1     Running   0               5h
harbor-registry2-jobservice-6bf7d6f5d6-wbvf6   0/1     Error     1 (39s ago)     85s

This caused the endless timeouts I alluded to earlier

/content/images/2024/03/ghrfixes-04.png

I set aside my existing values

$ helm get values harbor-registry2 -o yaml > harbor-registry2.yaml
$ kubectl get ingress harbor-registry2-ingress -o yaml > ingressbackup
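
In hindsight, I would grab more than just the values before an upgrade like this - a quick sketch of the kind of snapshot I wish I had taken (release and label names as used above):

$ helm get values harbor-registry2 --all -o yaml > harbor-registry2.all.yaml
$ helm get manifest harbor-registry2 > harbor-registry2.manifest.yaml
$ kubectl get secret,cm,pvc -l app=harbor -o yaml > harbor-objects-backup.yaml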

To get a before and after, I noted the image tags of the currently running system:

$ helm get values harbor-registry2 --all | grep 'tag: '
    tag: v2.9.0
      tag: v2.9.0
    tag: v2.9.0
    tag: v2.9.0
    tag: v2.9.0
    tag: v2.9.0
      tag: v2.9.0
      tag: v2.9.0
      tag: v2.9.0
    tag: v2.9.0

I then upgraded Harbor (as I did last time) to the latest official chart version using my saved values:

$ helm upgrade -f harbor-registry2.yaml harbor-registry2 harbor/harbor
Release "harbor-registry2" has been upgraded. Happy Helming!
NAME: harbor-registry2
LAST DEPLOYED: Wed Feb 28 13:44:09 2024
NAMESPACE: default
STATUS: deployed
REVISION: 4
TEST SUITE: None
NOTES:
Please wait for several minutes for Harbor deployment to complete.
Then you should be able to visit the Harbor portal at https://harbor.freshbrewed.science
For more details, please visit https://github.com/goharbor/harbor

Which eventually worked


kubectl get pods -l app=harbor                                                                                              DESKTOP-QADGF36: Wed Feb 28 13:47:31 2024

NAME                                          READY   STATUS    RESTARTS      AGE
harbor-registry2-portal-657748d5c7-4fhlc      1/1     Running   0             3m10s
harbor-registry2-exporter-6cdf6679b5-g7ntn    1/1     Running   0             3m13s
harbor-registry2-registry-64ff4b97f5-59qwg    2/2     Running   0             3m9s
harbor-registry2-trivy-0                      1/1     Running   0             112s
harbor-registry2-redis-0                      1/1     Running   0             2m1s
harbor-registry2-core-7568f658cd-xsl46        1/1     Running   0             3m13s
harbor-registry2-jobservice-768fd77d6-v8tk8   1/1     Running   3 (68s ago)   3m13s

And I could sanity check the image tags of the instances to see they were all upgraded

$ helm get values harbor-registry2 --all | grep 'tag: '
    tag: v2.10.0
      tag: v2.10.0
    tag: v2.10.0
    tag: v2.10.0
    tag: v2.10.0
    tag: v2.10.0
      tag: v2.10.0
      tag: v2.10.0
      tag: v2.10.0
    tag: v2.10.0

Linux Instance

First we make a folder, then download the latest runner

builder@builder-T100:~$ mkdir actions-runner && cd actions-runner
builder@builder-T100:~/actions-runner$ curl -o actions-runner-linux-x64-2.313.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.313.0/actions-runner-linux-x64-2.313.0.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  179M  100  179M    0     0  55.4M      0  0:00:03  0:00:03 --:--:-- 65.7M
builder@builder-T100:~/actions-runner$  tar xzf ./actions-runner-linux-x64-2.313.0.tar.gz

Next, we configure the runner (the token is provided in the repo's runner settings page).
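
As an aside, the registration token can also be pulled with a PAT instead of copying it from the settings page - a sketch using the same API endpoint the cluster listener kept failing to reach (assumes the PAT has admin rights on the repo):

$ curl -s -X POST -H "Authorization: token ghp_xxxxxxxxxxxxxxxx" \
    https://api.github.com/repos/idjohnson/jekyll-blog/actions/runners/registration-token | jq -r .token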

Then I went to register the GH runner:

builder@builder-T100:~/actions-runner$ ./config.sh --url https://github.com/idjohnson/jekyll-blog --token ASDFASDFASDFASDAFASDASDAFAS

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication


√ Connected to GitHub

# Runner Registration

Enter the name of the runner group to add this runner to: [press Enter for Default]

Enter the name of runner: [press Enter for builder-T100]

This runner will have the following labels: 'self-hosted', 'Linux', 'X64'
Enter any additional labels (ex. label-1,label-2): [press Enter to skip] dockerhost

√ Runner successfully added
√ Runner connection is good

# Runner settings

Enter name of work folder: [press Enter for _work]

√ Settings Saved.

I could then run it interactively to bring it live:

builder@builder-T100:~/actions-runner$ ./run.sh

√ Connected to GitHub

Current runner version: '2.313.0'
2024-02-28 13:18:13Z: Listening for Jobs

2024-02-28 13:18:16Z: Running job: build_deploy_test
2024-02-28 13:20:50Z: Job build_deploy_test completed with result: Failed
^CExiting...
Runner listener exit with 0 return code, stop the service, no retry needed.
Exiting runner...

/content/images/2024/03/ghrfixes-01.png

Once I took care of adding a multitude of libraries and missing packages, I could install this as a service (so I would not need to be logged in)

builder@builder-T100:~/actions-runner$ sudo ./svc.sh install
Creating launch runner in /etc/systemd/system/actions.runner.idjohnson-jekyll-blog.builder-T100.service
Run as user: builder
Run as uid: 1000
gid: 1000
Created symlink /etc/systemd/system/multi-user.target.wants/actions.runner.idjohnson-jekyll-blog.builder-T100.service → /etc/systemd/system/actions.runner.idjohnson-jekyll-blog.builder-T100.service.
builder@builder-T100:~/actions-runner$ sudo ./svc.sh start

/etc/systemd/system/actions.runner.idjohnson-jekyll-blog.builder-T100.service
● actions.runner.idjohnson-jekyll-blog.builder-T100.service - GitHub Actions Runner (idjohnson-jekyll-blog.builder-T100)
     Loaded: loaded (/etc/systemd/system/actions.runner.idjohnson-jekyll-blog.builder-T100.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-02-28 07:37:28 CST; 13ms ago
   Main PID: 573545 (runsvc.sh)
      Tasks: 2 (limit: 9082)
     Memory: 1.0M
        CPU: 6ms
     CGroup: /system.slice/actions.runner.idjohnson-jekyll-blog.builder-T100.service
             ├─573545 /bin/bash /home/builder/actions-runner/runsvc.sh
             └─573549 ./externals/node16/bin/node ./bin/RunnerService.js

Feb 28 07:37:28 builder-T100 systemd[1]: Started GitHub Actions Runner (idjohnson-jekyll-blog.builder-T100).
Feb 28 07:37:28 builder-T100 runsvc.sh[573545]: .path=/home/builder/.nvm/versions/node/v18.16.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
Feb 28 07:37:28 builder-T100 runsvc.sh[573549]: Starting Runner listener with startup type: service
Feb 28 07:37:28 builder-T100 runsvc.sh[573549]: Started listener process, pid: 573570
Feb 28 07:37:28 builder-T100 runsvc.sh[573549]: Started running service

/content/images/2024/03/ghrfixes-02.png

Things are still stuck…

builder@DESKTOP-QADGF36:~/Workspaces/jekyll-blog$ kubectl get pods --all-namespaces | grep dns
kube-system                 coredns-d76bd69b-k57p5                                      1/1     Running            3 (285d ago)       580d
builder@DESKTOP-QADGF36:~/Workspaces/jekyll-blog$ kubectl delete pod coredns-d76bd69b-k57p5 -n kube-system
pod "coredns-d76bd69b-k57p5" deleted
builder@DESKTOP-QADGF36:~/Workspaces/jekyll-blog$ kubectl get pods --all-namespaces | grep dns
kube-system                 coredns-d76bd69b-7cmnh                                      0/1     ContainerCreating   0                  5s
builder@DESKTOP-QADGF36:~/Workspaces/jekyll-blog$ kubectl get pods --all-namespaces | grep dns
kube-system                 coredns-d76bd69b-7cmnh                                      0/1     ContainerCreating   0                  7s
builder@DESKTOP-QADGF36:~/Workspaces/jekyll-blog$ kubectl get pods --all-namespaces | grep dns
kube-system                 coredns-d76bd69b-7cmnh                                      1/1     Running            0                  10s
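
Bouncing CoreDNS alone didn't fix it. In hindsight, before giving up, the next thing worth checking would have been what CoreDNS forwards to - a sketch (k3s normally forwards to the node's /etc/resolv.conf; the label selector is the standard one but still an assumption):

$ kubectl get configmap coredns -n kube-system -o yaml      # look at the Corefile "forward" line
$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
$ cat /etc/resolv.conf                                      # on the node itself - the upstream CoreDNS inherits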

Rebuild

This sucks. There is no going back now.

Now we rebuild.

/content/images/2024/03/now-we-rebuild-rebuild.gif

Okay, so let’s start by using the newer NAS for the NFS mounts.

The old NAS was sassynassy at 192.168.1.129; the new one is sirnasilot at 192.168.1.116.

I made the dirs:

mkdir -p /mnt/nfs/k3snfs
mkdir -p /mnt/psqlbackups
mkdir -p /mnt/nfs/snalPrimary01

Then I added three lines for the mounts:

builder@builder-HP-EliteBook-745-G5:~$ cat /etc/fstab
# /etc/fstab: static file system information.
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
# / was on /dev/nvme0n1p2 during installation
UUID=f5a264b7-35c9-4a37-a466-634771df4d94 /               ext4    errors=remount-ro 0       1
# /boot/efi was on /dev/nvme0n1p1 during installation
UUID=9C7E-4200  /boot/efi       vfat    umask=0077      0       1
/swapfile                                 none            swap    sw              0       0
192.168.1.129:/volume1/k3snfs   /mnt/nfs/k3snfs nfs     auto,nofail,noatime,nolock,intr,tcp,actimeo=1800        0       0
192.168.1.129:/volume1/postgres-prod-dbbackups  /mnt/psqlbackups nfs    auto,nofail,noatime,nolock,intr,tcp,actimeo=1800        0       0
192.168.1.116:/volume1/k3sPrimary01     /mnt/nfs/snalPrimary01 nfs    auto,nofail,noatime,nolock,intr,tcp,actimeo=1800        0       0

I could test on any host by running sudo mount -a and listing the files:

hp@hp-HP-EliteBook-850-G2:~$ sudo mount -a
hp@hp-HP-EliteBook-850-G2:~$ ls -ltra /mnt/nfs/snalPrimary01/
total 4
drwxrwxrwx 1 root root   12 Mar  1 05:56 .
drwxr-xr-x 4 root root 4096 Mar  1 05:57 ..
hp@hp-HP-EliteBook-850-G2:~$ ls -ltra /mnt/nfs/k3snfs/
total 353780
drwxrwxrwx  2 root root       4096 Nov 16  2020 '#recycle'
drwxrwxrwx  2 1026 users      4096 Nov 16  2020  test
drwxrwxrwx  2 1024 users      4096 Nov 18  2020  default-beospvc-pvc-f36c2986-ab0b-4978-adb6-710d4698e170
drwxrwxrwx  2 1024 users      4096 Nov 19  2020  default-beospvc6-pvc-099fe2f3-2d63-4df5-ba65-4c7f3eba099e
drwxrwxrwx  2 1024 users      4096 Nov 19  2020  default-fedorawsiso-pvc-cad0ce95-9af3-4cb4-959d-d8b944de47ce
drwxrwxrwx  3 1024 users      4096 Dec  3  2020  default-data-redis-ha-1605552203-server-0-pvc-35be9319-4b0b-429e-82f6-6fbf3afab721
drwxrwxrwx  3 1024 users      4096 Dec  6  2020  default-data-redis-ha-1605552203-server-2-pvc-728cf90d-b725-44b9-8a2d-73ddae84abfa
drwxrwxrwx  3 1024 users      4096 Dec 17  2020  default-data-redis-ha-1605552203-server-1-pvc-17c79f00-ac73-454f-a664-e02de9158bd5
drwxrwxrwx  2 1024 users      4096 Dec 27  2020  default-redis-data-bitnami-harbor-redis-master-0-pvc-73a7e833-90fb-41ab-b42c-7a1e7fd5aad3
drwxrwxrwx  2 1024 users      4096 Jan  2  2021  default-mongo-release-mongodb-pvc-ecb4cc4f-153e-4eff-a5e7-5972b48e6f37
-rw-r--r--  1 1024 users 362073000 Dec  6  2021  k3s-backup-master-20211206.tgz
drwxrwxrwx  2 1024 users      4096 Jul  8  2022  default-redis-data-redis-slave-1-pvc-3c569803-3275-443d-9b65-be028ce4481f
drwxrwxrwx  2 1024 users      4096 Jul  8  2022  default-redis-data-redis-master-0-pvc-bdf57f20-661c-4982-aebd-a1bb30b44830
drwxrwxrwx  2 1024 users      4096 Jul  8  2022  default-redis-data-redis-slave-0-pvc-651cdaa3-d321-45a3-adf3-62224c341fba
drwxrwxrwx  2 1024 users      4096 Oct 31  2022  backups
drwxrwxrwx 18 root root       4096 Oct 31  2022  .
drwxrwxrwx  2 1024 users    118784 Mar  1 02:09  postgres-backups
drwxr-xr-x  4 root root       4096 Mar  1 05:57  ..

The new mount uses Btrfs, which should be more resilient than the old ext4.

/content/images/2024/03/ghrfixes-07.png

For this release, we’ll use the (at present) latest build of 1.26.14: https://github.com/k3s-io/k3s/releases/tag/v1.26.14%2Bk3s1

That is v1.26.14+k3s1, which URL-encodes as v1.26.14%2Bk3s1.

Perhaps using

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.26.14%2Bk3s1" K3S_KUBECONFIG_MODE="644" INSTALL_K3S_EXEC=" --no-deploy traefik --tls-san 73.242.50.46" sh -

My first try failed

builder@builder-HP-EliteBook-745-G5:~$ curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.26.14%2Bk3s1" K3S_KUBECONFIG_MODE="644" INSTALL_K3S_EXEC=" --no-deploy traefik --tls-san 73.242.50.46" sh -
[INFO]  Using v1.26.14%2Bk3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.26.14%2Bk3s1/sha256sum-amd64.txt
[INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.26.14%2Bk3s1/k3s
[INFO]  Verifying binary download
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Skipping installation of SELinux RPM
[INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, command exists in PATH at /usr/bin/kubectl
[INFO]  Skipping /usr/local/bin/crictl symlink to k3s, command exists in PATH at /usr/bin/crictl
[INFO]  Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO]  systemd: Starting k3s
Job for k3s.service failed because the control process exited with error code.
See "systemctl status k3s.service" and "journalctl -xeu k3s.service" for details.
builder@builder-HP-EliteBook-745-G5:~$ systemctl status k3s.service
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Fri 2024-03-01 06:16:33 CST; 611ms ago
       Docs: https://k3s.io
    Process: 47993 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null (code=exited, status=0/SUCCESS)
    Process: 47995 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 47996 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 47997 ExecStart=/usr/local/bin/k3s server --no-deploy traefik --tls-san 73.242.50.46 (code=exited, status=1/FAILURE)
   Main PID: 47997 (code=exited, status=1/FAILURE)
        CPU: 24ms

Mar 01 06:16:33 builder-HP-EliteBook-745-G5 k3s[47997]:    --kube-proxy-arg value                     (agent/flags) Customized flag for kube-proxy process
Mar 01 06:16:33 builder-HP-EliteBook-745-G5 k3s[47997]:    --protect-kernel-defaults                  (agent/node) Kernel tuning behavior. If set, error >
Mar 01 06:16:33 builder-HP-EliteBook-745-G5 k3s[47997]:    --secrets-encryption                       Enable secret encryption at rest
Mar 01 06:16:33 builder-HP-EliteBook-745-G5 k3s[47997]:    --enable-pprof                             (experimental) Enable pprof endpoint on supervisor >
Mar 01 06:16:33 builder-HP-EliteBook-745-G5 k3s[47997]:    --rootless                                 (experimental) Run rootless
Mar 01 06:16:33 builder-HP-EliteBook-745-G5 k3s[47997]:    --prefer-bundled-bin                       (experimental) Prefer bundled userspace binaries ov>
Mar 01 06:16:33 builder-HP-EliteBook-745-G5 k3s[47997]:    --selinux                                  (agent/node) Enable SELinux in containerd [$K3S_SEL>
Mar 01 06:16:33 builder-HP-EliteBook-745-G5 k3s[47997]:    --lb-server-port value                     (agent/node) Local port for supervisor client load->
Mar 01 06:16:33 builder-HP-EliteBook-745-G5 k3s[47997]:
Mar 01 06:16:33 builder-HP-EliteBook-745-G5 k3s[47997]: time="2024-03-01T06:16:33-06:00" level=fatal msg="flag provided but not defined: -no-deploy"

The error is that last line:

msg="flag provided but not defined: -no-deploy"

Seems the newer way is

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server" sh -s - --disable-traefik
$ curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.26.14%2Bk3s1" K3S_KUBECONFIG_MODE="644" INSTALL_K3S_EXEC="--disable-traefik --tls-san 73.242.50.46" sh -
[INFO]  Using v1.26.14%2Bk3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.26.14%2Bk3s1/sha256sum-amd64.txt
[INFO]  Skipping binary downloaded, installed k3s matches hash
[INFO]  Skipping installation of SELinux RPM
[INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, command exists in PATH at /usr/bin/kubectl
[INFO]  Skipping /usr/local/bin/crictl symlink to k3s, command exists in PATH at /usr/bin/crictl
[INFO]  Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO]  No change detected so skipping service start

The new “word” is just “disable”, not “no-deploy”
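
Side by side, the change is just in how the skip flag is spelled - a quick recap sketch using the same version pin:

# Old installer syntax (removed - this is what produced "flag provided but not defined: -no-deploy")
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--no-deploy traefik --tls-san 73.242.50.46" sh -

# Current syntax: "--disable traefik"
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.26.14%2Bk3s1" K3S_KUBECONFIG_MODE="644" INSTALL_K3S_EXEC="--disable traefik --tls-san 73.242.50.46" sh -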

$ INSTALL_K3S_VERSION="v1.26.14%2Bk3s1" K3S_KUBECONFIG_MODE="644" INSTALL_K3S_EXEC="server" ./k3s.sh --disable traefik --tls-san 73.242.50.46
[INFO]  Using v1.26.14%2Bk3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.26.14%2Bk3s1/sha256sum-amd64.txt
[INFO]  Skipping binary downloaded, installed k3s matches hash
[INFO]  Skipping installation of SELinux RPM
[INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, command exists in PATH at /usr/bin/kubectl
[INFO]  Skipping /usr/local/bin/crictl symlink to k3s, command exists in PATH at /usr/bin/crictl
[INFO]  Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO]  systemd: Starting k3s

Still trying

builder@builder-HP-EliteBook-745-G5:~$ curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.26.14%2Bk3s1" K3S_KUBECONFIG_MODE="644" INSTALL_K3S_EXEC="--disable traefik --tls-san 73.242.50.46" sh -
[INFO]  Using v1.26.14%2Bk3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.26.14%2Bk3s1/sha256sum-amd64.txt
[INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.26.14%2Bk3s1/k3s
[INFO]  Verifying binary download
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Skipping installation of SELinux RPM
[INFO]  Skipping /usr/local/bin/kubectl symlink to k3s, command exists in PATH at /usr/bin/kubectl
[INFO]  Skipping /usr/local/bin/crictl symlink to k3s, command exists in PATH at /usr/bin/crictl
[INFO]  Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO]  systemd: Starting k3s
builder@builder-HP-EliteBook-745-G5:~$ systemctl status k3s.service
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-03-01 06:36:42 CST; 7s ago
       Docs: https://k3s.io
    Process: 2863 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null (code=exited, status=0/SUCCESS)
    Process: 2865 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 2866 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 2867 (k3s-server)
      Tasks: 30
     Memory: 553.0M
        CPU: 19.107s
     CGroup: /system.slice/k3s.service
             ├─2867 "/usr/local/bin/k3s server" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
             └─2894 "containerd " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >

Mar 01 06:36:49 builder-HP-EliteBook-745-G5 k3s[2867]: I0301 06:36:49.395407    2867 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system:>
Mar 01 06:36:49 builder-HP-EliteBook-745-G5 k3s[2867]: I0301 06:36:49.395410    2867 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system:>
Mar 01 06:36:49 builder-HP-EliteBook-745-G5 k3s[2867]: I0301 06:36:49.395489    2867 secure_serving.go:213] Serving securely on 127.0.0.1:10259
Mar 01 06:36:49 builder-HP-EliteBook-745-G5 k3s[2867]: I0301 06:36:49.395494    2867 shared_informer.go:270] Waiting for caches to sync for client-ca::kube-system::exten>
Mar 01 06:36:49 builder-HP-EliteBook-745-G5 k3s[2867]: I0301 06:36:49.395454    2867 shared_informer.go:270] Waiting for caches to sync for client-ca::kube-system::exten>
Mar 01 06:36:49 builder-HP-EliteBook-745-G5 k3s[2867]: I0301 06:36:49.395569    2867 tlsconfig.go:240] "Starting DynamicServingCertificateController"
Mar 01 06:36:49 builder-HP-EliteBook-745-G5 k3s[2867]: time="2024-03-01T06:36:49-06:00" level=info msg="Handling backend connection request [builder-hp-elitebook-745-g5]"
Mar 01 06:36:49 builder-HP-EliteBook-745-G5 k3s[2867]: I0301 06:36:49.495856    2867 shared_informer.go:277] Caches are synced for RequestHeaderAuthRequestController
Mar 01 06:36:49 builder-HP-EliteBook-745-G5 k3s[2867]: I0301 06:36:49.496011    2867 shared_informer.go:277] Caches are synced for client-ca::kube-system::extension-apis>
Mar 01 06:36:49 builder-HP-EliteBook-745-G5 k3s[2867]: I0301 06:36:49.496083    2867 shared_informer.go:277] Caches are synced for client-ca::kube-system::exte

I had a bad prior install that had messed up the kubeconfig; I fixed that and could see the first node:

builder@builder-HP-EliteBook-745-G5:~$ cp ~/.kube/config ~/.kube/config-bak
builder@builder-HP-EliteBook-745-G5:~$ cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
builder@builder-HP-EliteBook-745-G5:~$ kubectl get nodes
NAME                          STATUS   ROLES                  AGE   VERSION
builder-hp-elitebook-745-g5   Ready    control-plane,master   98s   v1.26.14+k3s1
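
To use the new cluster from my workstation, the same k3s.yaml just needs the server address swapped from loopback to the node's IP - a sketch (the IP and target filename here are my own):

$ scp builder@192.168.1.33:/etc/rancher/k3s/k3s.yaml ~/.kube/config-hp745
$ sed -i 's/127.0.0.1/192.168.1.33/' ~/.kube/config-hp745
$ KUBECONFIG=~/.kube/config-hp745 kubectl get nodes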

Adwerx AWX

I managed to pull my values file before it crashed again, enough to get the right Ingress annotations and password.

isaac@isaac-MacBookAir:~$ helm get values -n adwerx adwerxawx
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/isaac/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/isaac/.kube/config
USER-SUPPLIED VALUES:
affinity: null
default_admin_password: null
default_admin_user: null
defaultAdminExistingSecret: null
defaultAdminPassword: xxxxxxxxxxxxxxxx
defaultAdminUser: admin
extraConfiguration: '# INSIGHTS_URL_BASE = "https://example.org"'
extraVolumes: []
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: ansible/awx
  tag: 17.1.0
ingress:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    ingress.kubernetes.io/proxy-body-size: "0"
    ingress.kubernetes.io/ssl-redirect: "true"
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.org/client-max-body-size: "0"
    nginx.org/proxy-connect-timeout: "600"
    nginx.org/proxy-read-timeout: "600"
  defaultBackend: true
  enabled: true
  hosts:
  - host: awx.freshbrewed.science
    paths:
    - /
  tls:
  - hosts:
    - awx.freshbrewed.science
    secretName: awx-tls
...

It uses this Adwerx AWX chart

I just did a helm repo add and update:

$ helm repo add adwerx https://adwerx.github.io/charts
"adwerx" already exists with the same configuration, skipping
$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "adwerx" chart repository
...Successfully got an update from the "portainer" chart repository
...Successfully got an update from the "ngrok" chart repository
...Successfully got an update from the "zabbix-community" chart repository
...Successfully got an update from the "novum-rgi-helm" chart repository
...Successfully got an update from the "btungut" chart repository
...Successfully got an update from the "actions-runner-controller" chart repository
...Successfully got an update from the "kuma" chart repository
...Successfully got an update from the "rhcharts" chart repository
...Successfully got an update from the "akomljen-charts" chart repository
...Successfully got an update from the "kubecost" chart repository
...Successfully got an update from the "openproject" chart repository
...Successfully got an update from the "castai-helm" chart repository
...Successfully got an update from the "lifen-charts" chart repository
...Successfully got an update from the "elastic" chart repository
...Successfully got an update from the "datadog" chart repository
...Successfully got an update from the "nginx-stable" chart repository
...Successfully got an update from the "jetstack" chart repository
...Successfully got an update from the "signoz" chart repository
...Successfully got an update from the "uptime-kuma" chart repository
...Successfully got an update from the "incubator" chart repository
...Successfully got an update from the "grafana" chart repository
...Successfully got an update from the "gitlab" chart repository
...Unable to get an update from the "freshbrewed" chart repository (https://harbor.freshbrewed.science/chartrepo/library):
        failed to fetch https://harbor.freshbrewed.science/chartrepo/library/index.yaml : 404 Not Found
...Unable to get an update from the "myharbor" chart repository (https://harbor.freshbrewed.science/chartrepo/library):
        failed to fetch https://harbor.freshbrewed.science/chartrepo/library/index.yaml : 404 Not Found
...Successfully got an update from the "nfs" chart repository
...Successfully got an update from the "kube-state-metrics" chart repository
...Successfully got an update from the "opencost-charts" chart repository
...Successfully got an update from the "dapr" chart repository
...Successfully got an update from the "jfelten" chart repository
...Successfully got an update from the "opencost" chart repository
...Successfully got an update from the "confluentinc" chart repository
...Successfully got an update from the "azure-samples" chart repository
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "openfunction" chart repository
...Successfully got an update from the "sonarqube" chart repository
...Successfully got an update from the "longhorn" chart repository
...Successfully got an update from the "spacelift" chart repository
...Successfully got an update from the "rook-release" chart repository
...Successfully got an update from the "hashicorp" chart repository
...Successfully got an update from the "harbor" chart repository
...Successfully got an update from the "kiwigrid" chart repository
...Successfully got an update from the "openzipkin" chart repository
...Successfully got an update from the "gitea-charts" chart repository
...Successfully got an update from the "sumologic" chart repository
...Successfully got an update from the "crossplane-stable" chart repository
...Successfully got an update from the "ananace-charts" chart repository
...Successfully got an update from the "open-telemetry" chart repository
...Successfully got an update from the "makeplane" chart repository
...Unable to get an update from the "epsagon" chart repository (https://helm.epsagon.com):
        Get "https://helm.epsagon.com/index.yaml": dial tcp: lookup helm.epsagon.com on 172.22.64.1:53: server misbehaving
...Successfully got an update from the "argo-cd" chart repository
...Successfully got an update from the "rancher-latest" chart repository
...Successfully got an update from the "newrelic" chart repository
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈

Then install

$ helm install -n adwerxawx --create-namespace adwerxawx -f ./adwerx.awx.values adwerx/awx
NAME: adwerxawx
LAST DEPLOYED: Sun Mar  3 16:02:10 2024
NAMESPACE: adwerxawx
STATUS: deployed
REVISION: 1

To rebuild, I just followed our guide from August 2022.

I never wrote how to make an Org using a Kubernetes Job there, so here is that example:

$ cat awx_createorg.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: awxcreateorg2
  namespace: adwerxawx
spec:
  backoffLimit: 4
  template:
    spec:
      containers:
      - name: test
        image: alpine
        envFrom:
        - secretRef:
            name: adwerxawx
        command:
        - bin/sh
        - -c
        - |
          apk --no-cache add curl
          apk --no-cache add jq
          curl --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json"  http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/organizations/ --data "{ \"name\": \"onprem\", \"description\": \"on prem hosts\", \"max_hosts\": 0, \"custom_virtualenv\": null }"

          curl --silent --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X GET -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/organizations/ | jq '.results[] | select(.name=="onprem") | .id' > ./ORGID
          cat ./ORGID

          curl --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X GET -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/organizations/ | jq
      restartPolicy: Never

Then we apply

$ kubectl apply -f awx_createorg.yaml -n adwerxawx
job.batch/awxcreateorg2 created

And the results:

/content/images/2024/03/ghrfixes-08.png

I need a credential type ID to create the GH credential - namely “SCM”, which is usually 2.

/content/images/2024/03/ghrfixes-09.png

It also works for a local curl

$ curl --silent -X GET -H "Content-Type: application/json" --user admin:asdfsadfasdfasdf https://awx.freshbrewed.science/api/v2/credential_types/ | jq '.results[] | select(.kind=="scm") | .id'
2

I’ll apply the other AWX jobs, and when they finish:

$ kubectl get jobs -n adwerxawx
NAME             COMPLETIONS   DURATION   AGE
awxcreateorg1    1/1           8s         32m
awxcreateorg2    1/1           8s         24m
awxcreatescm1    1/1           8s         35s
awxcreateproj1   1/1           9s         17s

I’ll then have Orgs, a working SCM credential and, lastly, a project set up.

/content/images/2024/03/ghrfixes-10.png

Creating the Inventories and hosts was a breeze

$ cat awx_createk3shosts.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: awxcreatek8hosts5
  namespace: adwerxawx
spec:
  backoffLimit: 4
  template:
    spec:
      containers:
      - name: test
        image: alpine
        envFrom:
        - secretRef:
            name: adwerxawx
        command:
        - bin/sh
        - -c
        - |
          apk --no-cache add curl
          apk --no-cache add jq

          set -x

          # Org ID for final curl
          curl --silent --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X GET -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/organizations/ | jq '.results[] | select(.name=="onprem") | .id' > ./ORGID
          export ORGID=`cat ./ORGID | tr -d '\n'`

          # Create Kubernetes Nodes Inv
          # curl --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/inventories/ --data "{\"name\": \"Kubernetes Nodes\",\"description\": \"Kubernetes On Prem Nodes\", \"organization\": $ORGID, \"variables\": \"---\" }"

          # Inv ID for final curl
          curl --silent --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X GET -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/inventories/ | jq '.results[] | select(.name=="Kubernetes Nodes") | .id' > ./INVID
          export INVID=`cat ./INVID | tr -d '\n'`

          # Anna MBAir
          curl --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/hosts/ --data "{ \"name\": \"AnnaMacbook\", \"description\": \"Annas Macbook Air (primary)\", \"inventory\": $INVID, \"enabled\": true, \"variables\": \"---\nansible_host: 192.168.1.81\nansible_connection: ssh\"}"

          # builder-MacBookPro2
          curl --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/hosts/ --data "{ \"name\": \"builder-MacBookPro2\", \"description\": \"builder-MacBookPro2 (worker)\", \"inventory\": $INVID, \"enabled\": true, \"variables\": \"---\nansible_host: 192.168.1.159\nansible_connection: ssh\"}"

          # isaac-macbookpro
          curl --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/hosts/ --data "{ \"name\": \"isaac-macbookpro\", \"description\": \"isaac-macbookpro (worker)\", \"inventory\": $INVID, \"enabled\": true, \"variables\": \"---\nansible_host: 192.168.1.74\nansible_connection: ssh\"}"

          # builder-hp-elitebook-745-g5
          curl --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/hosts/ --data "{ \"name\": \"builder-hp-elitebook-745-g5\", \"description\": \"builder-hp-elitebook-745-g5 (primary)\", \"inventory\": $INVID, \"enabled\": true, \"variables\": \"---\nansible_host: 192.168.1.33\nansible_connection: ssh\"}"

          # builder-hp-elitebook-850-g1
          curl --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/hosts/ --data "{ \"name\": \"builder-hp-elitebook-850-g1\", \"description\": \"builder-hp-elitebook-850-g1 (worker)\", \"inventory\": $INVID, \"enabled\": true, \"variables\": \"---\nansible_host: 192.168.1.36\nansible_connection: ssh\"}"

          # hp-hp-elitebook-850-g2
          curl --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/hosts/ --data "{ \"name\": \"hp-hp-elitebook-850-g2\", \"description\": \"hp-hp-elitebook-850-g2 (worker) - bad battery\", \"inventory\": $INVID, \"enabled\": true, \"variables\": \"---\nansible_host: 192.168.1.57\nansible_connection: ssh\"}"

          # builder-HP-EliteBook-850-G2
          curl --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/hosts/ --data "{ \"name\": \"builder-HP-EliteBook-850-G2\", \"description\": \"builder-HP-EliteBook-850-G2 (worker) - bad fan\", \"inventory\": $INVID, \"enabled\": true, \"variables\": \"---\nansible_host: 192.168.1.215\nansible_connection: ssh\"}"


      restartPolicy: Never

After applying:

$ kubectl apply -n adwerxawx -f ./awx_createk3shosts.yaml
job.batch/awxcreatek8hosts5 created

/content/images/2024/03/ghrfixes-11.png

I created the Credentials I would need

$ cat awx_createhostpw.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: awxcreatehostpw2
  namespace: adwerxawx
spec:
  backoffLimit: 4
  template:
    spec:
      containers:
      - name: test
        image: alpine
        envFrom:
        - secretRef:
            name: adwerxawx
        command:
        - bin/sh
        - -c
        - |
          apk --no-cache add curl
          apk --no-cache add jq

          # Org ID for final curl
          curl --silent --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X GET -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/organizations/ | jq '.results[] | select(.name=="onprem") | .id' > ./ORGID
          export ORGID=`cat ./ORGID | tr -d '\n'`

          # Get Machine ID (usually 1)
          curl --silent -X GET -H "Content-Type: application/json" --user admin:Redliub\$1 http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/credential_types/ | jq '.results[] | select(.name=="Machine") | .id' > ./HOSTTYPEID
          export HOSTTYPEID=`cat ./HOSTTYPEID | tr -d '\n'`

          # Standard Builder User
          curl --silent --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/credentials/ --data "{\"credential_type\": $HOSTTYPEID, \"inputs\": { \"username\": \"builder\", \"password\": \"xxxpasswordxxx\" }, \"name\": \"Builder Credential\", \"organization\": $ORGID}"

          # hp builder
          curl --silent --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/credentials/ --data "{\"credential_type\": $HOSTTYPEID, \"inputs\": { \"username\": \"hp\", \"password\": \"xxxpasswordxxx\" }, \"name\": \"HP Credential\", \"organization\": $ORGID}"

          # Isaac user
          curl --silent --user $AWX_ADMIN_USER:$AWX_ADMIN_PASSWORD -X POST -H "Content-Type: application/json" http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP/api/v2/credentials/ --data "{\"credential_type\": $HOSTTYPEID, \"inputs\": { \"username\": \"isaac\", \"password\": \"xxxpasswordxxx\" }, \"name\": \"Isaac Credential\", \"organization\": $ORGID}"


      restartPolicy: Never

Then I tested using a shell pwd command:

/content/images/2024/03/ghrfixes-12.png

I recreated the Nightly Blog Post check

/content/images/2024/03/ghrfixes-14.png

Then set a schedule

/content/images/2024/03/ghrfixes-13.png

Forgejo

I was hoping this one might have survived - I had set up Forgejo using a MySQL backend hosted on the HA NAS.

The values file was still local, so launching it again was as easy as a helm install:

$ helm upgrade --install -n forgejo --create-namespace forgejo -f /home/builder/forgego.values oci://codeberg.org/forgejo-contrib/forgejo
Release "forgejo" does not exist. Installing it now.
Pulled: codeberg.org/forgejo-contrib/forgejo:4.0.1
Digest: sha256:a1a403f7fa30ff1353a50a86aa7232faa4a5a219bc2fba4cae1c69c878c4f7af
NAME: forgejo
LAST DEPLOYED: Sun Mar  3 18:23:53 2024
NAMESPACE: forgejo
STATUS: deployed
REVISION: 1
NOTES:
1. Get the application URL by running these commands:
  echo "Visit http://127.0.0.1:3000 to use your application"
  kubectl --namespace forgejo port-forward svc/forgejo-http 3000:3000

And the ingress was noted in the original blog article

So I just recreated that

$ cat forgejo.ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    ingress.kubernetes.io/proxy-body-size: "0"
    ingress.kubernetes.io/ssl-redirect: "true"
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.org/client-max-body-size: "0"
    nginx.org/proxy-connect-timeout: "600"
    nginx.org/proxy-read-timeout: "600"
  generation: 1
  labels:
    app: forgejo
    app.kubernetes.io/instance: forgejo
    app.kubernetes.io/name: forgejo
  name: forgejo
spec:
  rules:
  - host: forgejo.freshbrewed.science
    http:
      paths:
      - backend:
          service:
            name: forgejo-http
            port:
              number: 3000
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - forgejo.freshbrewed.science
    secretName: forgejo-tls

$ kubectl apply -f forgejo.ingress.yaml -n forgejo
ingress.networking.k8s.io/forgejo created
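
To confirm cert-manager picked up the ingress and minted the TLS secret, a quick check like this should suffice (assuming the letsencrypt-prod ClusterIssuer is healthy):

$ kubectl get ingress,certificate -n forgejo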

While Forgejo did come up, the repos behind it were missing. It seems the repository data itself lived in PVCs that are now lost.

Weeks later

I actually took a break from blogging for a while.

To get GitHub runners working again, I set one up directly on my dockerhost

 1164  mkdir githubrunner
 1165  cd githubrunner/
 1166  ls
 1167  cd ..
 1168  mkdir actions-runner && cd actions-runner
 1169  curl -o actions-runner-linux-x64-2.313.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.313.0/actions-runner-linux-x64-2.313.0.tar.gz
 1170  ./config.sh --url https://github.com/idjohnson/jekyll-blog --token asdfasdfasdfasdfasdfadsf
 1171  ./run.sh
 1172  ls
 1173  cat svc.sh
 1174  ls
 1175  sudo apt update && sudo apt install -y ca-certificates curl apt-transport-https lsb-release gnupg
 1176  umask 0002 && echo "deb [arch=amd64] https://packages.microsoft.com/repos/azure-cli/ focal main" | sudo tee /etc/apt/sources.list.d/azure-cli.list
 1177  sudo apt update -y   && umask 0002   && sudo apt install -y azure-cli awscli ruby-full
 1178  curl -fsSL https://get.pulumi.com | sh
 1179  eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"   && brew install go@1.19
 1180  eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"   && brew install opentofu
 1181  sudo cp /home/linuxbrew/.linuxbrew/bin/tofu /usr/local/bin/
 1182  sudo chmod 777 /var/lib/gems/*
 1183  sudo chown runner /var/lib/gems/*
 1184  sudo apt update -y   && umask 0002   && sudo apt install -y sshpass expect
 1185  umask 0002   && gem install bundler -v 2.4.22
 1186  umask 0002   &&  sudo gem install bundler -v 2.4.22
 1187  sudo gem install jekyll --version="~> 4.2.0"
 1188  history
 1189  sudo apt update && sudo apt install -y gnupg git build-essential
 1190  sudo ./svc.sh install
 1191  sudo ./svc.sh start
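
Distilled from that history, the essential steps were roughly the following. Note the tar extraction, which the history above doesn’t show but is required, and the registration token is a placeholder you pull from the repo’s Actions runner settings:

mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-x64-2.313.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.313.0/actions-runner-linux-x64-2.313.0.tar.gz
tar xzf actions-runner-linux-x64-2.313.0.tar.gz
./config.sh --url https://github.com/idjohnson/jekyll-blog --token REGISTRATION_TOKEN
sudo ./svc.sh install   # install as a systemd service
sudo ./svc.sh start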

Later, I also needed to install a missing Python library

1339  pip install --upgrade urllib3 --user

/content/images/2024/03/ghrfixes-15.png

In time, with the new replacement cluster, I launched a fresh Actions Runner system:

builder@DESKTOP-QADGF36:~$ helm list -n actions-runner-system
NAME                            NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                              APP VERSION
actions-runner-controller       actions-runner-system   1               2024-03-04 18:58:01.415392351 -0600 CST deployed        actions-runner-controller-0.23.7   0.27.6
builder@DESKTOP-QADGF36:~$ helm get values actions-runner-controller -n actions-runner-system
USER-SUPPLIED VALUES:
authSecret:
  create: true
  github_token: ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
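
For reference, a release matching the values shown above can be stood up with something along these lines (the PAT is a placeholder, and cert-manager needs to be in place first for the controller’s webhooks):

$ helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
$ helm upgrade --install actions-runner-controller actions-runner-controller/actions-runner-controller \
    --namespace actions-runner-system --create-namespace \
    --set authSecret.create=true \
    --set authSecret.github_token=ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx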

After recreating the AWS secret it references, I could apply the RunnerDeployment
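
That secret just needs the two keys the env block below expects; recreating it is a one-liner along these lines (values are placeholders):

$ kubectl create secret generic awsjekyll \
    --from-literal=USER_NAME=AKIAxxxxxxxxxxxxxxxx \
    --from-literal=PASSWORD=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx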

 builder@DESKTOP-QADGF36:~$ kubectl get runnerdeployment my-jekyllrunner-deployment
NAME                         ENTERPRISE   ORGANIZATION   REPOSITORY              GROUP   LABELS                           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
my-jekyllrunner-deployment                               idjohnson/jekyll-blog           ["my-jekyllrunner-deployment"]   1         1         1            1           15d
builder@DESKTOP-QADGF36:~$ kubectl get runnerdeployment my-jekyllrunner-deployment -o yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"actions.summerwind.dev/v1alpha1","kind":"RunnerDeployment","metadata":{"annotations":{},"name":"my-jekyllrunner-deployment","namespace":"default"},"spec":{"template":{"spec":{"dockerEnabled":true,"env":[{"name":"AWS_DEFAULT_REGION","value":"us-east-1"},{"name":"AWS_ACCESS_KEY_ID","valueFrom":{"secretKeyRef":{"key":"USER_NAME","name":"awsjekyll"}}},{"name":"AWS_SECRET_ACCESS_KEY","valueFrom":{"secretKeyRef":{"key":"PASSWORD","name":"awsjekyll"}}}],"image":"harbor.freshbrewed.science/freshbrewedprivate/myghrunner:1.1.16","imagePullPolicy":"IfNotPresent","imagePullSecrets":[{"name":"myharborreg"}],"labels":["my-jekyllrunner-deployment"],"repository":"idjohnson/jekyll-blog"}}}}
  creationTimestamp: "2024-03-05T01:07:10Z"
  generation: 19
  name: my-jekyllrunner-deployment
  namespace: default
  resourceVersion: "2598657"
  uid: 5f767ec8-2b9d-4f37-b042-b30078872f0e
spec:
  effectiveTime: null
  replicas: 1
  selector: null
  template:
    metadata: {}
    spec:
      dockerEnabled: true
      dockerdContainerResources: {}
      env:
      - name: AWS_DEFAULT_REGION
        value: us-east-1
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            key: USER_NAME
            name: awsjekyll
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            key: PASSWORD
            name: awsjekyll
      image: harbor.freshbrewed.science/freshbrewedprivate/myghrunner:1.1.16
      imagePullPolicy: IfNotPresent
      imagePullSecrets:
      - name: myharborreg
      labels:
      - my-jekyllrunner-deployment
      repository: idjohnson/jekyll-blog
      resources: {}
status:
  availableReplicas: 1
  desiredReplicas: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

I kept at the old server, powering it on for a few minutes at a time before it crashed again, managing to run helm list, pull values for charts, fetch secrets, and the like.

I would tgz the results and stash them on the NFS mount that backs the PVCs
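
Each brief window of uptime looked something like this; the paths and release names here are representative rather than exact:

# Grab whatever state I could before the node fell over again
mkdir -p /tmp/salvage
helm list -A -o yaml > /tmp/salvage/helm-list.yaml
helm get values forgejo -n forgejo -o yaml > /tmp/salvage/forgejo.values.yaml
kubectl get secrets -A -o yaml > /tmp/salvage/all-secrets.yaml
tar czf /mnt/nfs/k8s-salvage-$(date +%Y%m%d-%H%M).tgz -C /tmp salvage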

/content/images/2024/03/ghrfixes-16.png

Today

Today I have “Int33” to replace it.

The Master node is now a surprisingly nimble HP EliteBook 745 with a Ryzen 5 CPU.

It’s been running for 20 days without issue.

I think the last course of action here is to “graduate” it to professional monitoring. My Datadog has been offline since the old cluster was finally terminated on March 11th.

/content/images/2024/03/ghrfixes-17.png

I’m going to follow the latest guide here.

I need to add the Datadog Helm repo and update

builder@DESKTOP-QADGF36:~$ helm repo add datadog https://helm.datadoghq.com
"datadog" already exists with the same configuration, skipping
builder@DESKTOP-QADGF36:~$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Unable to get an update from the "myharbor" chart repository (https://harbor.freshbrewed.science/chartrepo/library):
        failed to fetch https://harbor.freshbrewed.science/chartrepo/library/index.yaml : 404 Not Found
...Unable to get an update from the "freshbrewed" chart repository (https://harbor.freshbrewed.science/chartrepo/library):
        failed to fetch https://harbor.freshbrewed.science/chartrepo/library/index.yaml : 404 Not Found
...Successfully got an update from the "azure-samples" chart repository
...Successfully got an update from the "confluentinc" chart repository
...Successfully got an update from the "opencost" chart repository
...Successfully got an update from the "actions-runner-controller" chart repository
...Successfully got an update from the "hashicorp" chart repository
...Successfully got an update from the "adwerx" chart repository
...Successfully got an update from the "jfelten" chart repository
...Successfully got an update from the "akomljen-charts" chart repository
...Successfully got an update from the "opencost-charts" chart repository
...Successfully got an update from the "dapr" chart repository
...Successfully got an update from the "gitea-charts" chart repository
...Successfully got an update from the "spacelift" chart repository
...Successfully got an update from the "lifen-charts" chart repository
...Successfully got an update from the "makeplane" chart repository
...Successfully got an update from the "sonarqube" chart repository
...Successfully got an update from the "openfunction" chart repository
...Successfully got an update from the "openproject" chart repository
...Successfully got an update from the "harbor" chart repository
...Unable to get an update from the "epsagon" chart repository (https://helm.epsagon.com):
        Get "https://helm.epsagon.com/index.yaml": dial tcp: lookup helm.epsagon.com on 172.22.64.1:53: server misbehaving
...Successfully got an update from the "rancher-latest" chart repository
...Successfully got an update from the "crossplane-stable" chart repository
...Successfully got an update from the "newrelic" chart repository
...Successfully got an update from the "gitlab" chart repository
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "portainer" chart repository
...Successfully got an update from the "nfs" chart repository
...Successfully got an update from the "ngrok" chart repository
...Successfully got an update from the "kuma" chart repository
...Successfully got an update from the "kube-state-metrics" chart repository
...Successfully got an update from the "btungut" chart repository
...Successfully got an update from the "zabbix-community" chart repository
...Successfully got an update from the "rhcharts" chart repository
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "novum-rgi-helm" chart repository
...Successfully got an update from the "longhorn" chart repository
...Successfully got an update from the "nginx-stable" chart repository
...Successfully got an update from the "kubecost" chart repository
...Successfully got an update from the "elastic" chart repository
...Successfully got an update from the "castai-helm" chart repository
...Successfully got an update from the "rook-release" chart repository
...Successfully got an update from the "sumologic" chart repository
...Successfully got an update from the "kiwigrid" chart repository
...Successfully got an update from the "jetstack" chart repository
...Successfully got an update from the "signoz" chart repository
...Successfully got an update from the "openzipkin" chart repository
...Successfully got an update from the "datadog" chart repository
...Successfully got an update from the "argo-cd" chart repository
...Successfully got an update from the "uptime-kuma" chart repository
...Successfully got an update from the "ananace-charts" chart repository
...Successfully got an update from the "incubator" chart repository
...Successfully got an update from the "grafana" chart repository
...Successfully got an update from the "open-telemetry" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈

Next, I create a secret with my Datadog API key and App key

$ kubectl create secret generic datadog-secret --from-literal api-key=xxxxxxxxxxxxxxxxxxxxxxxxxxbb2 --from-literal app-key=xxxxxxxxxxxxxxxxxxxxb31
secret/datadog-secret created

And a basic values file to reference that secret

$ cat datadog-values.yaml
datadog:
 apiKeyExistingSecret: datadog-secret
 appKeyExistingSecret: datadog-secret
 site: datadoghq.com

I can then install as a daemonset using those values

$ helm install my-datadog-release -f datadog-values.yaml --set targetSystem=linux datadog/datadog
NAME: my-datadog-release
LAST DEPLOYED: Wed Mar 20 20:06:12 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Datadog agents are spinning up on each node in your cluster. After a few
minutes, you should see your agents starting in your event stream:
    https://app.datadoghq.com/event/explorer
You disabled creation of Secret containing API key, therefore it is expected
that you create Secret named 'datadog-secret' which includes a key called 'api-key' containing the API key.

###################################################################################
####   WARNING: Cluster-Agent should be deployed in high availability mode     ####
###################################################################################

The Cluster-Agent should be in high availability mode because the following features
are enabled:
* Admission Controller

To run in high availability mode, our recommendation is to update the chart
configuration with:
* set `clusterAgent.replicas` value to `2` replicas .
* set `clusterAgent.createPodDisruptionBudget` to `true`.

I want logs and network monitoring.

The best way, I have found, to update that is to pull the latest values and edit them

$ helm get values my-datadog-release --all -o yaml > datadog-values.yaml.old
$ helm get values my-datadog-release --all -o yaml > datadog-values.yaml
$ vi datadog-values.yaml

I can now change those to true where I desire

$ diff -C5 datadog-values.yaml datadog-values.yaml.old
*** datadog-values.yaml 2024-03-20 20:08:31.391061041 -0500
--- datadog-values.yaml.old     2024-03-20 20:08:34.741063092 -0500
***************
*** 426,439 ****
    logLevel: INFO
    logs:
      autoMultiLineDetection: false
      containerCollectAll: false
      containerCollectUsingFiles: true
!     enabled: true
    namespaceLabelsAsTags: {}
    networkMonitoring:
!     enabled: true
    networkPolicy:
      cilium:
        dnsSelector:
          toEndpoints:
          - matchLabels:
--- 426,439 ----
    logLevel: INFO
    logs:
      autoMultiLineDetection: false
      containerCollectAll: false
      containerCollectUsingFiles: true
!     enabled: false
    namespaceLabelsAsTags: {}
    networkMonitoring:
!     enabled: false
    networkPolicy:
      cilium:
        dnsSelector:
          toEndpoints:
          - matchLabels:
***************
*** 468,478 ****
      processCollection: false
      processDiscovery: true
      stripProcessArguments: false
    prometheusScrape:
      additionalConfigs: []
!     enabled: enabled
      serviceEndpoints: false
      version: 2
    remoteConfiguration:
      enabled: true
    sbom:
--- 468,478 ----
      processCollection: false
      processDiscovery: true
      stripProcessArguments: false
    prometheusScrape:
      additionalConfigs: []
!     enabled: false
      serviceEndpoints: false
      version: 2
    remoteConfiguration:
      enabled: true
    sbom:
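
As an aside, the same flips can likely be made without dumping and editing the full values file at all, using helm’s --reuse-values plus targeted --set overrides (these are the standard chart keys visible in the diff above):

$ helm upgrade my-datadog-release datadog/datadog \
    --reuse-values \
    --set datadog.logs.enabled=true \
    --set datadog.networkMonitoring.enabled=true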

Then I feed the updated values file back in to enable them

$ helm upgrade --install my-datadog-release -f datadog-values.yaml --set targetSystem=linux datadog/datadog
Release "my-datadog-release" has been upgraded. Happy Helming!
NAME: my-datadog-release
LAST DEPLOYED: Wed Mar 20 20:09:57 2024
NAMESPACE: default
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
Datadog agents are spinning up on each node in your cluster. After a few
minutes, you should see your agents starting in your event stream:
    https://app.datadoghq.com/event/explorer
You disabled creation of Secret containing API key, therefore it is expected
that you create Secret named 'datadog-secret' which includes a key called 'api-key' containing the API key.

###################################################################################
####   WARNING: Cluster-Agent should be deployed in high availability mode     ####
###################################################################################

The Cluster-Agent should be in high availability mode because the following features
are enabled:
* Admission Controller

To run in high availability mode, our recommendation is to update the chart
configuration with:
* set `clusterAgent.replicas` value to `2` replicas .
* set `clusterAgent.createPodDisruptionBudget` to `true`.

The first time up, I saw errors in the logs about invalid keys:

2024-03-21 01:11:46 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:75 in healthHandler) | Healthcheck failed on: [forwarder]
2024-03-21 01:11:52 UTC | CORE | ERROR | (comp/forwarder/defaultforwarder/transaction/transaction.go:366 in internalProcess) | API Key invalid, dropping transaction for https://7-51-0-app.agent.datadoghq.com/api/v1/check_run
2024-03-21 01:11:52 UTC | CORE | ERROR | (comp/forwarder/defaultforwarder/transaction/transaction.go:366 in internalProcess) | API Key invalid, dropping transaction for https://7-51-0-app.agent.datadoghq.com/api/v2/series

I had flip-flopped APP and API keys (darf!)

I edited the secret and deleted the existing pods to get them to rotate and take on fresh secrets

$ kubectl edit secret datadog-secret
secret/datadog-secret edited

$ kubectl get pods --all-namespaces | grep -i dog
default                 my-datadog-release-pmlgs                             3/4     Running             0             3m59s
default                 my-datadog-release-txhnd                             3/4     Running             0             3m59s
default                 my-datadog-release-crgfj                             3/4     Running             0             3m59s
default                 my-datadog-release-cluster-agent-78d889c7c7-z8nn6    1/1     Running             0             4m

$ kubectl delete pod my-datadog-release-pmlgs & kubectl delete pod my-datadog-release-cluster-agent-78d889c7c7-z8nn6 & kubectl delete pod my-datadog-release-txhnd & kubectl delete pod  my-datadog-release-crgfj
[1] 5374
[2] 5375
[3] 5376
pod "my-datadog-release-crgfj" deleted
pod "my-datadog-release-pmlgs" deleted
pod "my-datadog-release-cluster-agent-78d889c7c7-z8nn6" deleted
pod "my-datadog-release-txhnd" deleted
[1]   Done                    kubectl delete pod my-datadog-release-pmlgs
[2]-  Done                    kubectl delete pod my-datadog-release-cluster-agent-78d889c7c7-z8nn6
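
In hindsight, a tidier way to bounce everything after a secret change is a rollout restart of the chart’s workloads (assuming the default object names this release creates):

$ kubectl rollout restart daemonset/my-datadog-release
$ kubectl rollout restart deployment/my-datadog-release-cluster-agent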

I almost immediately saw data streaming in

/content/images/2024/03/ghrfixes-18.png

Overall, things look healthy

/content/images/2024/03/ghrfixes-19.png

A live tail of logs shows that it is working just fine

/content/images/2024/03/ghrfixes-20.png

It pains me a bit to see the Pod metrics, as I’m in a place at work today where a Datadog cluster agent that can expose memory consumption would be so helpful

/content/images/2024/03/ghrfixes-21.png

One thing I realized was the “cluster” field was set to “N/A” on my workloads.

Setting ‘datadog.clusterName’ in the Helm values fixed that.
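
That amounts to one more key in the values plus another helm upgrade; the value is simply whatever name you want the cluster to report as, something like:

datadog:
  clusterName: int33

$ helm upgrade my-datadog-release -f datadog-values.yaml --set targetSystem=linux datadog/datadog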

/content/images/2024/03/ghrfixes-22.png

Summary

Honestly, I’m not sure what we learned here. It could be that things don’t last forever, or that a distributed master is really the only way to keep a cluster going forever. Maybe “always take a backup” should be a way of life, or “replicate, replicate, replicate”.

I thought about changing Kubernetes distributions to RKE. I thought about setting up a distributed master (actually, I started down that path, only to find it required at least three hosts for quorum and I couldn’t get three to agree).

I think for now, there is value in replication. There is value in backups. There is value in documenting work. The fact that I pulled it back together in a few days from, frankly, this very blog (as my historical crib) and from values files I had set aside speaks to that. I had replicated key containers, so I wasn’t dead in the water.

Overall, it was a learning experience and if anything, I hope you enjoyed it as I documented the wild ride through to the end.

Kubernetes k3s GithubActions Datadog

Isaac Johnson

Cloud Solutions Architect

Isaac is a CSA and DevOps engineer who focuses on cloud migrations and devops processes. He also is a dad to three wonderful daughters (hence the references to Princess King sprinkled throughout the blog).
