Abhishek Sah

Go Contexts: What I Learned by Getting Things Wrong

March 01, 2026

If you write Go and work on distributed systems, you will use context in almost every function you write. It shows up in HTTP handlers, database calls, gRPC methods, and Kubernetes controllers. Most of us learn the basics early — pass a context, check for cancellation, move on. But the subtle parts? Those show up when things break in production and you spend a Thursday night staring at goroutine dumps.

I recently spent time studying Go contexts from the ground up. Not the documentation kind of study, but the kind where you write code, make mistakes, and then figure out why things went wrong. This post captures what I learned, including the parts that surprised me.


1. Cancelling Work from the Outside

The simplest use of context is cancellation. You have a function doing work, and you want the caller to be able to say “stop.” The function itself does not decide when to stop. The caller does.

ctx, cancel := context.WithCancelCause(context.Background())

This gives you a context and a cancel function. When you call cancel(someError), the context’s Done() channel closes, and anything watching it knows to wrap up. The error you pass in is the cause, and you can retrieve it later with context.Cause(ctx).

Here is a small worker I built while learning. It wraps a function and runs it in a loop, with support for cancellation and cleanup callbacks:

var (
    ErrManual = errors.New("worker stopped manually")
    ErrFailed = errors.New("worker function failed")
)

type Worker struct {
    fn         func() error
    ctx        context.Context
    cancel     func(cause error)
    afterFuncs []func()
}

func NewWorker(fn func() error) *Worker {
    return &Worker{fn: fn}
}

func (w *Worker) Start() {
    if w.ctx != nil {
        return // already started
    }
    w.ctx, w.cancel = context.WithCancelCause(context.Background())
    for _, fn := range w.afterFuncs {
        context.AfterFunc(w.ctx, fn)
    }
    go w.work()
}

func (w *Worker) AfterStop(fn func()) {
    if w.ctx != nil {
        return // too late to register cleanup once started
    }
    w.afterFuncs = append(w.afterFuncs, fn)
}

func (w *Worker) Stop() {
    if w.ctx == nil {
        return // never started
    }
    w.cancel(ErrManual)
}

func (w *Worker) Err() error {
    if w.ctx == nil {
        return nil // never started, so no cause to report
    }
    return context.Cause(w.ctx)
}

The idea is simple. The caller creates a worker with a function, registers any cleanup callbacks with AfterStop, and then calls Start. The worker creates its own context and cancel function, registers the cleanup callbacks using context.AfterFunc, and launches the work loop in a goroutine. When someone calls Stop, it cancels the context with ErrManual. If the function itself fails, the loop cancels the context with ErrFailed. Either way, the registered AfterFunc callbacks run after cancellation.

The work loop looks like this:

func (w *Worker) work() {
    for {
        select {
        case <-w.ctx.Done():
            return
        default:
            err := w.fn()
            if err != nil {
                w.cancel(ErrFailed)
                return
            }
        }
    }
}

The worker checks ctx.Done() at the top of each loop. If the context is cancelled, it returns. Otherwise, it runs the function.

This works, but there is a catch.

Catch: The first cancel wins

If you call cancel more than once with different causes, only the first one counts. The rest are silently ignored.

ctx, cancel := context.WithCancelCause(context.Background())
cancel(errors.New("first"))
cancel(errors.New("second"))
fmt.Println(context.Cause(ctx)) // prints: first

This matters when you have multiple reasons a piece of work might stop. Say your worker fails internally with ErrFailed, and then someone also calls Stop() with ErrManual. The cause will be ErrFailed, because that happened first. If you are debugging a failure and checking the cause, you need to know that the cause reflects whoever got there first, not whoever you expected.

2. Cancellation Is Cooperative, Not Preemptive

Cancellation in Go is often misunderstood as something that happens automatically. It does not. Go has no mechanism to kill a goroutine from the outside. When you cancel a context, all you are doing is closing a channel. If the code running inside the goroutine never checks that channel, it will keep running as if nothing happened.

Let's look at this worker loop again:

select {
case <-w.ctx.Done():
    return
default:
    err := w.fn()
}

If w.fn() takes 30 seconds to run, and you cancel the context at second 2, the goroutine does not exit at second 2. It exits after 30 seconds, when fn returns and the loop gets back to the select statement. The cancellation was sitting there in the Done() channel the whole time. Nobody was listening.

This is a fundamental design decision in Go. The language gives you the tools to signal cancellation, but it is your responsibility to check for it. If your function does not look at the context, it will not respond to cancellation.

The fix: pass the context down

If your function does something slow — an HTTP call, a database query, a file transfer — it should accept a context.Context and pass it to whatever it calls.

// Instead of this:
fn func() error

// Do this:
fn func(ctx context.Context) error

Now, inside fn, if you are making an HTTP request, you use http.NewRequestWithContext(ctx, ...). If the context is cancelled, the HTTP client aborts the request. The cancellation propagates through the entire call chain, from the top-level caller all the way down to the network socket. That is the whole point of context — it flows downward through your program.

Catch: Do not spawn goroutines just to wrap cancellation

You might think: “I will launch fn in a goroutine, and if the context is cancelled, I will just abandon it.” The problem is that the goroutine is still running. Nobody is waiting for its result. It is leaked. You traded one problem (slow cancellation) for another (resource leaks). The right approach is always to make the function itself aware of the context.

3. Sequential Operations and Context in Kubernetes Operators

If you have written a Kubernetes operator, you have seen this pattern. The controller-runtime framework calls your Reconcile function whenever a custom resource changes. It passes you a context. That context gets cancelled if the operator pod is shutting down — during a rolling deploy, a node drain, or a leader election change.

Say your reconciler does three things in sequence:

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    meta, err := r.fetchMetadata(ctx, req.Name)
    if err != nil { return ctrl.Result{}, err }

    params, err := r.validateParams(ctx, meta)
    if err != nil { return ctrl.Result{}, err }

    err = r.pushConfig(ctx, params)
    if err != nil { return ctrl.Result{}, err }

    return ctrl.Result{}, nil
}

Each call takes 2 to 5 seconds. Now your operator pod gets a SIGTERM during a rolling update. You are halfway through validateParams.

What happens is straightforward once you understand cooperative cancellation. The context gets cancelled by the framework. But validateParams is already running. It does not stop mid-execution just because the context flipped. It finishes, returns an error (assuming it checks the context internally), and then the error check prevents pushConfig from running. If validateParams does not check the context, it runs to completion, returns a result, and then pushConfig gets called with a cancelled context.

The important thing here is that you must check errors between each call. If you ignore the error from fetchMetadata and proceed to validateParams, you are doing work with potentially invalid data on a cancelled context. The context cancellation only helps if two things are true: the function you called respects the context, and you check the error it returns.

Catch: Partial completion and idempotency

What if metadata was fetched and validation passed, but pushConfig failed because the context was cancelled? The next reconcile starts from scratch. Is that a problem?

In most operators, no. The standard approach is to make reconciliation idempotent. You store progress in the resource’s status subresource. Each reconcile checks current state against desired state and does only what is needed. If metadata is already stored in status, the next reconcile skips that step. The status acts as your checkpoint.

This is why Kubernetes operators lean on status subresources so heavily. They are not just for reporting — they are the mechanism that makes retries cheap and safe.

4. Parent-Child Context Chains and Cause Propagation

Contexts form a tree. You start with context.Background(), then create children from it, and children from those children. When a parent is cancelled, all of its children are cancelled too.

But here is the part that caught me off guard: the cause propagates from parent to child.

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

childCtx, childCancel := context.WithCancelCause(ctx)
defer childCancel(errors.New("cleanup"))

When the 5-second timeout fires, the parent is cancelled with context.DeadlineExceeded. The child gets cancelled too, because its parent is done. Now, what does context.Cause(childCtx) return?

I expected it to return "cleanup", because that is what I passed to childCancel. But it returns context.DeadlineExceeded. The parent’s timeout cancelled the child before childCancel was ever called. And since the first cancel wins, the deferred childCancel(errors.New("cleanup")) is a no-op — the cause was already set by the parent.

Catch: A child’s cancel function only matters if the child is cancelled before its parent

If you are relying on context.Cause to distinguish between “we timed out” and “we were explicitly stopped,” you need to make sure the explicit stop happens first. Otherwise, the parent’s cause will already be set.

5. context.Background() Is the Root

A quick note that seems obvious but is worth stating: context.Background() can never be cancelled. It has no deadline, no cancel function, and no values. It is the starting point of every context tree. Same goes for context.TODO().

ctx := context.Background()
fmt.Println(ctx.Err())          // <nil>
fmt.Println(context.Cause(ctx)) // <nil>

You create all other contexts from this root. Nothing cancels it.

6. Fire-and-Forget Goroutines and Context Lifetimes

This one shows up in almost every HTTP service I have seen. You have a handler that does some work and then kicks off a background task:

func handler(w http.ResponseWriter, r *http.Request) {
    data, err := fetchFromDB(r.Context(), query)
    if err != nil { return }

    go sendToKafka(r.Context(), data)  // fire and forget

    w.Write(response)
}

The bug: when the handler returns, the request context is cancelled. Your Kafka goroutine is using that same context. The Kafka write races against context cancellation, and most of the time, it loses. Your audit log silently drops messages, and nobody notices until someone asks “why are there gaps in the logs?”

The fix is to use a different context for work that must outlive the request:

bgCtx, bgCancel := context.WithTimeout(context.Background(), 10*time.Second)
go func() {
    defer bgCancel()
    sendToKafka(bgCtx, data)
}()

This context is not tied to the request. It has its own timeout, and it will not be cancelled when the handler returns.

Catch: Request contexts die when the handler returns

Any goroutine that needs to outlive the HTTP request must not use the request’s context. Create a new one from context.Background() with its own timeout.

7. Graceful Shutdown: Structuring the Context Hierarchy

When your service receives a SIGTERM — during a deploy, a node drain, or a scaling event — you need every subsystem to stop cleanly. Contexts give you the structure for this.

Say your service has three subsystems: an HTTP server, a Kafka consumer, and a background reconciler. Here is how you wire them up:

func main() {
    parentCtx, cancel := context.WithCancelCause(context.Background())
    defer cancel(nil) // release resources if we exit without ever seeing a signal

    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
    go func() {
        sig := <-sigCh
        cancel(fmt.Errorf("received signal: %v", sig))
    }()

    g, gCtx := errgroup.WithContext(parentCtx)
    g.Go(func() error { return httpServer.Run(gCtx) })
    g.Go(func() error { return kafkaConsumer.Run(gCtx) })
    g.Go(func() error { return reconciler.Run(gCtx) })

    if err := g.Wait(); err != nil {
        log.Printf("shutdown: %v", err)
    }
}

When SIGTERM arrives, the parent context is cancelled, which propagates to all three subsystems. The errgroup waits for all of them to finish. You do not exit the process until every subsystem has confirmed it is done.

Catch 1: Signal handling catches everything if you are not specific

signal.Notify(ch) with no signal arguments will catch every signal your process receives, including harmless ones like SIGCHLD and SIGURG. Always be explicit about which signals you care about.

Catch 2: Cleanup operations need a live context

When your Kafka consumer’s context is cancelled, it needs to commit the offset of the last message it processed. But if you use the cancelled context to commit, the commit will fail — the context is already done.

func (k *KafkaConsumer) Run(ctx context.Context) error {
    for {
        msg, err := k.reader.FetchMessage(ctx) // respects cancellation
        if err != nil {
            return err
        }
        k.process(msg)
        // Commit with a fresh context: this must succeed even mid-shutdown.
        if err := k.reader.CommitMessages(context.Background(), msg); err != nil {
            return err
        }
    }
}

The fetch uses the cancellable context — that is how the consumer knows to stop pulling new messages. But the commit uses context.Background(), because the commit must go through regardless of whether the consumer is shutting down.

8. Fan-Out: Cancelling the Stragglers

Here is a pattern that shows up in API gateways and aggregation services. You call five downstream services in parallel. You need any three to succeed. Once you have three, cancel the remaining two.

The natural instinct is to cancel the parent context. But that is wrong. The parent context is your request context — cancelling it would also cancel your ability to write the response. Instead, create a child context specifically for the fan-out:

func aggregate(w http.ResponseWriter, r *http.Request) {
    parentCtx := r.Context()

    fanoutCtx, cancelFanout := context.WithCancel(parentCtx)
    defer cancelFanout()

    results := make(chan Result, 5) // buffered to prevent goroutine leaks

    for _, svc := range services {
        go func(s Service) {
            val, err := s.Call(fanoutCtx)
            results <- Result{val: val, err: err}
        }(svc)
    }

    var successes []Result
    for i := 0; i < 5; i++ {
        select {
        case <-parentCtx.Done():
            respond(w, successes) // return whatever we have
            return
        case res := <-results: // "res", not "r" — avoid shadowing the request
            if res.err == nil {
                successes = append(successes, res)
                if len(successes) == 3 {
                    cancelFanout() // cancel remaining, NOT the parent
                    respond(w, successes)
                    return
                }
            }
        }
    }
    respond(w, successes)
}

Catch 1: Buffer your channels

The channel must be buffered with capacity equal to the number of goroutines. After you cancel and return, the remaining goroutines will eventually finish and try to write their results. If the channel is unbuffered, they block forever. That is a goroutine leak. A buffered channel lets them write and exit, even though nobody reads those results.

Catch 2: Do not cancel the parent to stop children

Create a dedicated child context for the group of work you want to cancel. The parent context should stay alive for your own operations, like writing the HTTP response.


Contexts are a small API. A handful of functions and a single interface. But the patterns that emerge from them — cancellation chains, graceful shutdown, fan-out coordination — are the backbone of how Go services manage work. Getting them right is the difference between a service that shuts down cleanly and one that leaks goroutines, drops messages, and leaves you debugging at 2 AM.

