Golang contexts and blocking functions

Learn how to prevent intermittent application stalls with Go contexts: signal in a function's signature that it can block, so that callers can take the necessary precautions.

Written By
Ori Shoshan
Published Date
May 06 2022
Read Time
9 minutes

Once you've been programming long enough, you're bound to encounter issues where an application intermittently gets stuck for no obvious reason. With the root cause found and the issue resolved, you might ask yourself "how can I keep this kind of bug from happening in the future?"


In this post I suggest a possible method for preventing this kind of issue with the help of the compiler: a function indicates in its signature that it can block by taking a Context argument, which lets the caller take the necessary precautions to avoid blocking for too long (or at all).


Similarity to the convention of errors as return values


In Go, whenever you call a function that returns an error, you must check for an error — or risk the function having not done what it was supposed to do. If you handle the error, then all's fine and well: you need not propagate it to your callers. If you don't handle it then, by convention, you simply propagate it up by returning the error.


For example, os.Getenv - func Getenv(key string) string cannot fail: it either returns the value of the environment variable, or an empty string if the environment variable did not exist. On the other hand, http.Get - func Get(url string) (resp *Response, err error) can fail: if it fails, you should handle the failure or tell your caller by returning an error yourself.


This means that, when you write a function that calls other functions that can return errors, you are forced to explicitly make this choice: handle the error or propagate it to your callers.
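As a minimal sketch of that choice (readSetting and readSettingOr are hypothetical names, not part of any library), the first function below propagates the error and the second handles it, which is why only the first has an error in its signature:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// readSetting propagates the failure: the error in its signature tells
// callers they must decide what a missing file means.
func readSetting(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("reading %s: %w", path, err)
	}
	return string(data), nil
}

// readSettingOr handles the failure itself, so its signature has no error
// and its callers have nothing left to decide.
func readSettingOr(path, fallback string) string {
	value, err := readSetting(path)
	if err != nil {
		return fallback
	}
	return value
}

func main() {
	// Hypothetical path that is assumed not to exist.
	const path = "/no-such-dir-for-this-example/setting"

	_, err := readSetting(path)
	fmt.Println(errors.Is(err, os.ErrNotExist))
	fmt.Println(readSettingOr(path, "default"))
}
```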


With IO or blocking operations, a similar complexity emerges: absent documentation, you could call a function and not know how it would behave. Can it block? For how long? If it can block, how do you set a timeout? How do you cancel an ongoing operation? You can only answer these questions through the documentation or by reading the code. If you rely on the documentation, it might not be up to date: some subtle change made after it was written could have made the function potentially blocking, and you won't know, because the function signature itself tells you nothing about this.


Contexts


You can use contexts, as in Context from the context package, to surface the complexity of your function performing some potentially blocking operation. They can also allow it to be canceled and to specify a timeout, forcing the caller to handle the possibility of your function taking a variable and unknown amount of time to return.


When is a Context useful


Imagine you were asked to implement a mechanism that reports logged errors to an external error tracking service, such as Bugsnag or Sentry, but the requirement is that it only report errors from production.


Your codebase has a configuration package that uses environment variables to determine the current configuration. You decide to add a function that tells you whether errors should be reported:

func ShouldReportErrors() bool {
  // Reads only the process environment: no IO that could block or fail.
  return os.Getenv("SHOULD_REPORT_ERRORS") == "true"
}


As it is, this function can never block or fail — so it does not return an error (cannot fail), and it also does not take a context.Context parameter (cannot block).


func ReportError(err error) {
  if !configuration.ShouldReportErrors() {
    return
  }
  // ... report error asynchronously ...
}


Next, you've implemented a logging hook that sends logged errors to Sentry or Bugsnag, but only does so for production environments. You call ReportError from within your logging hook when you detect an error that should be logged. ReportError then checks if it should report errors (using the environment variable), and if so, takes care to report the error in a manner that does not block the logging hook.


At this point ReportError cannot block, so it is safe for use from within your logging hook — which is called from every component that logs errors in your codebase, including components where blocking would be harmful.


What happens if ShouldReportErrors later becomes blocking?


Later, another engineer, perhaps in a different team, is tasked with changing the configuration package so that it fetches all configurations from a centralized configuration service instead of environment variables. A possible naive implementation would use the built-in http.Get to fetch the value from that centralized configuration service:


func ShouldReportErrors() bool {
  environment := os.Getenv("ENV")
  if environment == "" {
    environment = "dev"
  }
  // http.Get takes no context: nothing in this signature hints that it can block.
  res, err := http.Get(fmt.Sprintf("http://configuration/%s/report-errors-external", environment))
  if err != nil {
    // log ...
    return false
  }
  defer res.Body.Close()
  if res.StatusCode != http.StatusOK {
    // log ...
    return false
  }
  var result bool
  if err := json.NewDecoder(res.Body).Decode(&result); err != nil {
    // log ...
    return false
  }
  return result
}


All hell breaks loose


Let's look at an example case: A function handles a user request, and an error occurs. You log it using your logging framework: logger.WithError(err).Error(…) and the code continues to the next request.


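Initially, everything works fine because http.Get will return quickly, assuming the configuration service is local and functioning. But what happens when the configuration service has an issue? http.Get uses http.DefaultClient, which sets no overall timeout: the default transport gives up on connecting only after 30s, and a service that accepts the connection but never answers can hold the call even longer…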


So now you have: myFunction() → logging framework → error report logging hook → ReportError → ShouldReportErrors, which blocks for 30s or more if the configuration service is unresponsive. Where normally this would log and continue, now every logged error stalls its caller for that long. This quickly grinds your production system to a halt as any request hitting this kind of log is blocked, and memory consumption balloons as the number of concurrent requests increases at the rate of incoming requests.


And all of this happened while:


- The author of the HTTP endpoint handler thought they were just logging.

- The author of the bug reporting code thought they were taking care to do the reporting asynchronously — and that the configuration code could not block.

- The author of the new, centralized configuration code did not realize their code would be called from literally everywhere in the codebase, through the logging hook.


None of them had the full picture, and none could reasonably be expected to be aware of all of this. Consider a codebase hundreds to thousands of files large, perhaps over multiple repositories and many teams. How could any one developer be expected to consistently deal with all of that complexity, current and future, when it is hiding behind so many abstractions, such as a logging framework?


The problem here is that this function, through its signature, tells you nothing about whether it could block. This specific implementation would block for 30 seconds or more if the configuration service on the other end was unresponsive (http.Get sets no overall timeout; 30 seconds is only the default transport's connection timeout). Even worse, you might make this change, and code relying on this function would keep compiling and working, even if it relies on the previous behavior of never blocking (maybe it holds a lock?).


So you find the issue, fix it, and then ask yourself: how can I prevent this from happening again? How can I help other code authors see that this can happen when they write code, but without asking them to read a lot of code?


What can you do?


You can surface this complexity to the caller by adding a context.Context argument. The caller must then pass you a context that specifies when the operation should be canceled. If they want the operation to time out, they could use context.WithTimeout. They cannot call this function without making the choice of which context to pass you: the compiler will force them to deal with this complexity.


You might adapt ShouldReportErrors to use a Context this way:


func ShouldReportErrors(ctx context.Context) (bool, error) {
  environment := os.Getenv("ENV")
  if environment == "" {
    environment = "dev"
  }
  url := fmt.Sprintf("http://configuration/%s/report-errors-external", environment)
  req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
  if err != nil {
    // log ...
    return false, err
  }
  // The request is bound to ctx: canceling ctx or reaching its deadline aborts it.
  res, err := http.DefaultClient.Do(req)
  if err != nil {
    // log ...
    return false, err
  }
  defer res.Body.Close()
  if res.StatusCode != http.StatusOK {
    // log ...
    return false, fmt.Errorf("unexpected status from configuration service: %s", res.Status)
  }
  var result bool
  if err := json.NewDecoder(res.Body).Decode(&result); err != nil {
    // log ...
    return false, err
  }
  return result, nil
}


Now, the responsibility to tie any blocking operations to the context, and allow them to be terminated, lies with you, and not with the caller. The caller's responsibility ends with understanding this code may block, and specifying restrictions for the blocking using the context argument. Most importantly, they do not have to worry about the internals of your implementation, or your dependencies.


Returning the error leaves it up to the caller to decide what to do if fetching the value fails due to timeout or another error. Knowing that the function can block and fail, they might decide to try to fetch this configuration value just once and cache it, if they judge their code to be sensitive to blocking here.


If you wanted to make the implementation never fail, you could decide on some strategy for what value to return if you get an error. For example, you could return false on every error, and also cache the result on an error so you fail fast. Either way, the context forces you to respect the caller's wishes: you mustn't block if the context is canceled or its timeout elapses.


This change also makes every call written against the original signature fail to compile (since it is now missing the context argument), which forces you to do one of the following:


- Add a version of the function which does not block, if possible.

- If this function is part of a library others may use, make it known that this version of the library introduces a breaking API change (this function can now block, whereas before it couldn't), perhaps through a new major version and release notes.

- If this function is part of a monolithic repo you are working on, you can alter existing code so that it now deals with the complexity of this function blocking and compiles again.


Conclusion


The above is a specific example of a general case where Context is useful to detect potentially blocking functions. In this example, you might have been able to read through the code and find the issue. I’ve found that, often, blocking functions can be hidden behind many layers of abstraction and, sometimes, async code, making it incredibly hard (time-consuming!) to locate the source of the problem. Using a Context can make it trivial.


In general, I like using the compiler as a tool to surface and handle complexity. I feel that a Context is a really powerful tool that isn't used enough in third party libraries I've seen.


Contexts let the caller group multiple operations into a single timeout, e.g.: call 3 blocking functions, and say "all of these should complete within 3 seconds", without implementing any complex logic. In addition, you can use child contexts to stop components, all of their goroutines and their in-flight operations, by canceling the parent context and without worrying about which underlying components are there.


I hope this proves useful to you in your code. If it does, or if you think I'm missing something, or perhaps if you know about another tool that could be used to protect against similar bugs, feel free to let me know via our Slack or my email: [email protected]. I'm eager to hear about it!
