The (Hidden?) Costs of Vertex AI Resource Pools: A Cautionary Tale


In the article Custom model training & deployment on Google Cloud using Vertex AI in Go, we explored how to use Go to create a resource pool and train a machine learning model on Vertex AI's allocated resources. While this approach offers flexibility, there's a crucial aspect to consider: the cost implications of resource pools.

This article details my experience with a sudden price increase in Vertex AI and the hidden culprit – a seemingly innocuous resource pool.

An Unexplained Cost Surge

Vertex AI price report for April and March

The green color in the graph represents the Vertex AI expenditure. The image above shows that something happened around the 8th of March. During that period I was only working on the dashboard of fitsleepinsights.app, so definitely nothing changed in the infrastructure, the code, or the load on the Vertex AI services I was using. Yet the graph clearly shows an increase of more than 500% in the cost of Vertex AI: I was spending less than 1 €/day until the 8th of March, and from that day onward I started spending more than 5 €/day.

Reaching out to Google Cloud support proved unhelpful. They couldn’t pinpoint the reason behind the cost increase. Left to my own devices, I embarked on a multi-day investigation through various Vertex AI dashboards.

The Culprit Revealed (with a Glitch)

The first hint came from the pricing dashboard, in the cost table breakdown.

Cost table for March

The table shows that the Vertex AI costs come mainly from two Online/Batch prediction resources:

  • The Instance Core running in EMEA for AI platform
  • The Instance RAM running in EMEA for AI platform

All the other costs are negligible. Unfortunately, there's no detailed breakdown of where these charges come from; it's only clear that something in the online/batch prediction service is consuming cores (CPUs) and RAM.

In Custom model training & deployment on Google Cloud using Vertex AI in Go, we created a resource pool for training our models. After digging through the online/batch prediction dashboards of Vertex AI, I stumbled upon the resource pool dashboard – the potential culprit.

Unfortunately, displaying the dashboard resulted in an error 😒

Vertex AI resource pool error

So, no dashboard is available. Lucky me? To rule out something flooding the deployed endpoints (even though the logs were clean), I deleted all the deployed models and endpoints anyway, around mid-March; a sketch of that cleanup follows. As the graph at the beginning of the article shows, nothing changed. The resource pool therefore became the prime suspect.
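For reference, here is a minimal sketch of that cleanup (not the exact code I ran), assuming the same vai/vaipb package aliases and the _vaiEndpoint, _vaiProjectID, and _vaiLocation variables that appear in the deletion snippet below: it undeploys every model from every endpoint, then deletes the endpoints themselves.

var endpointClient *vai.EndpointClient
if endpointClient, err = vai.NewEndpointClient(ctx, option.WithEndpoint(_vaiEndpoint)); err != nil {
  return err
}
defer endpointClient.Close()

it := endpointClient.ListEndpoints(ctx, &vaipb.ListEndpointsRequest{
  Parent: fmt.Sprintf("projects/%s/locations/%s", _vaiProjectID, _vaiLocation),
})
for {
  endpoint, err := it.Next()
  if errors.Is(err, iterator.Done) {
    break
  }
  if err != nil {
    return err
  }
  // An endpoint can't be deleted while models are still deployed on it.
  for _, deployed := range endpoint.GetDeployedModels() {
    var undeployOp *vai.UndeployModelOperation
    if undeployOp, err = endpointClient.UndeployModel(ctx, &vaipb.UndeployModelRequest{
      Endpoint:        endpoint.GetName(),
      DeployedModelId: deployed.GetId(),
    }); err != nil {
      return err
    }
    if _, err = undeployOp.Wait(ctx); err != nil {
      return err
    }
  }
  // Now the endpoint itself can go away.
  var deleteOp *vai.DeleteEndpointOperation
  if deleteOp, err = endpointClient.DeleteEndpoint(ctx, &vaipb.DeleteEndpointRequest{
    Name: endpoint.GetName(),
  }); err != nil {
    return err
  }
  if err = deleteOp.Wait(ctx); err != nil {
    return err
  }
}

Deleting the model resources themselves follows the same list/delete pattern with vai.NewModelClient and its ListModels/DeleteModel methods.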

Taking Back Control: Deleting the Resource Pool (Go Code Included)

The resource pool, likely active for months, might have been unknowingly incurring charges. After all, I couldn’t find any documentation regarding a free tier for resource pools.

To curb the runaway costs and return to normalcy (ideally, zero cost since no resources were actively used), I had to delete the resource pool. Here’s the Go code that did the trick:

ctx := context.Background()

// Package aliases assumed here:
//   vai   "cloud.google.com/go/aiplatform/apiv1"
//   vaipb "cloud.google.com/go/aiplatform/apiv1/aiplatformpb"
// plus "google.golang.org/api/option", "google.golang.org/api/iterator", and "errors".
var resourcePoolClient *vai.DeploymentResourcePoolClient
if resourcePoolClient, err = vai.NewDeploymentResourcePoolClient(ctx, option.WithEndpoint(_vaiEndpoint)); err != nil {
  log.Error("error creating resource pool client: ", err)
  return err
}
defer resourcePoolClient.Close()

deploymentResourcePoolId := "resource-pool"

iter := resourcePoolClient.ListDeploymentResourcePools(ctx, &vaipb.ListDeploymentResourcePoolsRequest{
  Parent: fmt.Sprintf("projects/%s/locations/%s", _vaiProjectID, _vaiLocation),
})

// Iterate over all the deployment resource pools in the project/location;
// iter.Next() returns the iterator.Done sentinel once the listing is exhausted.
for {
  item, err := iter.Next()
  if errors.Is(err, iterator.Done) {
    break
  }
  if err != nil {
    log.Error("error listing deployment resource pools: ", err)
    return err
  }
  if strings.Contains(item.GetName(), deploymentResourcePoolId) {
    log.Print("Found deployment resource pool: ", item.GetName())

    // Delete the resource pool and wait for the long-running operation to complete.
    var deleteResourcePoolOp *vai.DeleteDeploymentResourcePoolOperation
    if deleteResourcePoolOp, err = resourcePoolClient.DeleteDeploymentResourcePool(ctx, &vaipb.DeleteDeploymentResourcePoolRequest{
      Name: item.GetName(),
    }); err != nil {
      log.Error("Error deleting deployment resource pool: ", err)
      return err
    }
    if err = deleteResourcePoolOp.Wait(ctx); err != nil {
      log.Error("Error waiting for deployment resource pool deletion: ", err)
      return err
    }
    log.Print("Deleted deployment resource pool: ", item.GetName())
    break
  }
}
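A note on the loop: the iterators returned by the Cloud client libraries signal the end of a listing with the iterator.Done sentinel from google.golang.org/api/iterator, so the idiomatic pattern is to keep calling Next() until that sentinel (or a real error) shows up.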

For the sake of completeness, I report below the code I used in the past to create the resource pool. This way, the article becomes a self-contained reference on managing the resource pool lifecycle with Vertex AI in Go.

How to create a resource pool:
var resourcePoolClient *vai.DeploymentResourcePoolClient
if resourcePoolClient, err = vai.NewDeploymentResourcePoolClient(ctx, option.WithEndpoint(vaiEndpoint)); err != nil {
    return err
}
defer resourcePoolClient.Close()

deploymentResourcePoolId := "resource-pool"
var deploymentResourcePool *vaipb.DeploymentResourcePool
iter := resourcePoolClient.ListDeploymentResourcePools(ctx, &vaipb.ListDeploymentResourcePoolsRequest{
    Parent: fmt.Sprintf("projects/%s/locations/%s", os.Getenv("VAI_PROJECT_ID"), os.Getenv("VAI_LOCATION")),
})
// Look for an existing pool with the given ID; iter.Next() returns the
// iterator.Done sentinel once the listing is exhausted.
for {
    item, err := iter.Next()
    if errors.Is(err, iterator.Done) {
        break
    }
    if err != nil {
        return err
    }
    fmt.Println(item.GetName())
    if strings.Contains(item.GetName(), deploymentResourcePoolId) {
        deploymentResourcePool = item
        fmt.Printf("Found deployment resource pool %s\n", deploymentResourcePool.GetName())
        break
    }
}

if deploymentResourcePool == nil {
    fmt.Println("Creating a new deployment resource pool")
    // Create a deployment resource pool: FOR SHARED RESOURCES ONLY
    var createDeploymentResourcePoolOp *vai.CreateDeploymentResourcePoolOperation
    if createDeploymentResourcePoolOp, err = resourcePoolClient.CreateDeploymentResourcePool(ctx, &vaipb.CreateDeploymentResourcePoolRequest{
        Parent:                   fmt.Sprintf("projects/%s/locations/%s", os.Getenv("VAI_PROJECT_ID"), os.Getenv("VAI_LOCATION")),
        DeploymentResourcePoolId: deploymentResourcePoolId,
        DeploymentResourcePool: &vaipb.DeploymentResourcePool{
            DedicatedResources: &vaipb.DedicatedResources{
                MachineSpec: &vaipb.MachineSpec{
                    MachineType:      "n1-standard-4",
                    AcceleratorCount: 0,
                },
                MinReplicaCount: 1,
                MaxReplicaCount: 1,
            },
        },
    }); err != nil {
        return err
    }

    if deploymentResourcePool, err = createDeploymentResourcePoolOp.Wait(ctx); err != nil {
        return err
    }
    fmt.Println(deploymentResourcePool.GetName())
}
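Rereading this creation code, the billing behavior makes sense: DedicatedResources with MinReplicaCount: 1 keeps at least one n1-standard-4 machine provisioned around the clock, and that machine is billed whether or not any model is deployed on the pool. At on-demand rates, an always-on n1-standard-4 costs roughly 4-5 €/day, which is in the same ballpark as the increase visible in the graph.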

The Sweet Relief of Reduced Costs

Following the deletion of the resource pool, my Vertex AI costs (green) plummeted back to almost zero (only a few cents remain, for the Gemini requests). This confirmed my suspicion – the resource pool was indeed the culprit behind the cost increase.

Vertex AI price report for April and March - highlighted the day of resource pool deletion

Conclusions

Resource pools in Vertex AI offer a convenient way to manage shared compute resources for training and deploying models. However, it’s crucial to understand their cost implications. Here are some key takeaways:

  • The resource pool documentation doesn't mention any free period, yet one apparently exists, since Vertex AI started charging for this service out of nowhere.
  • Resource pools incur charges even when not actively used for training or prediction.
  • Carefully monitor your Vertex AI resource pool usage to avoid unexpected cost increases (a small audit sketch follows this list).
  • Consider deleting unused resource pools to optimize your spending.
  • The resource pool dashboard is broken (or at least it was, and still is, on my account).
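To make the monitoring suggestion concrete, below is a minimal self-contained sketch (assuming only the standard cloud.google.com/go/aiplatform/apiv1 client and the VAI_PROJECT_ID and VAI_LOCATION environment variables from the creation snippet) that prints every deployment resource pool in a project and location, so a forgotten pool can't hide:

package main

import (
    "context"
    "errors"
    "fmt"
    "log"
    "os"

    vai "cloud.google.com/go/aiplatform/apiv1"
    vaipb "cloud.google.com/go/aiplatform/apiv1/aiplatformpb"
    "google.golang.org/api/iterator"
    "google.golang.org/api/option"
)

func main() {
    ctx := context.Background()
    projectID := os.Getenv("VAI_PROJECT_ID")
    location := os.Getenv("VAI_LOCATION")
    // Regional endpoint, e.g. europe-west6-aiplatform.googleapis.com:443.
    endpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)

    client, err := vai.NewDeploymentResourcePoolClient(ctx, option.WithEndpoint(endpoint))
    if err != nil {
        log.Fatal("error creating resource pool client: ", err)
    }
    defer client.Close()

    it := client.ListDeploymentResourcePools(ctx, &vaipb.ListDeploymentResourcePoolsRequest{
        Parent: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
    })
    for {
        pool, err := it.Next()
        if errors.Is(err, iterator.Done) {
            break
        }
        if err != nil {
            log.Fatal("error listing deployment resource pools: ", err)
        }
        // Every pool listed here bills for its dedicated machines,
        // whether or not any model is deployed on it.
        fmt.Println(pool.GetName(), "-", pool.GetDedicatedResources().GetMachineSpec().GetMachineType())
    }
}

Running it periodically (or in CI) is a cheap way to catch a forgotten pool before it shows up on the bill.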

By being mindful of these hidden costs, you can leverage Vertex AI resource pools effectively without jeopardizing your budget.

For any feedback or comments, please use the Disqus form below - Thanks!

