Analyzing tf.function to discover AutoGraph strengths and subtleties - part 2


In part 1 we learned how to convert TensorFlow 1.x code to its eager version, how to convert the eager version to its graph representation, and we faced the problems that arise when working with functions that create a state.

In this second part, we’ll analyze what happens when, instead of a tf.Variable, we pass a tf.Tensor or a Python native type as input to a tf.function-decorated function. Are we sure everything is going to be converted to the Graph representation we expect?

tf.function uses AutoGraph

For the sake of clarity, here is the complete signature of tf.function:

def function(func=None,
             input_signature=None,
             autograph=True,
             experimental_autograph_options=None)

The default value of the autograph parameter is True, hence tf.function uses AutoGraph by default. The documentation describes what happens when autograph is True or False.

Quoting the documentation:

  • When autograph is True, all Python code that depends on Tensor values is staged into a TensorFlow graph.
  • When autograph is False, the function is traced and control flow is not allowed to depend on data.

Thus, by default tf.function uses AutoGraph and we are going to analyze how it behaves by changing the input types and the function structure.
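To make the difference concrete, here is a minimal sketch of mine (assuming a TensorFlow 2.x environment; it is not taken from the documentation) showing the same kind of function with AutoGraph enabled and disabled:

import tensorflow as tf

@tf.function  # autograph=True by default
def relu_like(x):
    # data-dependent control flow: AutoGraph stages this `if` into the graph
    if x > 0:
        return x
    return tf.zeros_like(x)

@tf.function(autograph=False)
def doubled(x):
    # no data-dependent control flow: plain tracing is enough
    return x * 2

print(relu_like(tf.constant(-3.0)))  # tf.Tensor(0.0, shape=(), dtype=float32)
print(doubled(tf.constant(2.0)))     # tf.Tensor(4.0, shape=(), dtype=float32)

If relu_like were traced with autograph=False, the if on a symbolic tf.Tensor could not be evaluated as a Python boolean and the conversion would fail: this is what “control flow is not allowed to depend on data” means.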

Changing tf.Tensor input type

Let’s start by defining our test Python function. The type of the function parameters is of fundamental importance, since it is used to create a graph (a statically typed object) and to assign it an ID (for a complete and informal explanation of what’s going on when calling the function for the first time, see tf.function: layman explanation).

@tf.function
def f(x):
    print("Python execution: ", x)
    tf.print("Graph execution: ", x)
    return x

Here’s the brief function description:

  • The function accepts a Python variable x that can be literally anything.
  • The print call is executed only once, when the function is traced (function creation).
  • The tf.print call is executed every time the graph is evaluated.
  • x is returned.

Let’s see if everything goes as we expect by running some tests.

print("##### float32 test #####")
a = tf.constant(1, dtype=tf.float32)
print("first call")
f(a)
a = tf.constant(1.1, dtype=tf.float32)
print("second call")
f(a)

print("##### uint8 test #####")

b = tf.constant(2, dtype=tf.uint8)
print("first call")
f(b)
b = tf.constant(3, dtype=tf.uint8)
print("second call")
f(b)

Everything goes as we expect:

##### float32 test #####
first call
Python execution:  Tensor("x:0", shape=(), dtype=float32)
Graph execution:  1
second call
Graph execution:  1.1
##### uint8 test #####
first call
Python execution:  Tensor("x:0", shape=(), dtype=uint8)
Graph execution:  2
second call
Graph execution:  3

A graph is created for every different input type of the tf.Tensor object passed. We can have a look at the graph version of the function f by using the tf.autograph module.

tf.autograph.to_code(f.python_function)

returns a string that represents the graph code of the Python function f.

def tf__f(x):
  try:
    with ag__.function_scope('f'):
      do_return = False
      retval_ = None
      with ag__.utils.control_dependency_on_returns(ag__.converted_call(print, None, ag__.ConversionOptions(recursive=True, force_conversion=False, optional_features=ag__.Feature.ALL, internal_convert_user_code=True), ('Python execution: ', x), {})):
        tf_1, x_1 = ag__.utils.alias_tensors(tf, x)
        with ag__.utils.control_dependency_on_returns(ag__.converted_call('print', tf_1, ag__.ConversionOptions(recursive=True, force_conversion=False, optional_features=ag__.Feature.ALL, internal_convert_user_code=True), ('Graph execution: ', x_1), {})):
          x_2 = ag__.utils.alias_tensors(x_1)
          do_return = True
          retval_ = x_1
          return retval_
  except:
    ag__.rewrite_graph_construction_error(ag_source_map__)

The code is machine generated and therefore hard to read. However, we can notice something interesting: in the graph code we can still find a reference to the Python code that is executed only on the first call.

(rewritten and formatted for clarity)

with ag__.utils.control_dependency_on_returns(
        ag__.converted_call(
            print, None, ag__.ConversionOptions(
                recursive=True,
                force_conversion=False,
                optional_features=ag__.Feature.ALL,
                internal_convert_user_code=True),
            ('Python execution: ', x), {})
        ):

We can see that ag__.utils.control_dependency_on_returns creates a tf.control_dependencies context manager on the value returned by the function built by converted_call. This allows preserving the execution order in graph mode, forcing the nodes to execute sequentially.
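As an aside, here is a rough hand-written sketch of mine (much simpler than the AutoGraph-generated code above) of what such a control-dependency context achieves in graph mode:

@tf.function
def ordered(x):
    # during tracing, tf.print returns the operation that performs the printing
    first = tf.print("executed first: ", x)
    with tf.control_dependencies([first]):
        # the nodes created inside this context can run only after the tf.print node
        return x * 2

ordered(tf.constant(3))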

The converted_call function compiles a function call. It has all the information required to convert and execute a function call (print in this case), as we can see by analyzing its signature, (f, owner, options, args, kwargs), where:

  • f is the function we are interested in calling. In this case it is print; in the next converted_call (the graph execution one) it is the string 'print'.
  • owner is the function’s package or owner. In this case it is None, since print is the standard Python print function; in the next call it is tf_1, an alias for the tf package.
  • options are the conversion options.
  • args are the function f (print) positional arguments.
  • kwargs are the function f (print) named arguments.

Question Time:

Being a graph, why is there still a reference to the Python code that should be executed only on the first call, when tracing the function execution?

Hypothesis:

My guess is that, while tracing the function execution on the first call, there is no easy way to understand whether a Python function produces some side effect that will change the graph behavior; for this reason (and to be sure to preserve the execution order) every Python function invoked on the first call is traced and added to the graph.

If during the execution of the function (in this case a print, but it could be any arbitrarily complex function) a side effect is detected (a call to a tf.* method), then the graph code is updated; otherwise, as in this case, the operation is converted to a tf.no_op by converted_call.

This is only a guess, since we don’t see any output or side effect generated by the converted print call, while we do see them when using the tf.print function (now a graph node).

Since this behavior is not clear, and I’m only guessing what’s going on, I asked the TensorFlow Developers Group for an explanation; if you are curious about this behavior please subscribe to the mailing list and monitor that thread!

Using a Python native type

The function input is not constrained to be a tf.Tensor object; AutoGraph understands the input type and creates a new graph accordingly. Let’s see if there is any difference between using a tf.Tensor object (which has a dtype attribute and thus a well-defined type) and a Python native type.

Since in Python there are only three distinct numeric types (integers, floating point numbers, and complex numbers) we’ll test how tf.function behaves with all of them.

def printinfo(x):
  print("Type: ", type(x), " value: ", x)

print("##### int test #####")
print("first call")
a = 1
printinfo(a)
f(a)
print("second call")
b = 2
printinfo(b)
f(b)

print("##### float test #####")
print("first call")
a = 1.0
printinfo(a)
f(a)
print("second call")
b = 2.0
printinfo(b)
f(b)

print("##### complex test #####")
print("first call")
a = complex(1.0, 2.0)
printinfo(a)
f(a)
print("second call")
b = complex(2.0, 1.0)
printinfo(b)
f(b)

Unlike the tf.Tensor input case, the behavior is not what we expect:

##### int test #####
first call
Type:  <class 'int'>  value:  1
Python execution:  1
Graph execution:  1

second call
Type:  <class 'int'>  value:  2
Python execution:  2
Graph execution:  2

##### float test #####
first call
Type:  <class 'float'>  value:  1.0
Graph execution:  1
second call
Type:  <class 'float'>  value:  2.0
Graph execution:  2

##### complex test #####
first call
Type:  <class 'complex'>  value:  (1+2j)
Python execution:  (1+2j)
Graph execution:  (1+2j)
second call
Type:  <class 'complex'>  value:  (2+1j)
Python execution:  (2+1j)
Graph execution:  (2+1j)

We expect one graph per input type, but instead, if we look carefully, we have a different graph per input value. In fact:

  1. The first call f(1) executes the Python code, traces its execution, creates a Graph and executes it.
  2. The second call f(2) executes the Python code again, traces its execution, creates a Graph and executes it.
  3. The first call f(1.0) does not execute the Python code, therefore it is using an already existing graph.
  4. The second call f(2.0) does not execute the Python code, therefore it behaves like 3.
  5. The first call f(1 + 2j) behaves like 1.
  6. The second call f(2 + 1j) behaves like 2.

This is weird.

Are we sure that tf.function is creating a different Graph for every input value? Let’s check it.

The function f returns the input parameter x, thus we can check whether the type returned matches the type passed. Since the graphs for the integer values 1 and 2 were defined first, let’s see if the calls f(1.0) and f(2.0) return, as we suspect, an integer value or not.

ret = f(1.0)
if tf.float32 == ret.dtype:
    print("f(1.0) returns float")
else:
    print("f(1.0) return ", ret)

The output is

Graph execution:  1
f(1.0) return  tf.Tensor(1, shape=(), dtype=int32)

We can conclude that, when using Python native types, the ID associated with the graph is built from the input parameter’s value; since in Python 1.0 == 1, this causes the weird behavior above.

Warning: this is highly inefficient, since every time a tf.function-decorated function is called with a different input value, the Python execution, the tracing, and the graph creation must all be executed again, making the graph conversion useless.

Performance measurement

The following code is a simple benchmark to check if the previous reasoning is correct.

import time

@tf.function
def g(x):
  return x

start = time.time()
for i in tf.range(1000):
  g(i)  # i is always a tf.Tensor with dtype tf.int32: one graph, reused
end = time.time()

print("tf.Tensor time elapsed: ", (end-start))

start = time.time()
for i in range(1000):
  g(i)  # i is a Python int: a new value means a new trace and a new graph
end = time.time()

print("Native type time elapsed: ", (end-start))

The g function, decorated with tf.function, is first executed in a loop over tf.Tensor objects, all created with the same dtype (tf.int32) by the tf.range call; the second time it is executed in a loop over Python integers.

The benchmark confirms the hypothesis:

tf.Tensor time elapsed:  0.41594886779785156
Native type time elapsed:  5.189513444900513

Conclusion: do use tf.Tensor everywhere.

AutoGraph is highly optimized and works well when the input is a tf.Tensor object, while it creates a new graph for every different input parameter value, with a huge drop in performance.
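A practical workaround, when the values really do come from Python code, is to wrap them in a tf.Tensor before the call, or to pin the accepted signature through the input_signature parameter we saw in the tf.function signature above. A hedged sketch of mine (h is just a hypothetical example function):

@tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.float32)])
def h(x):
    return x

h(tf.constant(1.0))  # first call: traces the function and builds the graph
h(tf.constant(2.0))  # same signature: the graph is reused

# alternatively, wrap native values explicitly before calling
for i in range(10):
    h(tf.constant(float(i)))  # still a single graph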

Is tf.function really using AutoGraph?

The signature of tf.function clearly states that autograph defaults to True, therefore we expect the function to be converted by using tracing + AutoGraph; let’s check whether this is true.

While experimenting, I found out that there are functions (for example, functions without a return value) that tf.autograph.to_code can’t convert, but that work correctly when decorated with tf.function and called.

Let’s just update the function f by removing the return statement.

@tf.function
def f(x):
    print("Python execution: ", x)
    tf.print("Graph execution: ", x)

If called twice it works correctly (f(1) followed by f(1)): it executes the print function only on the first call. On every non-first call, it prints only the tf.print output.

This is the behavior we expect when a function is traced first, then graph converted, then an ID is assigned to the graph and the graph is being re-used.
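For completeness, this is the double call I mentioned above:

f(1)  # first call: prints both "Python execution" and "Graph execution"
f(1)  # same value, same graph: prints only the "Graph execution" line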

Something weird happens if we try to look at the code generated by the AutoGraph to_code method, as we did before; in fact

tf.autograph.to_code(f.python_function)

raises the following exception:

ValueError during conversion: Unable to insert statement into the computation flow: it is not followed by any computation which the statement could gate.

Question time

So the question is: shouldn’t tf.function use tf.autograph.to_code / tf.autograph.to_graph under the hood? How is the graph being built if to_code fails?

Hypothesis:

It seems that tf.function builds and stores a graph correctly, since on every non-first call we get only the tf.print output. However, tf.autograph.to_code fails, raising a ValueError; if tf.function uses tf.autograph internally, this exception is probably caught inside the tf.function body and handled somehow in order to simulate the graph execution.

This is just a guess, and in the previously linked thread I’ve asked for an additional explanation of this behavior; as for the previous question, if you are curious please subscribe to the mailing list and monitor that thread!
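In the meantime, one way to probe what happens (a check of mine, assuming the TensorFlow 2.x get_concrete_function API) is to ask the decorated function for the concrete function it traced: if a graph comes back even though to_code raises, tf.function is clearly building it through some other path.

concrete = f.get_concrete_function(tf.constant(1))
# the traced graph exists and contains the tf.print node, even though
# tf.autograph.to_code raises a ValueError for the very same Python function
print(len(concrete.graph.as_graph_def().node))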

Conclusions

In this article we analyzed the behavior of tf.function when AutoGraph is used and discovered that a @tf.function decorated function:

  • behaves as we expect if we use a tf.Tensor as input type (please use tf.Tensor everywhere!),
  • creates a new graph (or behaves as if it were creating one) if the input is a Python native type; in this case, the graph seems to be built by looking at the parameters’ values (please use tf.Tensor everywhere!),
  • reuses an already-built graph for a Python native type by comparing the parameters’ values (not their types, hence 1.0 == 1),
  • still contains, in the graph-converted code, a reference to the Python code that is executed only once.

In addition, we found out that certain functions work with tf.function but the autograph module can’t convert them; how is this possible?

This second part was meant to be the last one, but due to the overwhelming list of peculiarities to know and analyze, the behavior of tf.function and AutoGraph when the function body is more complex than a print invocation hasn’t been analyzed yet; so stay tuned for part 3!

If you find this article useful, feel free to share it using the buttons below!

Update: link to part 3.

