Using AOT compilation

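What is tfcompile?

tfcompile is a standalone tool that ahead-of-time (AOT) compiles TensorFlow graphs into executable code. It can reduce total binary size and avoid some runtime overhead. A typical use case for tfcompile is compiling an inference graph into executable code for mobile devices.

A TensorFlow graph is normally executed by the TensorFlow runtime. This incurs some runtime overhead for the execution of each node in the graph. It also leads to a larger total binary size, because the code for the TensorFlow runtime must be available in addition to the graph itself. The executable code produced by tfcompile does not use the TensorFlow runtime, and depends only on the kernels that are actually used in the computation.

The compiler is built on top of the XLA framework. The code bridging TensorFlow to the XLA framework resides under tensorflow/compiler, which also includes support for just-in-time (JIT) compilation of TensorFlow graphs.

What does tfcompile do?

tfcompile takes a subgraph, identified by the TensorFlow concepts of feeds and fetches, and generates a function that implements that subgraph. The feeds are the input arguments of the function, and the fetches are its output arguments. All inputs must be fully specified by the feeds; the resulting pruned subgraph cannot contain Placeholder or Variable nodes. It is common to specify all Placeholders and Variables as feeds, which ensures that the resulting subgraph no longer contains these nodes. The generated function is packaged as a cc_library, with a header file exporting the function signature and an object file containing the implementation. The user writes code to invoke the generated function as appropriate.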

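Using tfcompile

This section details the high-level steps for generating an executable binary with tfcompile from a TensorFlow subgraph. The steps are:

Step 1: Configure the subgraph to compile

Step 2: Use the tf_library build macro to compile the subgraph

Step 3: Write code to invoke the subgraph

Step 4: Create the final binary

Step 1: Configure the subgraph to compile

Identify the feeds and fetches that correspond to the input and output arguments of the generated function. Then configure the feeds and fetches in a tensorflow.tf2xla.Config proto.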

# Each feed is a positional input argument for the generated function.  The order
# of each entry matches the order of each input argument.  Here "x_hold" and "y_hold"
# refer to the names of placeholder nodes defined in the graph.
feed {
  id { node_name: "x_hold" }
  shape {
    dim { size: 2 }
    dim { size: 3 }
  }
}
feed {
  id { node_name: "y_hold" }
  shape {
    dim { size: 3 }
    dim { size: 2 }
  }
}

# Each fetch is a positional output argument for the generated function.  The order
# of each entry matches the order of each output argument.  Here "x_y_prod"
# refers to the name of a matmul node defined in the graph.
fetch {
  id { node_name: "x_y_prod" }
}

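Step 2: Use the tf_library build macro to compile the subgraph

This step converts the graph into a cc_library using the tf_library build macro. The cc_library consists of an object file containing the code generated from the graph, along with a header file that gives access to the generated code. tf_library uses tfcompile to compile the TensorFlow graph into executable code.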

load("//third_party/tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

# Use the tf_library macro to compile your graph into executable code.
tf_library(
    # name is used to generate the following underlying build rules:
    # <name>           : cc_library packaging the generated header and object files
    # <name>_test      : cc_test containing a simple test and benchmark
    # <name>_benchmark : cc_binary containing a stand-alone benchmark with minimal deps;
    #                    can be run on a mobile device
    name = "test_graph_tfmatmul",
    # cpp_class specifies the name of the generated C++ class, with namespaces allowed.
    # The class will be generated in the given namespace(s), or if no namespaces are
    # given, within the global namespace.
    cpp_class = "foo::bar::MatMulComp",
    # graph is the input GraphDef proto, by default expected in binary format.  To
    # use the text format instead, just use the '.pbtxt' suffix.  A subgraph will be
    # created from this input graph, with feeds as inputs and fetches as outputs.
    # No Placeholder or Variable ops may exist in this subgraph.
    graph = "test_graph_tfmatmul.pb",
    # config is the input Config proto, by default expected in binary format.  To
    # use the text format instead, use the '.pbtxt' suffix.  This is where the
    # feeds and fetches were specified above, in the previous step.
    config = "test_graph_tfmatmul.config.pbtxt",
)

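To generate the GraphDef proto (test_graph_tfmatmul.pb) for this example, run make_test_graphs.py and specify the output location with the --out_dir flag.

Typical graphs contain Variables representing the weights learned during training, but tfcompile cannot compile a subgraph that contains Variables. The freeze_graph.py tool converts variables into constants, using the values stored in a checkpoint file. As a convenience, the tf_library macro supports a freeze_checkpoint argument, which runs the tool. For more examples, see tensorflow/compiler/aot/tests/BUILD.

Constants that appear in the compiled subgraph are compiled directly into the generated code. To pass constants into the generated function instead of having them compiled in, pass them in as feeds.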
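The difference between a compiled-in constant and a constant passed as a feed can be pictured with ordinary C++. The sketch below is only an analogy: ScaleCompiledIn, ScaleAsFeed, and the value 0.5 are hypothetical names and data made up for illustration, and neither function is tfcompile output.

```cpp
// Analogy only; these functions are hypothetical and not generated by tfcompile.

// A constant left inside the subgraph ends up baked into the generated code:
// changing it means rebuilding.
inline float ScaleCompiledIn(float x) {
  constexpr float kWeight = 0.5f;  // fixed at build time
  return x * kWeight;
}

// The same value supplied as a feed becomes an ordinary input argument that
// the caller provides on every call.
inline float ScaleAsFeed(float x, float weight) {
  return x * weight;
}
```

Both forms compute the same result for the same weight; the trade-off is between a smaller call interface and the ability to change the value without recompiling.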

For details on the tf_library build macro, see tfcompile.bzl.

For details on the underlying tfcompile tool, see tfcompile_main.cc.

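Step 3: Write code to invoke the subgraph

This step uses the header file (test_graph_tfmatmul.h) generated by the tf_library build macro in the previous step to invoke the generated code. The header file is located in the bazel-genfiles directory corresponding to the build package, and is named after the name attribute set on the tf_library build macro. For example, the header generated for test_graph_tfmatmul is test_graph_tfmatmul.h. Below is an abbreviated version of what is generated. The generated file in bazel-genfiles contains additional useful comments.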

namespace foo {
namespace bar {

// MatMulComp represents a computation previously specified in a
// TensorFlow graph, now compiled into executable code.
class MatMulComp {
 public:
  // AllocMode controls the buffer allocation mode.
  enum class AllocMode {
    ARGS_RESULTS_AND_TEMPS,  // Allocate arg, result and temp buffers
    RESULTS_AND_TEMPS_ONLY,  // Only allocate result and temp buffers
  };

  MatMulComp(AllocMode mode = AllocMode::ARGS_RESULTS_AND_TEMPS);
  ~MatMulComp();

  // Runs the computation, with inputs read from arg buffers, and outputs
  // written to result buffers. Returns true on success and false on failure.
  bool Run();

  // Arg methods for managing input buffers. Buffers are in row-major order.
  // There is a set of methods for each positional argument.
  void** args();

  void set_arg0_data(float* data);
  float* arg0_data();
  float& arg0(size_t dim0, size_t dim1);

  void set_arg1_data(float* data);
  float* arg1_data();
  float& arg1(size_t dim0, size_t dim1);

  // Result methods for managing output buffers. Buffers are in row-major order.
  // Must only be called after a successful Run call. There is a set of methods
  // for each positional result.
  void** results();

  float* result0_data();
  float& result0(size_t dim0, size_t dim1);
};

}  // end namespace bar
}  // end namespace foo

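The generated C++ class is called MatMulComp in the foo::bar namespace, because that is the cpp_class specified in the tf_library macro. All generated classes have a similar API; the only difference is the set of methods that handle the arg and result buffers. Those methods differ based on the number and types of the buffers, which are specified by the feed and fetch entries of the config passed to the tf_library macro.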
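As a rough illustration of what an accessor such as arg0(dim0, dim1) does over a row-major buffer, the sketch below indexes a flat float array as a matrix. RowMajor2D and RowMajorExample are hypothetical names written for this page, not part of the generated header; the sketch assumes the plain row-major layout stated in the header comments.

```cpp
#include <cstddef>

// Hypothetical stand-in for the generated accessors (not tfcompile output).
// Views a flat float buffer as a rows x cols matrix in row-major order.
struct RowMajor2D {
  float* data;
  std::size_t rows;
  std::size_t cols;

  // Element (dim0, dim1) lives at flat offset dim0 * cols + dim1.
  float& operator()(std::size_t dim0, std::size_t dim1) const {
    return data[dim0 * cols + dim1];
  }
};

// Reads element (1, 2) of a 2x3 buffer holding the values 1..6.
inline float RowMajorExample() {
  float buf[6] = {1, 2, 3, 4, 5, 6};
  RowMajor2D x{buf, 2, 3};
  return x(1, 2);  // flat offset 1 * 3 + 2 = 5, i.e. buf[5]
}
```

Here the dimensions are runtime fields for simplicity; in the generated class they follow from the shapes given in the feeds.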

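The generated class manages three types of buffers: args representing the inputs, results representing the outputs, and temps representing temporary buffers used internally to perform the computation. By default, each instance of the generated class allocates and manages all of these buffers for you. The AllocMode constructor argument can be used to change this behavior. A convenience library is provided in tensorflow/compiler/aot/runtime.h to help with manual buffer allocation; use of this library is optional. All buffers should be aligned to 32-byte boundaries.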
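When buffers are allocated manually, the 32-byte alignment requirement has to be met by the caller. The following is a minimal sketch of one way to do that with the C++17 standard library; it does not use tensorflow/compiler/aot/runtime.h, and AllocAlignedFloats, IsAligned32, FreeDeleter, and kAlign are hypothetical names introduced here for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Hypothetical helpers; not taken from tensorflow/compiler/aot/runtime.h.
constexpr std::size_t kAlign = 32;  // alignment required for all buffers

// Releases memory obtained from std::aligned_alloc.
struct FreeDeleter {
  void operator()(void* p) const { std::free(p); }
};
using AlignedFloats = std::unique_ptr<float[], FreeDeleter>;

// Allocates room for `count` floats on a 32-byte boundary.
inline AlignedFloats AllocAlignedFloats(std::size_t count) {
  std::size_t bytes = count * sizeof(float);
  // std::aligned_alloc expects the size to be a multiple of the alignment,
  // so round up (and never request zero bytes).
  bytes = (bytes + kAlign - 1) / kAlign * kAlign;
  if (bytes == 0) bytes = kAlign;
  return AlignedFloats(static_cast<float*>(std::aligned_alloc(kAlign, bytes)));
}

// Returns true if `p` sits on a 32-byte boundary.
inline bool IsAligned32(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % kAlign == 0;
}
```

With AllocMode::RESULTS_AND_TEMPS_ONLY, a buffer obtained this way would presumably be handed to set_arg0_data before calling Run; the caller then owns the buffer and must keep it alive for as long as the computation uses it.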

The generated C++ class is just a wrapper around the low-level code generated by XLA.

Example of invoking the generated function, based on tfcompile_test.cc:

#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include <algorithm>  // std::copy
#include <iostream>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h" // generated

int main(int argc, char** argv) {
  Eigen::ThreadPool tp(2);  // Size the thread pool as appropriate.
  Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());

  foo::bar::MatMulComp matmul;
  matmul.set_thread_pool(&device);

  // Set up args and run the computation.
  const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
  std::copy(args + 0, args + 6, matmul.arg0_data());
  std::copy(args + 6, args + 12, matmul.arg1_data());
  matmul.Run();

  // Check result
  if (matmul.result0(0, 0) == 58) {
    std::cout << "Success" << std::endl;
  } else {
    std::cout << "Failed. Expected value 58 at 0,0. Got:"
              << matmul.result0(0, 0) << std::endl;
  }

  return 0;
}
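To see where the expected value 58 comes from, the same computation can be written as a plain C++ reference. This is a sketch for checking the arithmetic by hand, assuming the 2x3 and 3x2 row-major shapes from the config in step 1; ReferenceMatMul and ReferenceExample are hypothetical names and are unrelated to the code XLA generates.

```cpp
#include <array>
#include <cstddef>

// Plain reference for the example: (2x3) * (3x2) -> (2x2), all row-major.
// Not tfcompile output; only meant for checking values by hand.
inline std::array<float, 4> ReferenceMatMul(const float* x, const float* y) {
  constexpr std::size_t kM = 2, kK = 3, kN = 2;
  std::array<float, 4> out{};  // zero-initialized
  for (std::size_t i = 0; i < kM; ++i) {
    for (std::size_t j = 0; j < kN; ++j) {
      for (std::size_t k = 0; k < kK; ++k) {
        out[i * kN + j] += x[i * kK + k] * y[k * kN + j];
      }
    }
  }
  return out;
}

// Uses the same input values as the example above: the first six floats are
// arg0 (2x3) and the last six are arg1 (3x2).
inline std::array<float, 4> ReferenceExample() {
  const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
  return ReferenceMatMul(args, args + 6);
}
```

Element (0, 0) is 1*7 + 2*9 + 3*11 = 58, which matches the check against matmul.result0(0, 0) in the example.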

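Step 4: Create the final binary

This step combines the library generated by tf_library in step 2 and the code written in step 3 to create the final binary. Below is an example bazel BUILD file.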

# Example of linking your binary
# Also see //third_party/tensorflow/compiler/aot/tests/BUILD
load("//third_party/tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

# The same tf_library call from step 2 above.
tf_library(
    name = "test_graph_tfmatmul",
    ...
)

# The executable code generated by tf_library can then be linked into your code.
cc_binary(
    name = "my_binary",
    srcs = [
        "my_code.cc",  # include test_graph_tfmatmul.h to access the generated header
    ],
    deps = [
        ":test_graph_tfmatmul",  # link in the generated object file
        "//third_party/eigen3",
    ],
    linkopts = [
        "-lpthread",
    ],
)