At the end of my comparison of TensorFlow 1.14 Keras’ API versus Julia’s Flux.jl and Knet.jl high level APIs, I indicated some future write-ups I plan to do, one of which is (obviously) a comparison of the low level APIs. However, with the release of the much anticipated TensorFlow 2.0, I decided not to wait for the next comparison and instead dedicate a separate article to the said release. The goal of this blog post then is simply to replicate the modeling in my previous article without using the Keras API. Specifically, I’ll discuss the steps for constructing a custom layer, how we can chain multiple layers as in Keras’ Sequential, how to do a feedforward pass and a backward pass that updates the weights using the gradient, and how to batch the training datasets.

To start with, load the necessary libraries by running the following code:

I should emphasize that Line 7 is optional; I’m using it for annotating function parameters.

Define the Constants

The following are the constants that we will be using in the succeeding code:

As you notice, these are the hyperparameters that we configured for the model below. Specifically, we use ReLU (tf.nn.relu) as the activation for the first hidden layer and softmax (tf.nn.softmax) for the output layer. Other parameters include the batch size, the number of epochs, and the optimizer, which in this case is Adam (tf.keras.optimizers.Adam). Feel free to change the items above; for example, you can explore different optimizers or activation functions.

The Iris dataset is available in Python’s Scikit-Learn library and can be loaded as follows:

xdat and ydat are both np.ndarray objects, and we are going to partition these into training and testing datasets using the Data class defined in the next section.

Define the Data Class

To organize the data processing, below is the class with methods for partitioning, tensorizing, and batching the input data:
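To make those three steps concrete, here is a minimal NumPy sketch (not the Data class itself). The function names partition, one_hot and batches, the 80/20 split and the toy arrays are all assumptions made for illustration; converting the resulting arrays to tensors (for instance with tf.convert_to_tensor) is left out so that the sketch runs without TensorFlow.

```python
import numpy as np

def partition(x, y, train_frac=0.8, seed=0):
    """Shuffle the rows and split them into training and testing parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    cut = int(train_frac * len(x))
    trn, tst = idx[:cut], idx[cut:]
    return x[trn], y[trn], x[tst], y[tst]

def one_hot(y, n_classes):
    """Encode integer labels as one-hot row vectors."""
    return np.eye(n_classes, dtype=np.float32)[y]

def batches(x, y, batch_size):
    """Yield consecutive minibatches; the last one may be smaller."""
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

# Toy stand-in for the iris arrays: 10 samples, 4 features, 3 classes.
xdat = np.arange(40, dtype=np.float32).reshape(10, 4)
ydat = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])

xtrn, ytrn, xtst, ytst = partition(xdat, ydat)
ytrn_oh = one_hot(ytrn, n_classes=3)
batch_shapes = [(xb.shape, yb.shape)
                for xb, yb in batches(xtrn, ytrn_oh, batch_size=3)]
```

Using a generator for the batches keeps memory use flat and makes the inner training loop a plain for loop over (inputs, targets) pairs.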

Data Cleaning/Processing

We then apply the above code to the iris dataset as shown below:

Defining the Dense Layer

The model, as mentioned in my previous article, is a MultiLayer Perceptron (MLP) with one hidden layer; please refer to that article for more details. In order to use the low level APIs, we need to define the Dense layer from scratch. As a guide, below is the mathematical formulation of the layer: $$\mathbf{y} = \sigma\left(\mathbf{x}\cdot\mathbf{W} + \mathbf{b}\right),$$ where $\sigma$ is the activation function, which can be ReLU, sigmoid, etc. For the above dot product to work, $\mathbf{x}$ must be a row-vector, i.e.: $$\mathbf{x} = \left[ x_1,x_2,\cdots,x_n \right]$$ where $n$ is the number of features. Thus, $\mathbf{W}$ is an $n\times m$ matrix, with $n$ as the dimension of the input layer and $m$ as the dimension of the next layer (hidden or output), and $\mathbf{b}$ is a row-vector of length $m$. The weights matrix and the bias vector need to be initialized to some random starting values. There are several ways to do this, one of which is to use samples from the standard Gaussian distribution, i.e.

\begin{align} \mathbf{W} = [(w_{ij})]\quad &\textrm{and}\quad \mathbf{b} = [(b_j)],\nonumber\\ w_{ij}, b_j \sim \mathcal{N}(0,1), \;i = &\{1,\cdots,n\}, \;j=\{1,\cdots,m\}\nonumber \end{align}

Translating all these into code, we have the following:

Lines 3-6 initialize the Dense layer by assigning starting values to the weights, and Lines 8-12 define the feedforward. Further, if there is something special in the above code, it’s the tf.Variable. This class tells TensorFlow that the object is learnable during model optimization. Lastly, I want to emphasize that prior to TensorFlow 2.0, tf.random.normal was named tf.random_normal.
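As a rough illustration of the formulation above, a Dense layer can be sketched as follows. This is a minimal sketch and does not follow the line numbering referred to in the text; the argument names n_in and n_out, the default activation, and the promotion of a single sample to a row vector are my assumptions.

```python
import tensorflow as tf

class Dense:
    """A fully connected layer: y = activation(x . W + b)."""

    def __init__(self, n_in, n_out, activation=tf.nn.relu):
        # tf.Variable marks W and b as learnable; both start from N(0, 1) draws.
        self.W = tf.Variable(tf.random.normal([n_in, n_out]))
        self.b = tf.Variable(tf.random.normal([n_out]))
        self.activation = activation

    def __call__(self, x):
        # A single sample is promoted to a 1 x n_in row vector before the product.
        if len(x.shape) == 1:
            x = tf.expand_dims(x, 0)
        return self.activation(tf.matmul(x, self.W) + self.b)

layer = Dense(4, 5)
x = tf.random.normal([3, 4])   # three samples with four features each
out = layer(x)                 # one row of five activations per sample
```

Because no seed is set, the numbers in out change from run to run; only the shape and the non-negativity imposed by ReLU are fixed.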

Feedforward

We can now use the Dense layer to perform feedforward computation as follows:

The above code uses the first row and the first five rows, respectively, of data.xtrn as input. Of course, the output values will differ from run to run, since we did not specify any seed for the initial weights and biases.

Defining the Chain

Now that we know how to initialize a single Dense layer, we need to extend it to multiple layers to come up with an MLP architecture. Hence, what we need now is a Chain for layers. In short, we need to replicate Keras’ Sequential (though not all of its attributes and methods), and this is done as follows:

Let me walk you through it, method by method. To initialize this class, we need a list of Dense layers (referring to self.layers at Line 4 above) as input. Once we have that, we can do feedforward computation using the untrained weights by simply calling the object (as defined by __call__).
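The feedforward part of such a container boils down to passing the input through the layers in order. Here is a minimal sketch of just that part, assuming a Chain that only stores the layers and defines __call__; plain Python functions (double and shift, both hypothetical) stand in for Dense layers to keep it dependency-free.

```python
class Chain:
    """Apply a list of layers one after another (feedforward only)."""

    def __init__(self, layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)      # the output of one layer is the input of the next
        return x

# Plain functions stand in for Dense layers in this sketch.
double = lambda v: [2 * t for t in v]
shift = lambda v: [t + 1 for t in v]

model = Chain([double, shift])
result = model([1, 2, 3])     # double first, then shift
```

Any callable works as a layer here, which is why the same loop carries over unchanged once the stand-ins are replaced by Dense objects.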

To train the weights, we need to call the backward method, which does backpropagation. This is done by optimize-ing the weights, using the gradient obtained by differentiating the loss function of the model with respect to the learnable parameters (referring to self.params in Line 12 above). In TensorFlow 2.0, it is recommended to use the Keras optimizers (keras.optimizers) instead of tf.train.GradientDescentOptimizer. One thing to note as well is the loss function, which is .categorical_crossentropy as opposed to the .sparse_categorical_crossentropy used in my previous article. The difference is due to the fact that the target variable we have here is encoded as a one-hot vector (you can confirm this under the to_tensor method of the Data class above), and thus categorical cross entropy is used; with integer-encoded targets, sparse categorical cross entropy is the appropriate choice.
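The two losses compute the same quantity and differ only in how the target is encoded. The NumPy sketch below illustrates this; the two functions are simplified stand-ins written for this example, not the Keras implementations, and the probabilities are made up.

```python
import numpy as np

def categorical_crossentropy(y_onehot, probs):
    """Per-sample loss when targets are one-hot rows."""
    return -np.sum(y_onehot * np.log(probs), axis=1)

def sparse_categorical_crossentropy(y_int, probs):
    """Per-sample loss when targets are integer class labels."""
    return -np.log(probs[np.arange(len(y_int)), y_int])

# Made-up predicted class probabilities for two samples and three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y_int = np.array([0, 1])          # integer encoding
y_onehot = np.eye(3)[y_int]       # the same labels, one-hot encoded

loss_onehot = categorical_crossentropy(y_onehot, probs)
loss_sparse = sparse_categorical_crossentropy(y_int, probs)
```

In both cases the loss of a sample is the negative log of the probability assigned to its true class; the one-hot version selects that probability by multiplication, the sparse version by indexing.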

Referring back to the code, I think Lines 33-34 are self-explanatory and should redirect us to the grad function, where we find the tf.GradientTape. At first glance it is not obvious what exactly it does, apart from the high level understanding that it has something to do with the gradient computation (in fact, it is the main moving part). From the name itself, we can think of it as a “Tape Recorder”: it records the operations applied to the trainable variables (referring to tf.Variable in Lines 4-5 of the Dense layer) while the loss of the model is being computed. That is, when we do the forward operation (referring to self(inputs) at Line 28 above) inside the tape’s context, TensorFlow keeps track of every step, and can therefore automatically differentiate the loss with respect to the parameters afterwards. In order then to update the weights under the Adam algorithm (referring to opt in Line 34 above), we need these gradients at the given iteration. Thus we have a recorder (tf.GradientTape in this case) from which the gradients are extracted.
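To see the recorder at work outside the Chain class, here is a minimal sketch on a toy softmax model. The names W, b, loss_fn and lr and the random data are assumptions for illustration, and I deliberately swap Adam for a single plain gradient-descent step (assign_sub) so that the update itself stays visible.

```python
import tensorflow as tf

# Two learnable parameters of a toy linear model (hypothetical names).
W = tf.Variable(tf.random.normal([4, 3]))
b = tf.Variable(tf.zeros([3]))
params = [W, b]

x = tf.random.normal([8, 4])                                     # 8 samples, 4 features
y = tf.one_hot(tf.constant([0, 1, 2, 0, 1, 2, 0, 1]), depth=3)   # one-hot targets

def loss_fn():
    probs = tf.nn.softmax(tf.matmul(x, W) + b)
    return tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y, probs))

with tf.GradientTape() as tape:
    loss_before = loss_fn()      # operations on W and b are recorded here

# Replay the tape to get d(loss)/dW and d(loss)/db.
grads = tape.gradient(loss_before, params)

lr = 0.01
for p, g in zip(params, grads):
    p.assign_sub(lr * g)         # one plain gradient-descent step

loss_after = loss_fn()
```

Only the forward computation has to sit inside the with block; the call to tape.gradient and the update happen after the recording has finished.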

Finally, we chain the layers as follows:

This extends the previous model (referring to layer object above) into chains of two layers, and we can call it as follows:

The output above is the prediction of the model using the untrained weights.

Training

At this point, we can now optimize the parameters by calling the backward method; and because we are not using Keras’ Sequential([...]).fit, we can customize our training procedure to suit our needs, starting with the custom definition of the model accuracy:

Here’s the code for estimating the weights:

As you can see, we have three loops, two of which are inner loops over the minibatches of the training and testing datasets. Needless to say, the minibatches used for the testing dataset above are not really necessary, since we can have a single batch for validation. However, we have them for the purpose of comparing the performance of the optimization algorithm on both a single batch and three minibatches. Finally, the following tabulates the statistics we obtained from model estimation.

The following plot shows the loss of the model across 500 epochs. Since I did not specify any seed for the weights’ initial values, you will likely get a different curve:

The corresponding accuracy is depicted below:

For the code of the above figures, please refer to this link. With regards to the series above, the optimization using three minibatches overfitted the data after 400+ epochs. This is evident in both figures, where we find a decrease in accuracy. Thus, it is recommended to have model checkpointing and early stopping during calibration, and I’ll leave that to the reader.
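For readers who want a starting point, the early-stopping rule can be sketched in a few lines of plain Python. This is only one possible rule (stop once the validation loss has not improved for a given number of epochs); the function name early_stopping_epoch, the patience value and the loss curve are hypothetical and not taken from the experiment above.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: this is where a checkpoint would be saved.
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1   # never triggered: all epochs were run

# Hypothetical validation-loss curve that starts to rise after epoch 3.
losses = [0.9, 0.6, 0.5, 0.4, 0.45, 0.5, 0.55, 0.6]
stop_at = early_stopping_epoch(losses, patience=3)
```

In a real training loop the same bookkeeping would live inside the epoch loop, with the weights at the best epoch saved so they can be restored after stopping.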

End Note

That’s it, I have shown you how to do modeling using TensorFlow 2.0’s Core APIs. As an end note, I want to highlight two key points on the importance of using the low-level APIs. The first one is having full control of your end-to-end modeling process. Having flexible tools enables the user to solve problems with custom models, custom objective functions, custom optimization algorithms, and whatnot. The second point is the appreciation of the theory. It is simply fulfilling to see how the theory works in practice, and it gives the user the confidence to experiment, for example with the gradients and other internals.

Lastly, I am pleased with the clean API of TF 2.0, as opposed to the redundant APIs we had in the previous versions; and having eager execution as the default configuration makes the library even more pythonic. Feel free to share your thoughts if you have comments or suggestions.

Next Steps

In my next article, I will likely start on TensorFlow Probability, which extends the TF core APIs by incorporating a Bayesian approach to modeling and statistical analyses. Otherwise, I will touch on modeling image datasets, or present a new topic.

Complete Codes

If you are impatient, here is the complete code excluding the plots. This should work after installing the required libraries: