Automatic differentiation - d2l.ai Exercises - Part 5
The fifth notebook in the series solving exercises from d2l.ai; this post tests and explores how the autodiff functionality in TensorFlow works.
- Why is the second derivative much more expensive to compute than the first derivative?
- After running the function for backpropagation, immediately run it again and see what happens.
- Question 3
- Redesign an example of finding the gradient of the control flow. Run and analyze the result.
- Let $f(x)=\sin(x)$. Plot $f(x)$ and $\frac{df(x)}{dx}$, where the latter is computed without exploiting that $f'(x)=\cos(x)$.
import tensorflow as tf
x = tf.range(4, dtype=tf.float32)
# wrap in a Variable so the tape records gradients for it
x = tf.Variable(x)
As we have learnt in our previous lessons, the chain rule is what lets us differentiate composite functions, and applying it a second time quickly multiplies the number of derivatives we need to evaluate. Let's take an example.
Suppose that functions $y=f(u)$ and $u=g(x)$ are both differentiable, then the chain rule states that,
$$\frac{dy}{dx} = \frac {dy}{du} * \frac {du}{dx}$$
Let's write $a=\frac{dy}{du}$ and $b=\frac{du}{dx}$. Differentiating a second time (using the product rule):
$$\frac{d^2y}{dx^2} = a \cdot \frac{db}{dx} + b \cdot \frac{da}{dx}$$
The number of derivatives to evaluate has doubled, and each new term requires the chain rule yet again, since $a$ and $b$ depend on $x$ through $u$. This compounding is the simple reason why the second derivative is expensive: in reverse-mode autodiff we must additionally build, and then backpropagate through, the graph of the first backward pass.
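As a rough illustration of that extra work, a second derivative in TensorFlow needs nested tapes: the outer tape has to record the first `gradient()` computation itself. A minimal sketch with $y = x^3$ (my own example, not from the lesson):

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = x ** 3          # forward pass: y = x^3
    dy = t1.gradient(y, x)  # first backward pass: 3x^2, recorded by t2
d2y = t2.gradient(dy, x)    # second backward pass: 6x

print(dy.numpy(), d2y.numpy())  # 27.0 18.0
```

The inner backward pass becomes part of the outer tape's graph, which is exactly the doubling of work described above.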
Let's check the value of x
x
with tf.GradientTape() as t:
    y = x * x
t.gradient(y, x)
t.gradient(y, x)
I get the above error when I run backward propagation a second time. Looking at the documentation of tf.GradientTape, it specifies:
By default, the resources held by a GradientTape are released as soon as GradientTape.gradient() method is called. To compute multiple gradients over the same computation, create a persistent gradient tape. This allows multiple calls to the gradient() method as resources are released when the tape object is garbage collected.
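Following the documentation, a sketch of what a persistent tape looks like (the second function `z` is my own addition for the demonstration):

```python
import tensorflow as tf

x = tf.Variable(tf.range(4, dtype=tf.float32))

# persistent=True keeps the recorded operations alive after the
# first gradient() call, so gradient() can be called repeatedly
with tf.GradientTape(persistent=True) as t:
    y = x * x
    z = y * x

dy_dx = t.gradient(y, x)  # 2x
dz_dx = t.gradient(z, x)  # 3x^2
del t  # release the tape's resources once we are done
```

With the default (non-persistent) tape, the second `gradient()` call raises the error we saw above.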
Question 3
TLDR: Use a vector/matrix as the input to the function used to demonstrate gradient calculation through a control flow.
Let's bring in the function used in the lesson.
def f(a):
    b = a * 2
    while tf.norm(b) < 1000:
        b = b * 2
    if tf.reduce_sum(b) > 0:
        c = b
    else:
        c = 100 * b
    return c
In the lesson a single-value random variable is used; the question asks us to use a vector or matrix in its place.
Let's use the following random vector of shape (1, 2)
a = tf.Variable(tf.random.normal(shape=(1, 2)))
a
The differentiation
with tf.GradientTape() as t:
    d = f(a)
d_grad = t.gradient(d, a)
d_grad
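Since `f` only ever multiplies its input by scalars, it computes `f(a) = k * a` for some constant `k`, so the gradient should equal `d / a` elementwise, just like the scalar check in the lesson. A quick verification sketch (my own addition):

```python
import tensorflow as tf

def f(a):
    b = a * 2
    while tf.norm(b) < 1000:
        b = b * 2
    if tf.reduce_sum(b) > 0:
        c = b
    else:
        c = 100 * b
    return c

a = tf.Variable(tf.random.normal(shape=(1, 2)))
with tf.GradientTape() as t:
    d = f(a)
d_grad = t.gradient(d, a)

# f is linear in a, so the gradient is just the scaling factor d / a
rel_err = tf.reduce_max(tf.abs(d_grad - d / a) / tf.abs(d / a))
print(float(rel_err))  # tiny: float rounding only
```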
We can try using a bigger matrix now.
a = tf.Variable(tf.random.normal(shape=(5, 3, 2)))
a
with tf.GradientTape() as t:
    d = f(a)
d_grad = t.gradient(d, a)
d_grad
Just as in the single-value case, the gradient has the same shape as the input, and this holds for the higher-dimensional vectors/matrices too.
I did not quite get what the question was trying to test the reader on. Reading the discussion, one person suggested using tf.hessians for a related question. Let's see why.
Let's switch back to a scalar variable
a = tf.Variable(tf.random.normal(shape=()), trainable=True)
a
The following is the suggested way to calculate the second-order derivative (a Hessian-vector-product style computation) with nested tapes
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        d1 = f(a)
    diff_1 = t1.gradient(d1, a)
diff_2 = t2.gradient(diff_1, a)
diff_2
diff_1, diff_2
The following is modified based on that suggestion to use jacobian instead, to see what happens.
with tf.GradientTape() as t4:
    t4.watch(a)
    with tf.GradientTape() as t3:
        t3.watch(a)
        d = f(a)
    dif = t3.gradient(d, a)
j_diff = t4.jacobian(dif, a)
dif, j_diff
Both methods return the same result for the small vector, with the second-order derivative coming out as None. On reflection, this makes sense: $f(a)$ only multiplies $a$ by constants, so it is (piecewise) linear in $a$; its gradient is a constant, and there is no gradient path from that constant back to $a$, which TensorFlow reports as None rather than zero. As for the tf.hessians suggestion: since $f(a)$ is no longer scalar, its gradient is itself a vector/matrix, so a second-order derivative is naturally a Jacobian/Hessian object, which may be what was meant. I am still not sure the suggestion helps here.
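To check that the None comes from the linearity of f rather than from the nested tapes themselves, here is a minimal sketch with a plain linear function standing in for `f(a) = k * a` (my own example):

```python
import tensorflow as tf

a = tf.Variable(2.0)

with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = 5.0 * a            # linear in a, like f(a) = k * a
    dy = t1.gradient(y, a)     # a constant (5.0): no dependence on a
d2y = t2.gradient(dy, a)       # None: no gradient path back to a

print(dy.numpy(), d2y)  # 5.0 None
```

Swapping the linear function for anything curved (e.g. `a ** 3`) makes `d2y` a real tensor again.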
The example given in the lesson was a little complex, so I will use a simpler control flow to see if we can calculate its gradient with autodiff.
Let's start with a simple transformation. We know that the derivative of the exponential function is the function itself.
def f(x):
    return tf.math.exp(x)
Let's take a single value
x = tf.Variable(tf.random.normal(shape=()), trainable=True)
x
with tf.GradientTape() as t:
    y = f(x)
dy_dx = t.gradient(y, x)
y == dy_dx
So now we can try performing some transformations on the value.
$$f(x) = \begin{cases} \sin x, & \text{if } \sum_n x_n > 0.5 \\ 2\cos x, & \text{otherwise} \end{cases}$$

def f(x):
    if tf.math.reduce_sum(x) > 0.5:
        x = tf.math.sin(x)
    else:
        x = 2 * tf.math.cos(x)
    return x
x = tf.Variable(tf.random.normal(shape=(20, 1)), trainable=True)
x
The differentiation
with tf.GradientTape() as t:
y = f(x)
dy_dx = t.gradient(y, x)
dy_dx
import matplotlib.pyplot as plt

x_axis = tf.range(0, 20, 1)
plt.plot(x_axis, tf.reshape(f(x), [-1]).numpy(), color='r', label='y')
plt.plot(x_axis, tf.reshape(dy_dx, [-1]).numpy(), color='g', label='dy_dx')
plt.legend(loc='lower right')
plt.show()
The graph seems to validate the differentiation: the green gradient curve follows the red function curve with the phase shift we expect from the sine/cosine pair.
Let's set up the required variables
# sine function
f = tf.math.sin
x = tf.Variable(tf.range(-10, 10, 0.1))
with tf.GradientTape() as t:
    y = f(x)
# we do not tell TensorFlow that the derivative is cos(x); autodiff works it out
dy_dx = t.gradient(y, x)
plt.figure(1)
x_axis = tf.range(-10, 10, 0.1)
plt.plot(x_axis, f(x).numpy(), color='r')
plt.plot(x_axis, dy_dx.numpy(), color='g')
plt.show()
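As a final sanity check (my own addition), the tape's output should agree with the analytic derivative $\cos(x)$ to within float32 rounding, even though we never gave autodiff that fact:

```python
import tensorflow as tf

x = tf.Variable(tf.range(-10, 10, 0.1))
with tf.GradientTape() as t:
    y = tf.math.sin(x)
dy_dx = t.gradient(y, x)

# compare the autodiff result against the known derivative cos(x)
max_err = tf.reduce_max(tf.abs(dy_dx - tf.math.cos(x)))
print(float(max_err))
```

The maximum error is at the level of float32 rounding noise, confirming the plot above.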