Assignment2

Problem

Previously in 2_fullyconnected.ipynb, you trained a logistic regression and a neural network model. The goal of this assignment is to explore regularization techniques.

I originally kept these notes on Sina Blog (新浪博客), but it has no auto-save and I lost a large chunk of writing in one go. Very frustrating, so I am writing my study record directly in a notebook instead. Back to the topic: first, let's get an intuitive feel for the regularization term:
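For reference, the regularization term under discussion is the standard L2 penalty added on top of the data loss,

$$ L = L_{\text{data}} + \frac{\beta}{2}\sum_i w_i^2 $$

where the half-sum of squared weights is exactly what tf.nn.l2_loss computes, and beta is the coefficient we have to choose.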

As you can see, this term is added on top of the original loss. Given this form:

1. The penalty is always non-negative; it is zero only when w is zero.

2. It depends on w: the larger w gets, the larger the total loss, which is why an L2 loss of this form is often called a penalty on w.

Look at the shape of this regularization term: it is a very regular function, a "bowl" whose minimum is 0. If we want the total loss to be as small as possible, the regularizer has to be pressed down near the bottom of the bowl. For weights that are not important, where the regularizer's influence far outweighs the original loss, those unimportant weights are driven almost to zero; in other words, the whole model becomes much leaner. With L1 regularization you get a genuinely sparse solution, which is even easier to understand: if the model is more complex than the problem requires, or the number of parameters exceeds the complexity the data can support, there will be many free weights. Free weights that the training samples never really exercise can behave very badly on test samples, so if they do nothing useful, cut them off outright.
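This assignment only uses L2, but for comparison an L1 penalty can be written in the same style; reg_beta_l1 below is a made-up coefficient, not something tuned in these notes:

# L1 penalty: sum of absolute weight values, tends to drive weights exactly to zero
reg_beta_l1 = 0.001
loss = loss + reg_beta_l1 * tf.reduce_sum(tf.abs(weights))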

This can also be explained from a MAP (maximum a posteriori) point of view: the regularization term amounts to placing a zero-mean Gaussian prior on w, with variance 1/beta. A solution for w obtained without the regularization term therefore assumes no prior on w at all, or equivalently a Gaussian prior with infinite variance (which is no longer really a Gaussian). As for whether the solution for w actually has Gaussian character, experiments suggest it mostly does; my personal view is that if w is a large, all-encompassing collection of parameters, the central limit theorem should make it roughly Gaussian.
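To make the MAP connection concrete: with an independent zero-mean Gaussian prior of variance 1/beta on each weight,

$$ p(w_i) \propto \exp\!\Big(-\frac{\beta}{2} w_i^2\Big) \quad\Rightarrow\quad -\log p(w) = \frac{\beta}{2}\sum_i w_i^2 + \text{const}, $$

so maximizing the posterior (likelihood times prior) is the same as minimizing the data loss plus exactly this L2 penalty.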

So:

3. The larger beta is, the more the solution for w favors stability (the weight values are smaller overall, and the free weights in particular should be even closer to zero); the smaller beta is, the more w is driven purely toward fitting the training data well.

So how do we pick beta? Go back to why the regularizer is added in the first place: to reduce overfitting, which in practice means improving validation accuracy. Use the value of the original loss at the point where validation accuracy starts to drop, or where the loss reaches its flat plateau, as a reference for setting beta.

For example:

Minibatch loss at step 0: 20.254705
Minibatch accuracy: 12.5%
Validation accuracy: 13.8%
2991.0
Minibatch loss at step 500: 1.555762
Minibatch accuracy: 73.4%
Validation accuracy: 76.1%
2522.21
Minibatch loss at step 1000: 1.609549
Minibatch accuracy: 73.4%
Validation accuracy: 76.7%
2261.39
Minibatch loss at step 1500: 1.476673
Minibatch accuracy: 74.2%
Validation accuracy: 77.2%
2069.25
Minibatch loss at step 2000: 1.114166
Minibatch accuracy: 75.8%
Validation accuracy: 77.6%
1911.93
Minibatch loss at step 2500: 1.026689
Minibatch accuracy: 74.2%
Validation accuracy: 78.2%
1777.95
Minibatch loss at step 3000: 0.807441
Minibatch accuracy: 79.7%
Validation accuracy: 78.5%
1666.99
Test accuracy: 85.8%

We first test this on the official SGD logistic regression model. In the log above (the bare numbers between the accuracy lines are the printed values of tf.nn.l2_loss(weights)), the original loss settles around 1 once validation accuracy stabilizes near 77%, while the L2 loss of w is on the order of 10^3. So I set the regularization weight to 0.001 here. Of course, you could also tune this parameter adaptively in a loop. Adding the regularizer:


# L2 penalty on the logistic regression weights
reg_beta = 0.001
loss = loss + reg_beta * tf.nn.l2_loss(weights)


Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)
Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)
Initialized
Minibatch loss at step 0: 18.735794
Minibatch accuracy: 13.3%
Validation accuracy: 15.6%
3033.46
Minibatch loss at step 500: 2.865627
Minibatch accuracy: 75.0%
Validation accuracy: 76.0%
1531.36
Minibatch loss at step 1000: 2.031458
Minibatch accuracy: 74.2%
Validation accuracy: 77.7%
810.517
Minibatch loss at step 1500: 1.358890
Minibatch accuracy: 79.7%
Validation accuracy: 80.0%
442.753
Minibatch loss at step 2000: 1.069137
Minibatch accuracy: 78.1%
Validation accuracy: 81.0%
250.141
Minibatch loss at step 2500: 0.828736
Minibatch accuracy: 82.8%
Validation accuracy: 81.6%
146.253
Minibatch loss at step 3000: 0.671674
Minibatch accuracy: 80.5%
Validation accuracy: 80.9%
90.3744
Test accuracy: 88.4%

We get roughly a two-point improvement. Next, let's add the regularization term to the two-layer fully connected network. Without it, the result is:

Minibatch loss at step 3000: 7.275184
Minibatch accuracy: 82.8%
Validation accuracy: 80.5%
916.006
Test accuracy: 87.8%

Following the same reasoning as before, add the regularization term with a weight of 0.01:


# L2 penalty on both weight matrices of the two-layer network
reg_beta = 0.01
loss = loss + reg_beta * (tf.nn.l2_loss(weights) + tf.nn.l2_loss(weights_1))


Minibatch loss at step 3000: 0.670678
Minibatch accuracy: 86.7%
Validation accuracy: 83.4%
7.05956
Test accuracy: 90.1%

Again the result improves by a bit more than two points, and tuning the coefficient might do even better. I am not going to, though: one run takes close to a minute, and the point here is just to see that there is a real improvement. (If you did want to sweep the coefficient, a rough outer loop is sketched below.)
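A hypothetical sketch of such a sweep; train_and_evaluate is an assumed helper (not defined in these notes) that builds the graph with the given coefficient, runs the usual training loop, and returns validation accuracy:

best_beta, best_acc = None, 0.0
for reg_beta in [0.0003, 0.001, 0.003, 0.01, 0.03]:
    valid_acc = train_and_evaluate(reg_beta)   # assumed helper
    if valid_acc > best_acc:
        best_beta, best_acc = reg_beta, valid_acc
print("best reg_beta:", best_beta, "validation accuracy:", best_acc)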

One thing worth mentioning: should the two weight matrices get separate regularization coefficients? My feeling is that separate coefficients would be best, but I was too lazy to try it; there may well be standard results on this, so I will just mark the question here (a sketch of what it would look like follows).
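For completeness, separate coefficients would simply mean one beta per weight matrix; the values below are placeholders, not tuned:

# hypothetical per-layer coefficients (untested values)
reg_beta_hidden = 0.005
reg_beta_out = 0.01
loss = loss + reg_beta_hidden * tf.nn.l2_loss(weights_1) + reg_beta_out * tf.nn.l2_loss(weights)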

Problem 2

Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

train_subset = 1000
data.train_dataset = data.train_dataset[0:train_subset, :]
data.train_labels = data.train_labels[0:train_subset]

We train on only a small subset (the first 1,000 examples) of the original training data. Data at this scale cannot support a complex model, yet our model still has a large number of parameters, so overfitting should be fairly obvious. The result:

Minibatch loss at step 3000: 0.339734
Minibatch accuracy: 100.0%
Validation accuracy: 79.2%
12.4676
Test accuracy: 86.4%

As you can see, training accuracy easily reaches 100% and the training loss becomes very small, while validation and test performance are nowhere near as good. Removing the regularization term would make the overfitting even more obvious, but since dropout is added next for comparison, I leave it in. Note that although the model overfits easily here, validation and test accuracy do not collapse, precisely because the regularizer is still active; without it the results would be worse, by an amount that depends on the shape of the loss surface and which local minimum the optimizer falls into.

Problem 3

Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides nn.dropout() for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?

Personally I did not think dropout and L2 regularization needed to be used together, but when I tried dropout alone, the absolute values of the weights and the loss grew very large, even though accuracy was barely affected. So I use both, which means the L2 coefficient probably needs retuning. I do not yet know any great tricks for tuning these two parameters jointly, but my own experiments support the following:

1. The lower keep_prob is, the more dropout prevents overfitting, but the weaker the model's expressive power, since most of the weights are effectively thrown away on every step. So first pick a dropout value that keeps the model expressive enough.

2. If overfitting still occurs, increase the regularization weight: dropout alone is not strong enough, and w is still too large and too numerous for the problem, so the penalty on w needs to be heavier.

(My personal reading of dropout is that it randomly and temporarily kills off weights, which during SGD can help individual steps escape local minima; globally, though, the constraint it places on w is not the same as the one an L2 regularizer imposes.)


# dropout on the hidden layer; during training the output layer
# consumes hidden_layer_drop instead of hidden1
keep_prob = tf.placeholder("float")
hidden_layer_drop = tf.nn.dropout(hidden1, keep_prob)

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (data.train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = data.train_dataset[offset:(offset + batch_size), :]
    batch_labels = data.train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob : 0.5}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
         session.run(train_prediction,feed_dict={tf_train_dataset: data.valid_dataset,keep_prob:1.0}), data.valid_labels))
      print(tf.nn.l2_loss(weights).eval())
  feed_dict = {tf_train_dataset: data.test_dataset,keep_prob:1.0}
  predictions=session.run(train_prediction, feed_dict=feed_dict)
  print("Test accuracy: %.1f%%" % accuracy(predictions, data.test_labels))

The main thing to watch is that during validation and testing all the units must be kept, i.e. keep_prob is fed as 1.0 (tf.nn.dropout already scales the kept activations by 1/keep_prob during training, so no other rescaling of the weights is needed at evaluation time). I tuned the parameters by hand two or three times, so this is not the best setting. Judging by the result, validation and test accuracy are no better than in the plain overfitting case. Still, the fact that training accuracy has not reached 100% at least suggests that further iterations might keep improving things.

Minibatch loss at step 3000: 0.685943
Minibatch accuracy: 94.5%
Validation accuracy: 78.8%
7.26925
Test accuracy: 86.0%

Problem 4

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is 97.1%.

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

global_step = tf.Variable(0)  # count the number of steps taken.
learning_rate = tf.train.exponential_decay(0.5, step, ...)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

Here I just add one more hidden layer to get the idea across, and then consider learning rate decay.

A quick note on learning_rate = tf.train.exponential_decay(0.5, global_step, decay_steps, decay_rate):

Here global_step counts how many training steps have been taken so far, and decay_steps sets how many steps make up one decay period; in other words, it controls how the decay relates to the step count.

The learning rate then decays according to the ratio global_step / decay_steps, with decay_rate as the base: at any point, lr = initial_lr * decay_rate^(global_step / decay_steps). As global_step grows, decay_rate^(global_step / decay_steps) shrinks (since decay_rate < 1), so the learning rate gradually slows down. I find this function a bit over-engineered; does it really need this many parameters? Since I did not want to tune yet another learning-rate parameter, I simply used the AdagradOptimizer instead.
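For reference, wiring up the decay schedule would look roughly like this; the decay_steps and decay_rate values below are placeholders rather than tuned choices:

global_step = tf.Variable(0, trainable=False)  # incremented once per optimizer step
learning_rate = tf.train.exponential_decay(
    0.5, global_step, decay_steps=1000, decay_rate=0.96)  # lr = 0.5 * 0.96^(global_step/1000)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)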

So I keep the dropout keep_prob fixed at 0.5 and only tune the regularization coefficient. The code:

hn1_num = 1024   # units in the first hidden layer
hn2_num = 512    # units in the second hidden layer

# dropout keep probability, fed at run time (0.5 for training, 1.0 for evaluation)
keep_prob = tf.placeholder("float")

with tf.name_scope('hidden') as scope_1:
    # First hidden layer variables.
    weights_1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hn1_num]), name='weights')
    biases_1 = tf.Variable(tf.zeros([hn1_num]), name='biases')
    print(weights_1.get_shape())
    hidden1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights_1) + biases_1)

# dropout on the first hidden layer
hidden_layer_drop1 = tf.nn.dropout(hidden1, keep_prob)

with tf.name_scope('hidden') as scope_2:
    # Second hidden layer variables.
    weights_2 = tf.Variable(
        tf.truncated_normal([hn1_num, hn2_num]), name='weights')
    biases_2 = tf.Variable(tf.zeros([hn2_num]), name='biases')
    print(weights_2.get_shape())
    hidden2 = tf.nn.relu(tf.matmul(hidden_layer_drop1, weights_2) + biases_2)

# dropout on the second hidden layer
hidden_layer_drop2 = tf.nn.dropout(hidden2, keep_prob)

with tf.name_scope('out') as scope_3:
    # Output layer variables.
    weights = tf.Variable(
        tf.truncated_normal([hn2_num, num_labels]), name='weights')
    biases = tf.Variable(tf.zeros([num_labels]), name='biases')
    print(weights.get_shape())
    logits = tf.matmul(hidden_layer_drop2, weights) + biases

# regularizer=[weights_1,weights]
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf_train_labels))
# reg_beta is defined elsewhere; tuned between 0.005 and 0.01 (see below)
loss = loss + reg_beta * (tf.nn.l2_loss(weights) + tf.nn.l2_loss(weights_2) + tf.nn.l2_loss(weights_1))

# Optimizer: Adagrad instead of plain SGD with a hand-tuned decay schedule.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
# optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

Then, as before, I hand-tuned the regularization coefficient a couple of times by bisection and found it works best somewhere between 0.005 and 0.01. I am actually somewhat suspicious of the coefficient being a constant: if the regularization term is too small at the start, the total loss decreases slowly, which indirectly slows convergence, while if it is fairly large, it dominates the total loss near convergence and hurts the model's final performance. My feeling is that the coefficient ought to be adjusted dynamically; I will just mark that question here (a sketch is given at the end of these notes). The results after 17,000 steps:

Minibatch accuracy: 85.2%
Validation accuracy: 85.6%
Test accuracy: 91.8%

Even after that many steps, overfitting has not set in, so in theory training could continue, but it is too slow and I am not going to keep testing. The result is still well short of the best reported numbers, though clearly stronger than the earlier logistic regression; there are many reasons for the gap, but I will not dig into them here, since the goal was just to get a feel for the techniques. Also, I effectively used the test set while tuning parameters inside the loop, which is not a sound methodology.
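On the question marked above, one way a dynamically adjusted regularization coefficient could be tried is to make it a placeholder and feed a schedule from the training loop. This is only a sketch: data_loss stands for the cross-entropy term from the code above, and the linear warm-up to 0.01 over the first 3000 steps is an untested guess.

reg_beta_ph = tf.placeholder(tf.float32)  # regularization coefficient fed per step
loss = data_loss + reg_beta_ph * (
    tf.nn.l2_loss(weights_1) + tf.nn.l2_loss(weights_2) + tf.nn.l2_loss(weights))

# inside the training loop, before session.run:
feed_dict[reg_beta_ph] = min(0.01, 0.01 * step / 3000.0)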
