Recurrent Neural Networks

Introduction

Take a look at this great article for an introduction to recurrent neural networks and LSTMs in particular.

Language Modeling

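In this tutorial we show how to train a recurrent neural network on the challenging task of language modeling. The goal is to fit a probabilistic model that assigns probabilities to sentences. It does so by predicting the next word in a text given the history of previous words. For this purpose we use the Penn Tree Bank (PTB) dataset, a popular benchmark for measuring the quality of these models that is also small and relatively fast to train on.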

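Language modeling is key to many interesting problems such as speech recognition, machine translation, and image captioning. It is also fun: take a look here.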

In this tutorial we reproduce the results from Zaremba et al., 2014 (pdf), which achieve very good quality on the PTB dataset.

Tutorial Files

This tutorial references the following files from models/tutorials/rnn/ptb in the TensorFlow models repo:

File	Purpose
ptb_word_lm.py	The code to train a language model on the PTB dataset.
reader.py	The code to read the dataset.

Download and Prepare the Data

The data required for this tutorial is in the data/ directory of the PTB dataset from Tomas Mikolov's webpage.

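The dataset is already preprocessed and contains 10000 different words overall, including the end-of-sentence marker and a special symbol (<unk>) for rare words. In reader.py, we convert each word to a unique integer identifier so that the neural network can process the data easily.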
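The word-to-integer step can be illustrated with a minimal sketch in plain Python. This is not the code from reader.py: the helper names build_vocab and words_to_ids are hypothetical, and ordering the IDs by word frequency is an assumption made here for illustration.

```python
from collections import Counter

def build_vocab(words):
    """Map each distinct word to an integer ID, most frequent word first."""
    counts = Counter(words)
    # Sort by descending frequency, breaking ties alphabetically so the
    # mapping is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ordered)}

def words_to_ids(words, word_to_id):
    """Convert words to IDs, skipping any word missing from the vocabulary."""
    return [word_to_id[w] for w in words if w in word_to_id]

# Toy corpus; "<eos>" stands in for the end-of-sentence marker.
corpus = "the fox is quick <eos> the fox jumped high <eos>".split()
word_to_id = build_vocab(corpus)
ids = words_to_ids(corpus, word_to_id)
print(word_to_id)
print(ids)
```

Every occurrence of a word maps to the same ID, so the model only ever sees integers in the range 0 to vocabulary size minus one.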

The Model

LSTM

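The core of the model consists of an LSTM cell that processes one word at a time and computes the probabilities of the possible values for the next word in the sentence. The memory state of the network is initialized with a vector of zeros and is updated after reading each word. For computational reasons, we process the data in mini-batches of size batch_size. In this example, it is important to note that current_batch_of_words does not correspond to a "sentence" of words.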

Every word in a batch should correspond to a time t. TensorFlow will automatically sum the gradients of each batch for you.

For example:

 t=0  t=1    t=2  t=3     t=4
[The, brown, fox, is,     quick]
[The, red,   fox, jumped, high]

words_in_dataset[0] = [The, The]
words_in_dataset[1] = [brown, red]
words_in_dataset[2] = [fox, fox]
words_in_dataset[3] = [is, jumped]
words_in_dataset[4] = [quick, high]
num_batches = 5, batch_size = 2, time_steps = 5
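The regrouping by time step is just a transpose of the batch. A minimal sketch in plain Python, where the name words_by_time_step is hypothetical:

```python
sentences = [
    ["The", "brown", "fox", "is", "quick"],
    ["The", "red", "fox", "jumped", "high"],
]

# zip(*rows) transposes the batch: entry t holds the word that each
# sequence contributes at time step t.
words_by_time_step = [list(step) for step in zip(*sentences)]

for t, batch in enumerate(words_by_time_step):
    print(t, batch)
```

Each entry has batch_size words, and there is one entry per time step.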

The basic pseudocode is as follows:

import tensorflow as tf  # TensorFlow 1.x API

words_in_dataset = tf.placeholder(tf.float32, [num_batches, batch_size, num_features])
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory: a zero-filled (cell state, hidden state) pair.
state = lstm.zero_state(batch_size, tf.float32)
probabilities = []
loss = 0.0
# A placeholder cannot be iterated directly, so split it into one tensor per batch.
for current_batch_of_words in tf.unstack(words_in_dataset):
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)

    # The LSTM output can be used to make next word predictions.
    # softmax_w, softmax_b, loss_function and target_words are defined elsewhere.
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities.append(tf.nn.softmax(logits))
    # Score only the current prediction, not the whole list collected so far.
    loss += loss_function(probabilities[-1], target_words)

Truncated Backpropagation

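By design, the output of a recurrent neural network (RNN) depends on arbitrarily distant inputs. Unfortunately, this makes the backpropagation computation difficult. To make the learning process tractable, it is common practice to create an "unrolled" version of the network that contains a fixed number (num_steps) of LSTM inputs and outputs. The model is then trained on this finite approximation of the RNN. This can be implemented by feeding inputs of length num_steps at a time and performing a backward pass after each such input block.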

Here is a simplified block of code for creating a graph that performs truncated backpropagation:

# Placeholder for the inputs in a given iteration.
words = tf.placeholder(tf.int32, [batch_size, num_steps])

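# state_is_tuple=False keeps the LSTM state as one [batch_size, state_size]
# tensor, so it can be created with tf.zeros and fed back in as a numpy array.
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size, state_is_tuple=False)
# Initial state of the LSTM memory.
initial_state = state = tf.zeros([batch_size, lstm.state_size])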

for i in range(num_steps):
    # The value of state is updated after processing each batch of words.
    output, state = lstm(words[:, i], state)

    # The rest of the code.
    # ...

final_state = state

And this is how to implement an iteration over the whole dataset:

# A numpy array holding the state of LSTM after each batch of words.
numpy_state = initial_state.eval()
total_loss = 0.0
for current_batch_of_words in words_in_dataset:
    numpy_state, current_loss = session.run([final_state, loss],
        # Initialize the LSTM state from the previous iteration.
        feed_dict={initial_state: numpy_state, words: current_batch_of_words})
    total_loss += current_loss
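The data side of truncated backpropagation can be sketched in plain Python: the stream of word IDs is cut into batch_size parallel rows, and each row is consumed num_steps IDs at a time, with the targets shifted one position to the right. This is an illustrative sketch rather than the code in reader.py, and the function name truncated_windows is hypothetical.

```python
def truncated_windows(ids, batch_size, num_steps):
    """Yield (inputs, targets) blocks, each a batch_size x num_steps nested list."""
    # Split the ID stream into batch_size parallel rows of equal length.
    row_len = len(ids) // batch_size
    rows = [ids[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    # One ID per row is reserved so that the last input still has a target.
    num_windows = (row_len - 1) // num_steps
    for w in range(num_windows):
        start = w * num_steps
        inputs = [row[start:start + num_steps] for row in rows]
        targets = [row[start + 1:start + num_steps + 1] for row in rows]
        yield inputs, targets

windows = list(truncated_windows(list(range(20)), batch_size=2, num_steps=3))
for inputs, targets in windows:
    print(inputs, targets)
```

Consecutive blocks continue each row exactly where the previous block stopped, which is why it makes sense to carry the final LSTM state of one block over as the initial state of the next.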

Inputs

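Before being fed to the LSTM, the word IDs are embedded into a dense representation (see the Vector Representations Tutorial). This allows the model to efficiently represent knowledge about particular words. It is also easy to write: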

# embedding_matrix is a tensor of shape [vocabulary_size, embedding size]
word_embeddings = tf.nn.embedding_lookup(embedding_matrix, word_ids)

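The embedding matrix is initialized randomly, and the model learns to differentiate the meaning of words just by looking at the data.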
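Conceptually, an embedding lookup is row indexing into a table of vectors. The plain-Python sketch below illustrates the idea only; it is not how TensorFlow implements the operation, and the initialization range of -0.1 to 0.1 and the helper names are assumptions chosen for the example.

```python
import random

def make_embedding_matrix(vocabulary_size, embedding_size, seed=0):
    """Return a randomly initialized vocabulary_size x embedding_size table."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(embedding_size)]
            for _ in range(vocabulary_size)]

def embedding_lookup(embedding_matrix, word_ids):
    """Replace each word ID with the matching row of the matrix."""
    return [embedding_matrix[i] for i in word_ids]

embedding_matrix = make_embedding_matrix(vocabulary_size=7, embedding_size=4)
word_embeddings = embedding_lookup(embedding_matrix, [2, 1, 2])
print(word_embeddings)
```

The same ID always selects the same row, so training the matrix changes the representation of every occurrence of that word at once.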

Loss Function

We want to minimize the average negative log probability of the target words:

$$ \text{loss} = -\frac{1}{N}\sum_{i=1}^{N} \ln p_{\text{target}_i} $$

It is not very difficult to implement but the function sequence_loss_by_example is already available, so we can just use it here.

The typical measure reported in the papers is average per-word perplexity (often just called perplexity), which is equal to

$$e^{-\frac{1}{N}\sum_{i=1}^{N} \ln p_{\text{target}_i}} = e^{\text{loss}} $$

We will monitor its value throughout the training process.
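As a small worked example of the two formulas, the sketch below computes the loss and the perplexity from a list of probabilities that a model assigned to the correct target words. The probability values and the function names are made up for illustration.

```python
import math

def average_negative_log_prob(target_probs):
    """Mean of -ln(p) over the probabilities assigned to the target words."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def perplexity(target_probs):
    """Per-word perplexity, the exponential of the average loss."""
    return math.exp(average_negative_log_prob(target_probs))

target_probs = [0.25, 0.5, 0.125]
loss = average_negative_log_prob(target_probs)
ppl = perplexity(target_probs)
print(loss, ppl)

# A model that guesses uniformly over a 10000-word vocabulary.
uniform_ppl = perplexity([1.0 / 10000] * 5)
print(uniform_ppl)
```

The uniform case shows how to read the number: a model that spreads its probability evenly over the vocabulary has a perplexity equal to the vocabulary size, so lower values mean the model is effectively choosing among fewer candidate words.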

Stacking Multiple LSTMs

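To give the model more expressive power, we can add multiple layers of LSTMs to process the data. The output of the first layer becomes the input of the second, and so on.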

We have a class called MultiRNNCell that makes the implementation seamless:

def lstm_cell():
  return tf.contrib.rnn.BasicLSTMCell(lstm_size)
stacked_lstm = tf.contrib.rnn.MultiRNNCell(
    [lstm_cell() for _ in range(number_of_layers)])

initial_state = state = stacked_lstm.zero_state(batch_size, tf.float32)
for i in range(num_steps):
    # The value of state is updated after processing each batch of words.
    output, state = stacked_lstm(words[:, i], state)

    # The rest of the code.
    # ...

final_state = state

Run the Code

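Before running the code, download the PTB dataset as described at the beginning of this tutorial. Then extract the PTB dataset under your home directory as follows: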

tar xvfz simple-examples.tgz -C $HOME

(Note: On Windows, you may need to use other tools.)

Now, clone the TensorFlow models repo from GitHub. Run the following commands:

cd models/tutorials/rnn/ptb
python ptb_word_lm.py --data_path=$HOME/simple-examples/data/ --model=small

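There are 3 supported model configurations in the tutorial code: "small", "medium" and "large". The difference between them is the size of the LSTMs and the set of hyperparameters used for training.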

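The larger the model, the better the results it should get. The small model should be able to reach a perplexity below 120 on the test set and the large one below 80, though it might take several hours to train.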

What Next?

There are several tricks that we have not mentioned that make the model better, including:

a decreasing learning rate schedule,

dropout between the LSTM layers.
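As an illustration of the first trick, one common form of a decreasing schedule holds the learning rate constant for a few epochs and then shrinks it by a fixed factor every epoch. The sketch below is a hypothetical example of that idea; the function name and the values of base_lr, decay and hold_epochs are assumptions and are not taken from the tutorial's configurations.

```python
def learning_rate_for_epoch(epoch, base_lr, decay, hold_epochs):
    """Hold base_lr for hold_epochs epochs, then multiply by decay each epoch."""
    return base_lr * decay ** max(epoch + 1 - hold_epochs, 0)

# Epochs are counted from 0; the values below are illustrative only.
schedule = [learning_rate_for_epoch(e, base_lr=1.0, decay=0.5, hold_epochs=4)
            for e in range(7)]
print(schedule)
```

Large early steps let the model make fast progress, while the smaller later steps help it settle instead of oscillating around a minimum.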

Study the code and modify it to improve the model even further.