ANN vs RNN(LSTM) Performance Test
Machine learning experiments with Tensorflow
Since Google released Tensorflow as open source, machine learning has become far more accessible. Of course, tools such as Keras and Theano existed before Tensorflow. But because a company as influential as Google distributed the tool for free through a relatively approachable language, Python, and laid the groundwork for researchers around the world to improve the code together, Tensorflow's popularity looks set to keep growing ahead of the other tools. Admittedly, the library is still young, so frequent updates keep moving functions around and code has to be revised often, but Tensorflow offers more than enough convenience to make that inconvenience worth tolerating. Following this trend, I recently ported the machine learning code I had been using in Theano to Tensorflow and have been running various experiments with it. The results posted here came out of simple curiosity, or out of the process of developing a research topic, so please read them casually.
Comparison of English character recognition performance between ANN and RNN(LSTM).
Hyungwon Yang
04.17.17
NAMZ Labs
Task
Compare the performance achieved by two machine learning techniques: ANN and RNN(LSTM).
Train both models on the preprocessed character-level corpus, the Project Gutenberg ebook The Divine Comedy, Complete, and test their performance.
Training Corpus
Project Gutenberg’s The Divine Comedy, Complete, by Dante Alighieri
This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. A part of this corpus was extracted for training.
Experimental Setting.
- Python 3.5.3
- Tensorflow 1.0.0
- macOS Sierra 10.12.4
Data Preprocessing.
Notes on how the data was cleaned and prepared.
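The preprocessing code below starts from files that were prepared in advance (pg8800_train, pg8800_words, pg8800_train_chars, pg8800_char_list, and so on). As a rough illustration only, the raw ebook text could be reduced to space-separated character tokens with a sketch like the following; the file name pg8800.txt, the character filtering, and the 170,000/33,000 split are assumptions for illustration, not the script that actually produced the files.
# Hedged sketch: turning the raw Gutenberg text into character tokens.
import re

with open('pg8800.txt', 'r') as f:
    raw = f.read().lower()
raw = re.sub(r"[^a-z .,;:!?'\n-]", '', raw)          # keep a small character set
raw = re.sub(r'\s+', ' ', raw)                       # collapse whitespace
chars = ['<space>' if c == ' ' else c for c in raw]  # give the space its own symbol
char_list = sorted(set(chars))                       # lookup table, one entry per character

with open('train_data/pg8800_char_list', 'w') as f:
    f.write('\n'.join(char_list))
with open('train_data/pg8800_train_chars', 'w') as f:
    f.write(' '.join(chars[:170000]))                # training portion
with open('train_data/pg8800_test_chars', 'w') as f:
    f.write(' '.join(chars[170000:203000]))          # test portion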
'''
Data preprocessing.
This part shows how I preprocessed the text data.
You don't have to run this code because this process is time consuming.
If you want to download the data file, find and run downloader.py in a HY_python_NN folder.
'''
# Data preparation for ANN & LSTM performance comparison.
import re
import numpy as np
with open('train_data/pg8800_train','r') as train_n:
    train_ngram = np.loadtxt(train_n.readlines(), dtype=int)
with open('train_data/pg8800_test','r') as test_n:
    test_ngram = np.loadtxt(test_n.readlines(), dtype=int)
with open('train_data/pg8800_words','r') as look_w:
    lookup = look_w.readlines()
lookup_words = []
for string in lookup:
    lookup_words.append(re.sub('\n', '', string))
vocab_size = len(lookup_words)
train_data_size = len(train_ngram)
test_data_size = len(test_ngram)
############################################################################
### word level
# First method: concatenate the 3-gram one-hot vectors, (1*5000) * 3 = 1 * 15000.
# This part is not written as code here.
# Second method (used below): merge the 3-gram into one vocabulary-sized vector.
## ANN
ann_train_inputs = np.zeros((train_data_size,vocab_size))
ann_train_outputs = np.zeros((train_data_size,vocab_size))
box = 0
for idx in train_ngram:
    ann_train_inputs[box][idx[0:3]] = 1
    ann_train_outputs[box][idx[-1]] = 1
    box += 1
# test_inputs
ann_test_inputs = np.zeros((test_data_size,vocab_size))
ann_test_outputs = np.zeros((test_data_size,vocab_size))
box = 0
for idx in test_ngram:
    ann_test_inputs[box][idx[0:3]] = 1
    ann_test_outputs[box][idx[-1]] = 1  # fixed: this previously wrote into ann_train_outputs
    box += 1
## LSTM
timeStep = 3
# dictionary list.
# Be cautious. This takes a lot of time to generate data.
word_box = np.identity(len(lookup_words),dtype=int)
input_box = np.zeros((timeStep,vocab_size))
lstm_train_inputs = np.empty((1,timeStep,vocab_size))
lstm_train_outputs = np.empty((1,timeStep,vocab_size))
con = 0
for idx in train_ngram:
    input_box = np.zeros((timeStep, vocab_size))
    output_box = np.zeros((timeStep, vocab_size))
    for input in list(range(timeStep)):
        input_box[input][idx[input]] = 1
    for output in list(range(timeStep)):
        output_box[output][idx[output+1]] = 1
    lstm_train_inputs = np.append(lstm_train_inputs, [input_box], axis=0)
    lstm_train_outputs = np.append(lstm_train_outputs, [output_box], axis=0)
    con += 1
    if con % 500 == 0:
        print('{} / {} is completed'.format(con, train_data_size))
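# A preallocated alternative (a sketch, not part of the original experiment).
# np.append copies the whole array on every iteration, which is why the step
# above is slow, and np.empty((1, ...)) leaves an uninitialized first row.
# Assumes the same train_ngram, train_data_size, timeStep, and vocab_size.
lstm_train_inputs = np.zeros((train_data_size, timeStep, vocab_size))
lstm_train_outputs = np.zeros((train_data_size, timeStep, vocab_size))
for con, idx in enumerate(train_ngram):
    for t in range(timeStep):
        lstm_train_inputs[con][t][idx[t]] = 1        # current word (one-hot)
        lstm_train_outputs[con][t][idx[t + 1]] = 1   # next word as the target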
############################################################################
### char level
# data preprocessing.
# import data.
with open('train_data/pg8800_train_chars','r') as train_n:
    train_char = train_n.readlines()
tmp_train = train_char[0].split(' ')
train_split_char = tmp_train[0:170000]
with open('train_data/pg8800_test_chars','r') as test_n:
    test_char = test_n.readlines()
tmp_test = test_char[0].split(' ')
test_split_char = tmp_test[0:33000]  # fixed: this previously sliced tmp_train again
with open('train_data/pg8800_char_list','r') as look_w:
    lookup = look_w.readlines()
lookup_chars = []
for string in lookup:
    lookup_chars.append(re.sub('\n', '', string))
vocab_size = len(lookup_chars)
train_data_size = len(train_split_char)
test_data_size = len(test_split_char)
# data digitizing.
## ANN
# train data
ann_train_input_char = np.zeros((train_data_size-1,vocab_size))
ann_train_output_char = np.zeros((train_data_size-1,vocab_size))
for dat in list(range(train_data_size-1)):
    input_sym = train_split_char[dat]
    output_sym = train_split_char[dat+1]
    input_idx = lookup_chars.index(input_sym)
    output_idx = lookup_chars.index(output_sym)
    ann_train_input_char[dat][input_idx] = 1
    ann_train_output_char[dat][output_idx] = 1
# test data
ann_test_input_char = np.zeros((test_data_size - 1, vocab_size))
ann_test_output_char = np.zeros((test_data_size - 1, vocab_size))
for dat in list(range(test_data_size - 1)):
    input_sym = test_split_char[dat]
    output_sym = test_split_char[dat + 1]
    input_idx = lookup_chars.index(input_sym)
    output_idx = lookup_chars.index(output_sym)
    ann_test_input_char[dat][input_idx] = 1
    ann_test_output_char[dat][output_idx] = 1
np.savez('train_data/pg8800_ann_char_data',
         train_input=ann_train_input_char, train_output=ann_train_output_char,
         test_input=ann_test_input_char, test_output=ann_test_output_char)
## LSTM
timeStep = 20  # time steps for the character-level data (170,000 / 20 = 8,500 training sequences)
lstm_train_input_char = np.zeros((int(train_data_size/timeStep), timeStep, vocab_size))
lstm_train_output_char = np.zeros((int(train_data_size/timeStep), timeStep, vocab_size))
his = 0
for dat in list(range(int(train_data_size/timeStep)-1)):
    for times in list(range(timeStep)):
        input_sym = train_split_char[his]
        output_sym = train_split_char[his+1]
        input_idx = lookup_chars.index(input_sym)
        output_idx = lookup_chars.index(output_sym)
        lstm_train_input_char[dat][times][input_idx] = 1
        lstm_train_output_char[dat][times][output_idx] = 1
        his += 1
# test data
lstm_test_input_char = np.zeros((int(test_data_size/timeStep),timeStep,vocab_size))
lstm_test_output_char = np.zeros((int(test_data_size/timeStep),timeStep,vocab_size))
his = 0
for dat in list(range(int(test_data_size/timeStep)-1)):
    for times in list(range(timeStep)):
        input_sym = test_split_char[his]
        output_sym = test_split_char[his+1]
        input_idx = lookup_chars.index(input_sym)
        output_idx = lookup_chars.index(output_sym)
        lstm_test_input_char[dat][times][input_idx] = 1
        lstm_test_output_char[dat][times][output_idx] = 1
        his += 1
np.savez('train_data/pg8800_lstm_char_data',
         train_input=lstm_train_input_char, train_output=lstm_train_output_char,
         test_input=lstm_test_input_char, test_output=lstm_test_output_char)
ANN training.
- Comments
  - Training data: 170,000 * 38 (# of examples, # of input features)
  - Test data: 33,000 * 38 (# of examples, # of input features)
  - Although a validation set was not used in the main experiment, for this report 20% of the training data (34,000 examples) was set aside as a validation set in order to show how accuracy changes. The validation accuracy (the output character predicted for each input character) is printed as the epochs progress.
- Parameters
  - Epoch: 200 (fixed)
  - The number of hidden layers: 1 (fixed)
  - The number of hidden units: 50, 100, 200
  - Learning rate: 0.001
  - Cost function: AdamOptimizer
import sys
# HY_python_NN absolute directory.
my_absdir = "/Users/hyungwonyang/Google_Drive/Python/HY_python_NN"
sys.path.append(my_absdir)
import numpy as np
import main.setvalues as set
import main.dnnnetworkmodels as net
# import data.
# data directory.
ann_data = np.load(my_absdir+'/train_data/pg8800_ann_char_data.npz')
train_input = ann_data['train_input']
train_output = ann_data['train_output']
test_input = ann_data['test_input']
test_output = ann_data['test_output']
vocab_size = train_input.shape[1]
train_data_size = train_input.shape[0]
test_data_size = test_input.shape[0]
# parameter setting.
fineTrainEpoch = 20
fineLearningRate = 0.001
learningRateDecay = 'off' # on, off
batchSize = 100
hiddenLayers = [200]
problem = 'classification' # classification, regression
hiddenFunction= 'tanh'
costFunction = 'adam' # gradient, adam
validationCheck = 'on' # if validationCheck is on, then 20% of train data will be taken for validation.
PlotGraph = 'off' # If this is on, graph will be saved in the rnn_graph directory.
# You can check the DNN structure on TensorBoard.
DNN_values = set.setParam(inputData=train_input,
targetData=train_output,
hiddenUnits=hiddenLayers
)
# Setting hidden layers: weightMatrix and biasMatrix
weightMatrix = DNN_values.genWeight()
biasMatrix = DNN_values.genBias()
# Generating input symbols.
input_x, input_y = DNN_values.genSymbol()
dnn = net.DNNmodel(inputSymbol=input_x,
outputSymbol=input_y,
problem=problem,
fineTrainEpoch=fineTrainEpoch,
fineLearningRate=fineLearningRate,
learningRateDecay=learningRateDecay,
batchSize=batchSize,
hiddenFunction=hiddenFunction,
costFunction=costFunction,
validationCheck=validationCheck,
weightMatrix=weightMatrix,
biasMatrix=biasMatrix
)
# Generate a DNN network.
dnn.genDNN()
# Train the DNN network.
# In this tutorial, we will run only 20 epochs.
dnn.trainDNN(train_input,train_output)
Epoch: 1 / 20, Cost : 2.680907, Validation Accuracy: 28.74%
Epoch: 2 / 20, Cost : 2.394805, Validation Accuracy: 29.00%
Epoch: 3 / 20, Cost : 2.373727, Validation Accuracy: 29.00%
Epoch: 4 / 20, Cost : 2.366924, Validation Accuracy: 29.00%
Epoch: 5 / 20, Cost : 2.363674, Validation Accuracy: 28.89%
Epoch: 6 / 20, Cost : 2.361691, Validation Accuracy: 28.89%
Epoch: 7 / 20, Cost : 2.360365, Validation Accuracy: 28.89%
Epoch: 8 / 20, Cost : 2.359450, Validation Accuracy: 28.89%
Epoch: 9 / 20, Cost : 2.358752, Validation Accuracy: 28.89%
Epoch: 10 / 20, Cost : 2.358191, Validation Accuracy: 28.89%
Epoch: 11 / 20, Cost : 2.357709, Validation Accuracy: 28.89%
Epoch: 12 / 20, Cost : 2.357276, Validation Accuracy: 28.74%
Epoch: 13 / 20, Cost : 2.356880, Validation Accuracy: 28.74%
Epoch: 14 / 20, Cost : 2.356511, Validation Accuracy: 28.74%
Epoch: 15 / 20, Cost : 2.356167, Validation Accuracy: 28.74%
Epoch: 16 / 20, Cost : 2.355842, Validation Accuracy: 28.74%
Epoch: 17 / 20, Cost : 2.355536, Validation Accuracy: 28.74%
Epoch: 18 / 20, Cost : 2.355244, Validation Accuracy: 28.74%
Epoch: 19 / 20, Cost : 2.354967, Validation Accuracy: 28.74%
Epoch: 20 / 20, Cost : 2.354703, Validation Accuracy: 28.74%
The model has been trained successfully.
# Test the trained DNN network.
dnn.testDNN(test_input,test_output)
Tested with 32999 datasets.
Test Accuracy: 28.09 %
# Save the trained parameters.
vars = dnn.getVariables()
# Terminate the session.
dnn.closeDNN()
DNN training session is terminated.
RNN(LSTM) training.
- Comments
  - Training data: 8,500 * 20 * 38 (# of examples, # of time steps, # of input features)
  - Test data: 1,650 * 20 * 38 (# of examples, # of time steps, # of input features)
  - Although a validation set was not used in the main experiment, for this report 20% of the training data (1,700 sequences) was set aside as a validation set in order to show how accuracy changes. The validation accuracy (the output character predicted for each input character) is printed as the epochs progress.
- Parameters
  - Epoch: 200 (fixed)
  - The number of hidden layers: 1 (fixed)
  - The number of hidden units: 50, 100, 200
  - Learning rate: 0.001
  - Cost function: AdamOptimizer
import numpy as np
import main.setvalues as set
import main.rnnnetworkmodels as net
# import data.
# data directory.
lstm_data = np.load(my_absdir+'/train_data/pg8800_lstm_char_data.npz')
train_input = lstm_data['train_input']
train_output = lstm_data['train_output']
test_input = lstm_data['test_input']
test_output = lstm_data['test_output']
# parameters
problem = 'classification' # classification, regression
rnnCell = 'lstm' # rnn, lstm, gru
trainEpoch = 20
learningRate = 0.001
learningRateDecay = 'off' # on, off
batchSize = 100
dropout = 'off' # on, off
hiddenLayers = [200]
timeStep = 20
costFunction = 'adam' # gradient, adam
validationCheck = 'on' # if validationCheck is on, then 20% of train data will be taken for validation.
lstm_values = set.RNNParam(inputData=train_input,
targetData=train_output,
timeStep=timeStep,
hiddenUnits=hiddenLayers
)
# Setting hidden layers: weightMatrix and biasMatrix
lstm_weightMatrix = lstm_values.genWeight()
lstm_biasMatrix = lstm_values.genBias()
lstm_input_x,lstm_input_y = lstm_values.genSymbol()
lstm_net = net.RNNModel(inputSymbol=lstm_input_x,
outputSymbol=lstm_input_y,
rnnCell=rnnCell,
problem=problem,
hiddenLayer=hiddenLayers,
trainEpoch=trainEpoch,
learningRate=learningRate,
learningRateDecay=learningRateDecay,
timeStep=timeStep,
batchSize=batchSize,
dropout=dropout,
validationCheck=validationCheck,
weightMatrix=lstm_weightMatrix,
biasMatrix=lstm_biasMatrix)
# Generate a RNN(lstm) network.
lstm_net.genRNN()
########## RNN Setting #########
Task : classification
Cell Type : lstm
Hidden Layers : 1
Hidden Units : [200]
Train Epoch : 20
Learning Rate : 0.001
Time Steps : 20
Batch Size : 100
Drop Out : off
Validation : on
########## RNN Setting #########
RNN structure is generated.
# Train the RNN(lstm) network.
# In this tutorial, we will run only 20 epochs.
lstm_net.trainRNN(train_input,train_output)
Activating training process.
Epoch: 1 / 20, Cost : 2.838343, Validation Accuracy: 31.55%
Epoch: 2 / 20, Cost : 2.412195, Validation Accuracy: 34.15%
Epoch: 3 / 20, Cost : 2.281162, Validation Accuracy: 35.16%
Epoch: 4 / 20, Cost : 2.212879, Validation Accuracy: 36.22%
Epoch: 5 / 20, Cost : 2.165706, Validation Accuracy: 37.12%
Epoch: 6 / 20, Cost : 2.124667, Validation Accuracy: 38.01%
Epoch: 7 / 20, Cost : 2.086630, Validation Accuracy: 38.89%
Epoch: 8 / 20, Cost : 2.051740, Validation Accuracy: 39.71%
Epoch: 9 / 20, Cost : 2.020120, Validation Accuracy: 40.24%
Epoch: 10 / 20, Cost : 1.991863, Validation Accuracy: 40.69%
Epoch: 11 / 20, Cost : 1.966709, Validation Accuracy: 41.18%
Epoch: 12 / 20, Cost : 1.944162, Validation Accuracy: 41.61%
Epoch: 13 / 20, Cost : 1.923527, Validation Accuracy: 41.90%
Epoch: 14 / 20, Cost : 1.903926, Validation Accuracy: 42.26%
Epoch: 15 / 20, Cost : 1.885284, Validation Accuracy: 42.67%
Epoch: 16 / 20, Cost : 1.867924, Validation Accuracy: 43.06%
Epoch: 17 / 20, Cost : 1.851544, Validation Accuracy: 43.36%
Epoch: 18 / 20, Cost : 1.835906, Validation Accuracy: 43.65%
Epoch: 19 / 20, Cost : 1.820853, Validation Accuracy: 43.90%
Epoch: 20 / 20, Cost : 1.806262, Validation Accuracy: 44.12%
The model has been trained successfully.
# Test the trained RNN(lstm) network.
lstm_net.testRNN(test_input,test_output)
Activating Testing Process
Tested with 1650 datasets.
Test Accuracy: 46.23 %
# Save the trained parameters.
vars = lstm_net.getVariables()
# Terminate the session.
lstm_net.closeRNN()
RNN training session is terminated.
Result
In the code above, the experiment was limited to the case of 200 hidden units, but the actual experiment was run with 50, 100, and 200 hidden units, and the results are summarized below.
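A minimal sketch of how that sweep could be scripted, reusing the setParam / DNNmodel API shown in the ANN section; it assumes the variables and imports from that section (main.dnnnetworkmodels as net, the ANN character data, and the listed hyperparameters), which would need to be re-bound if the RNN section was run in between.
# Hedged sketch: sweep the number of hidden units with the same API as above.
for units in [50, 100, 200]:
    values = set.setParam(inputData=train_input,
                          targetData=train_output,
                          hiddenUnits=[units])
    weightMatrix = values.genWeight()
    biasMatrix = values.genBias()
    input_x, input_y = values.genSymbol()
    dnn = net.DNNmodel(inputSymbol=input_x, outputSymbol=input_y,
                       problem=problem, fineTrainEpoch=fineTrainEpoch,
                       fineLearningRate=fineLearningRate,
                       learningRateDecay=learningRateDecay, batchSize=batchSize,
                       hiddenFunction=hiddenFunction, costFunction=costFunction,
                       validationCheck=validationCheck,
                       weightMatrix=weightMatrix, biasMatrix=biasMatrix)
    dnn.genDNN()
    dnn.trainDNN(train_input, train_output)
    dnn.testDNN(test_input, test_output)   # prints the test accuracy for this setting
    dnn.closeDNN()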
- As the table shows, ANN fails to train properly regardless of the number of hidden units, whereas the accuracy of RNN(LSTM) rises as the number of hidden units increases. (The raw error of ANN barely changes, while that of RNN(LSTM) decreases steadily. In terms of accuracy, ANN even drops slightly, while RNN(LSTM) keeps improving.)
- Since the failure of ANN training may be due to how the training data was prepared (method 1), I plan to prepare the training data in a different way (method 2) and retrain.
  - Training-data preparation method 1 (current method): given 4-gram data (e.g., 'I', 'want', 'to', 'go' from "I want to go to school"), each word is one-hot coded against the vocabulary size (e.g., with a vocab size of 1000, 'I' becomes a 1*1000 one-hot vector), and the 1s of the context words are then merged into a single vocabulary-sized vector, which is used as the training input. (e.g., if 'I', 'want', 'to' are coded at positions 1, 2, 3 of the vocabulary as [1,0,0,…,0], [0,1,0,0,…,0], [0,0,1,0,…,0], they are merged into [1,1,1,0,…,0].) Because three words share one vocabulary-sized vector, the size of the training data does not grow with the n-gram size.
  - Training-data preparation method 2 (additional method): given 4-gram data (e.g., 'I', 'want', 'to', 'go' from "I want to go to school"), each word is one-hot coded against the vocabulary size (e.g., with a vocab size of 1000, 'I' becomes a 1*1000 one-hot vector), and the one-hot vectors are then concatenated and used as the training input. (e.g., if 'I', 'want', 'to' are coded at positions 1, 2, 3 as [1,0,0,…,0], [0,1,0,0,…,0], [0,0,1,0,…,0], concatenating them produces a 1*3000 training vector.) The size of the training data grows in proportion to the n-gram size. A minimal sketch of this method appears after the results table below.
- This experiment showed that for text data, where the current value depends on the previous values, better training results can be obtained with RNN(LSTM), which learns temporal information.
| Model | Hidden Units | Accuracy |
|---|---|---|
| ANN | 50 | 28.81% |
| ANN | 100 | 28.86% |
| ANN | 200 | 28.86% |
| RNN(LSTM) | 50 | 49.76% |
| RNN(LSTM) | 100 | 56.54% |
| RNN(LSTM) | 200 | 72.59% |
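As referenced in the discussion of data-preparation method 2 above, here is a minimal sketch of the concatenation approach at the word level. It assumes the word-level train_ngram, train_data_size, and vocab_size values from the preprocessing code (before vocab_size is re-bound for the character-level data), and it is an illustration rather than code that was actually run for this report.
# Hedged sketch of preparation method 2: concatenate the one-hot vectors of the
# three context words instead of merging them into one vocabulary-sized vector.
import numpy as np

concat_train_inputs = np.zeros((train_data_size, vocab_size * 3))
concat_train_outputs = np.zeros((train_data_size, vocab_size))
for box, idx in enumerate(train_ngram):
    for pos in range(3):
        # each context word occupies its own vocab_size-wide slice of the input
        concat_train_inputs[box][pos * vocab_size + idx[pos]] = 1
    concat_train_outputs[box][idx[-1]] = 1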
Further Experiment.
- Prepare word-level data and run the same experiment.
- Prepare the dataset with data-preparation method 2 and rerun the ANN experiment.
- Compare the performance of an RNN built with a traditional RNN cell against one built with an LSTM cell (a minimal sketch of switching the cell type follows below).
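For the last item, a minimal sketch of switching the cell type with the RNNModel API used above; per the parameter comment in the LSTM section, rnnCell accepts 'rnn', 'lstm', or 'gru', and the remaining settings are assumed to stay at the values used there.
# Hedged sketch: compare a traditional RNN cell against an LSTM cell by
# switching the rnnCell argument of the same RNNModel API used above.
for cell in ['rnn', 'lstm']:
    values = set.RNNParam(inputData=train_input, targetData=train_output,
                          timeStep=timeStep, hiddenUnits=hiddenLayers)
    weightMatrix = values.genWeight()
    biasMatrix = values.genBias()
    input_x, input_y = values.genSymbol()
    model = net.RNNModel(inputSymbol=input_x, outputSymbol=input_y,
                         rnnCell=cell, problem=problem, hiddenLayer=hiddenLayers,
                         trainEpoch=trainEpoch, learningRate=learningRate,
                         learningRateDecay=learningRateDecay, timeStep=timeStep,
                         batchSize=batchSize, dropout=dropout,
                         validationCheck=validationCheck,
                         weightMatrix=weightMatrix, biasMatrix=biasMatrix)
    model.genRNN()
    model.trainRNN(train_input, train_output)
    model.testRNN(test_input, test_output)   # prints the test accuracy for this cell type
    model.closeRNN()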
Github Code
Go to the following GitHub repository and find the reports folder. You can run char_ANN_LSTM.py to reproduce the experiment.
Reference
- https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/dynamic_rnn.py
- https://danijar.com/introduction-to-recurrent-networks-in-tensorflow/
- https://github.com/jaekookang/useful_bits/blob/dev/Machine_Learning/RNN_LSTM/predict_character/rnn_char.ipynb