I am going through the following memory network code, which uses Keras on the bAbI dataset (the Keras "Memory network on the bAbI dataset" example).

  '''Trains a memory network on the bAbI dataset.

  References:
  - Jason Weston, Antoine Bordes, Sumit Chopra, Tomas Mikolov, Alexander M. Rush,
    "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks",
    http://arxiv.org/abs/1502.05698
  - Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus,
    "End-To-End Memory Networks",
    http://arxiv.org/abs/1503.08895

  Reaches 98.6% accuracy on task 'single_supporting_fact_10k' after 120 epochs.
  Time per epoch: 3s on CPU (core i7).
  '''
  from __future__ import print_function

  from keras.models import Sequential, Model
  from keras.layers.embeddings import Embedding
  from keras.layers import Input, Activation, Dense, Permute, Dropout, add, dot, concatenate
  from keras.layers import LSTM
  from keras.utils.data_utils import get_file
  from keras.preprocessing.sequence import pad_sequences
  from functools import reduce
  import tarfile
  import numpy as np
  import re


  def tokenize(sent):
      '''Return the tokens of a sentence including punctuation.
      >>> tokenize('Bob dropped the apple. Where is the apple?')
      ['Bob', 'dropped', 'the', 'apple', '.', 'Where', 'is', 'the', 'apple', '?']
      '''
      return [x.strip() for x in re.split(r'(\W+)', sent) if x.strip()]


  def parse_stories(lines, only_supporting=False):
      '''Parse stories provided in the bAbI tasks format.
      If only_supporting is true, only the sentences
      that support the answer are kept.
      '''
      data = []
      story = []
      for line in lines:
          line = line.decode('utf-8').strip()
          nid, line = line.split(' ', 1)
          nid = int(nid)
          if nid == 1:
              story = []
          if '\t' in line:
              q, a, supporting = line.split('\t')
              q = tokenize(q)
              substory = None
              if only_supporting:
                  # Only select the related substory
                  supporting = map(int, supporting.split())
                  substory = [story[i - 1] for i in supporting]
              else:
                  # Provide all the substories
                  substory = [x for x in story if x]
              data.append((substory, q, a))
              story.append('')
          else:
              sent = tokenize(line)
              story.append(sent)
      return data


  def get_stories(f, only_supporting=False, max_length=None):
      '''Given a file name, read the file,
      retrieve the stories,
      and then convert the sentences into a single story.
      If max_length is supplied,
      any stories longer than max_length tokens will be discarded.
      '''
      data = parse_stories(f.readlines(), only_supporting=only_supporting)
      flatten = lambda data: reduce(lambda x, y: x + y, data)
      data = [(flatten(story), q, answer) for story, q, answer in data
              if not max_length or len(flatten(story)) < max_length]
      return data


  def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
      X = []
      Xq = []
      Y = []
      for story, query, answer in data:
          x = [word_idx[w] for w in story]
          xq = [word_idx[w] for w in query]
          # let's not forget that index 0 is reserved
          y = np.zeros(len(word_idx) + 1)
          y[word_idx[answer]] = 1
          X.append(x)
          Xq.append(xq)
          Y.append(y)
      return (pad_sequences(X, maxlen=story_maxlen),
              pad_sequences(Xq, maxlen=query_maxlen), np.array(Y))


  try:
      path = get_file('babi-tasks-v1-2.tar.gz',
                      origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
  except:
      print('Error downloading dataset, please download it manually:\n'
            '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
            '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
      raise
  tar = tarfile.open(path)

  challenges = {
      # QA1 with 10,000 samples
      'single_supporting_fact_10k': 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt',
      # QA2 with 10,000 samples
      'two_supporting_facts_10k': 'tasks_1-20_v1-2/en-10k/qa2_two-supporting-facts_{}.txt',
  }
  challenge_type = 'single_supporting_fact_10k'
  challenge = challenges[challenge_type]

  print('Extracting stories for the challenge:', challenge_type)
  train_stories = get_stories(tar.extractfile(challenge.format('train')))
  test_stories = get_stories(tar.extractfile(challenge.format('test')))

  vocab = set()
  for story, q, answer in train_stories + test_stories:
      vocab |= set(story + q + [answer])
  vocab = sorted(vocab)

  # Reserve 0 for masking via pad_sequences
  vocab_size = len(vocab) + 1
  story_maxlen = max(map(len, (x for x, _, _ in train_stories + test_stories)))
  query_maxlen = max(map(len, (x for _, x, _ in train_stories + test_stories)))

  print('-')
  print('Vocab size:', vocab_size, 'unique words')
  print('Story max length:', story_maxlen, 'words')
  print('Query max length:', query_maxlen, 'words')
  print('Number of training stories:', len(train_stories))
  print('Number of test stories:', len(test_stories))
  print('-')
  print('Here\'s what a "story" tuple looks like (input, query, answer):')
  print(train_stories[0])
  print('-')
  print('Vectorizing the word sequences...')

  word_idx = dict((c, i + 1) for i, c in enumerate(vocab))
  inputs_train, queries_train, answers_train = vectorize_stories(train_stories,
                                                                 word_idx,
                                                                 story_maxlen,
                                                                 query_maxlen)
  inputs_test, queries_test, answers_test = vectorize_stories(test_stories,
                                                              word_idx,
                                                              story_maxlen,
                                                              query_maxlen)

  print('-')
  print('inputs: integer tensor of shape (samples, max_length)')
  print('inputs_train shape:', inputs_train.shape)
  print('inputs_test shape:', inputs_test.shape)
  print('-')
  print('queries: integer tensor of shape (samples, max_length)')
  print('queries_train shape:', queries_train.shape)
  print('queries_test shape:', queries_test.shape)
  print('-')
  print('answers: binary (1 or 0) tensor of shape (samples, vocab_size)')
  print('answers_train shape:', answers_train.shape)
  print('answers_test shape:', answers_test.shape)
  print('-')
  print('Compiling...')

  # placeholders
  input_sequence = Input((story_maxlen,))
  question = Input((query_maxlen,))

  # encoders
  # embed the input sequence into a sequence of vectors
  input_encoder_m = Sequential()
  input_encoder_m.add(Embedding(input_dim=vocab_size,
                                output_dim=64))
  input_encoder_m.add(Dropout(0.3))
  # output: (samples, story_maxlen, embedding_dim)

  # embed the input into a sequence of vectors of size query_maxlen
  input_encoder_c = Sequential()
  input_encoder_c.add(Embedding(input_dim=vocab_size,
                                output_dim=query_maxlen))
  input_encoder_c.add(Dropout(0.3))
  # output: (samples, story_maxlen, query_maxlen)

  # embed the question into a sequence of vectors
  question_encoder = Sequential()
  question_encoder.add(Embedding(input_dim=vocab_size,
                                 output_dim=64,
                                 input_length=query_maxlen))
  question_encoder.add(Dropout(0.3))
  # output: (samples, query_maxlen, embedding_dim)

  # encode input sequence and questions (which are indices)
  # to sequences of dense vectors
  input_encoded_m = input_encoder_m(input_sequence)
  input_encoded_c = input_encoder_c(input_sequence)
  question_encoded = question_encoder(question)

  # compute a 'match' between the first input vector sequence
  # and the question vector sequence
  # shape: `(samples, story_maxlen, query_maxlen)`
  match = dot([input_encoded_m, question_encoded], axes=(2, 2))
  match = Activation('softmax')(match)

  # add the match matrix with the second input vector sequence
  response = add([match, input_encoded_c])  # (samples, story_maxlen, query_maxlen)
  response = Permute((2, 1))(response)  # (samples, query_maxlen, story_maxlen)

  # concatenate the match matrix with the question vector sequence
  answer = concatenate([response, question_encoded])

  # the original paper uses a matrix multiplication for this reduction step.
  # we choose to use a RNN instead.
  answer = LSTM(32)(answer)  # (samples, 32)

  # one regularization layer -- more would probably be needed.
  answer = Dropout(0.3)(answer)
  answer = Dense(vocab_size)(answer)  # (samples, vocab_size)
  # we output a probability distribution over the vocabulary
  answer = Activation('softmax')(answer)

  # build the final model
  model = Model([input_sequence, question], answer)
  model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                metrics=['accuracy'])

  # train
  model.fit([inputs_train, queries_train], answers_train,
            batch_size=32,
            epochs=120,
            validation_data=([inputs_test, queries_test], answers_test))

Here is my understanding of the model-creation part -

After creating dense vectors for the story and the question with the code below -

  input_encoded_m = input_encoder_m(input_sequence)
  input_encoded_c = input_encoder_c(input_sequence)
  question_encoded = question_encoder(question)

the outputs will have the shapes below:

input_encoded_m will have shape (samples, story_maxlen, embedding_dim), input_encoded_c will have shape (samples, story_maxlen, query_maxlen), and question_encoded will have shape (samples, query_maxlen, embedding_dim).

input_encoded_m and input_encoded_c are the same input embedded with two different output dimensions, namely 64 (embedding_dim) and 4 (query_maxlen), and question_encoded is the embedded question.
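
To make these shapes concrete, here is a minimal numpy sketch of the three embedding lookups (the sizes below are assumed toy values, not the ones computed from the dataset):

  import numpy as np

  # toy sizes for illustration only
  vocab_size, story_maxlen, query_maxlen, embedding_dim = 20, 10, 4, 64

  emb_m = np.random.rand(vocab_size, embedding_dim)  # input_encoder_m weights
  emb_c = np.random.rand(vocab_size, query_maxlen)   # input_encoder_c weights
  emb_q = np.random.rand(vocab_size, embedding_dim)  # question_encoder weights

  story = np.random.randint(1, vocab_size, (1, story_maxlen))  # one sample of word indices
  query = np.random.randint(1, vocab_size, (1, query_maxlen))

  input_encoded_m = emb_m[story]   # (1, story_maxlen, embedding_dim) = (1, 10, 64)
  input_encoded_c = emb_c[story]   # (1, story_maxlen, query_maxlen)  = (1, 10, 4)
  question_encoded = emb_q[query]  # (1, query_maxlen, embedding_dim) = (1, 4, 64)
  print(input_encoded_m.shape, input_encoded_c.shape, question_encoded.shape)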

Now the part below computes a match between the words in the story and in the question, and applies a softmax activation to the output, which means the matching words are identified -

  match = dot([input_encoded_m, question_encoded], axes=(2, 2))
  match = Activation('softmax')(match)
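
For intuition, here is a minimal numpy sketch of this step with assumed toy shapes; the softmax here normalizes over the last axis, which is Keras' default for Activation('softmax'):

  import numpy as np

  samples, story_maxlen, query_maxlen, embedding_dim = 1, 10, 4, 64
  input_encoded_m = np.random.rand(samples, story_maxlen, embedding_dim)
  question_encoded = np.random.rand(samples, query_maxlen, embedding_dim)

  # dot over the embedding axes: one similarity score per (story word, query word) pair
  scores = np.einsum('bse,bqe->bsq', input_encoded_m, question_encoded)

  # softmax turns the scores into attention-style weights
  e = np.exp(scores - scores.max(axis=-1, keepdims=True))
  match = e / e.sum(axis=-1, keepdims=True)
  print(match.shape)  # (samples, story_maxlen, query_maxlen) = (1, 10, 4)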

I am not clear on why the same input, embedded differently, is added to the match matrix from the step above. The comment says "second input vector sequence", but we are not dealing with a second input yet. I am not able to understand this. Any help?

  # add the match matrix with the second input vector sequence
  response = add([match, input_encoded_c])  # (samples, story_maxlen, query_maxlen)
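
For shape reference, a minimal numpy sketch of the add with assumed toy sizes (input_encoded_c is the same story indices looked up in a second embedding matrix, which is what the comment calls the "second input vector sequence"):

  import numpy as np

  samples, story_maxlen, query_maxlen = 1, 10, 4
  match = np.random.rand(samples, story_maxlen, query_maxlen)            # attention weights
  input_encoded_c = np.random.rand(samples, story_maxlen, query_maxlen)  # 2nd embedding of the story

  # elementwise add, as in the Keras example (the paper uses a weighted sum instead)
  response = match + input_encoded_c
  print(response.shape)  # (1, 10, 4)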

Also, what does permuting the output of the above step do in this context?

  response = Permute((2, 1))(response)  # (samples, query_maxlen, story_maxlen)
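
For reference, Keras' Permute((2, 1)) swaps axes 1 and 2 (axis 0 is the batch axis); in numpy terms, with assumed toy shapes:

  import numpy as np

  samples, story_maxlen, query_maxlen = 1, 10, 4
  response = np.random.rand(samples, story_maxlen, query_maxlen)

  # swap the story and query axes so the query axis lines up with question_encoded
  response = np.transpose(response, (0, 2, 1))
  print(response.shape)  # (samples, query_maxlen, story_maxlen) = (1, 4, 10)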

Is this just concatenating the story from the part above with the question for the LSTM layer? Please correct me if I am misunderstanding here -

  # concatenate the match matrix with the question vector sequence
  answer = concatenate([response, question_encoded])
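
For shape reference, concatenate joins its inputs along the last (feature) axis, so each of the query_maxlen timesteps fed to the LSTM carries both story-derived and question features. A sketch with assumed toy shapes:

  import numpy as np

  samples, story_maxlen, query_maxlen, embedding_dim = 1, 10, 4, 64
  response = np.random.rand(samples, query_maxlen, story_maxlen)
  question_encoded = np.random.rand(samples, query_maxlen, embedding_dim)

  # concatenation along the last axis, as Keras' concatenate does by default
  answer = np.concatenate([response, question_encoded], axis=-1)
  print(answer.shape)  # (1, query_maxlen, story_maxlen + embedding_dim) = (1, 4, 74)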

I couldn't find an intuitive explanation of this anywhere, so I am posting here.

Any help is greatly appreciated!

Thanks.

Answer

First of all, the match variable does not just identify matching words; it gives a probability distribution over the input. These values can be seen as weights for each input sentence.

The input sequence is embedded using two different matrices, whose results are input_encoded_c and input_encoded_m in the code. Using the first embedding we find the match weights, and by applying those weights to the second embedded vectors we get the response. It would not make sense to apply the weights to the same vectors they were computed from. Then comes Permute: to generate the answer we concatenate the query with the response, and to make their dimensions line up we permute the dimensions of the response.
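
Putting the read step together, here is a minimal numpy sketch of the mechanism described above (toy sizes and the last-axis softmax are assumptions; the elementwise add follows the Keras example, whereas the paper takes a weighted sum):

  import numpy as np

  def softmax(x, axis=-1):
      e = np.exp(x - x.max(axis=axis, keepdims=True))
      return e / e.sum(axis=axis, keepdims=True)

  samples, story_maxlen, query_maxlen, embedding_dim = 1, 10, 4, 64
  input_encoded_m = np.random.rand(samples, story_maxlen, embedding_dim)  # embedding A
  input_encoded_c = np.random.rand(samples, story_maxlen, query_maxlen)   # embedding C
  question_encoded = np.random.rand(samples, query_maxlen, embedding_dim)

  # 1. match weights from the first embedding and the question
  match = softmax(np.einsum('bse,bqe->bsq', input_encoded_m, question_encoded))

  # 2. combine the weights with the second embedding (add in the Keras example)
  response = match + input_encoded_c                    # (1, 10, 4)

  # 3. permute so the query axis comes first, then join with the question
  response = np.transpose(response, (0, 2, 1))          # (1, 4, 10)
  answer = np.concatenate([response, question_encoded], axis=-1)  # (1, 4, 74)
  print(answer.shape)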

Reading section 2.1 of the End-To-End Memory Networks paper will help you understand this.