I have been working on a sentiment analysis project recently, and have been looking for ways to improve the performance of my model.
Today I set myself the challenge of learning how to make use of pre-trained GloVe word embeddings.
On the official website for GloVe word embeddings, there are several zip files containing embeddings trained on different data. One is trained on Wikipedia, another on Twitter, and two on web-crawl data.
Within each zip file, there are several text files that contain the actual word vectors. There is a separate file for each word embedding size trained on the same data. For example, here is a list of the files in the glove.6B.zip archive trained on Wikipedia:
glove.6B.50d.txt
glove.6B.100d.txt
glove.6B.200d.txt
glove.6B.300d.txt
6B specifies the number of tokens the data was trained on. 50d, 100d, etc. specify the size of the word vectors. The format inside each file is as follows, shown in this simplified example:
when 0.27062 -0.36596 0.097193 -0.50708 0.37375
year -0.098793 0.26983 0.35304 -0.10727 -0.015183
there 0.68491 0.32385 -0.11592 -0.35925 0.49889
...
Each line starts with the token/word, followed by all the values of the corresponding vector for that token, each separated by a space. The number of values matches the word vector size. So, for example, if you opened the glove.6B.300d.txt file, each line would contain 300 numbers after the token.
The number of lines corresponds to the size of the vocabulary.
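If you ever want to double-check these numbers for a particular file, a quick (and entirely optional) check like the one below reads the first line to get the vector size and counts the remaining lines to get the vocabulary size. The path here is just a placeholder for wherever you unzipped the file.

from io import open

file = "/path/to/glove.6B.100d.txt"
with open(file, encoding="utf-8", mode="r") as textFile:
    first_line = textFile.readline().split()
    print("Vector size:", len(first_line) - 1)           # first element is the token itself
    print("Vocab size:", 1 + sum(1 for _ in textFile))   # remaining lines, plus the one already read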
In my case, I already had an established vocabulary for the data I was using in my project. It is a vocabulary that will not match up exactly with the GloVe vocabulary, so the challenge for today was to load the GloVe vectors in a format that would be useful given my established vocabulary.
We can start with a list that maps the ids of your own established vocabulary to the actual tokens, and a dictionary that maps token strings to ids (as in this simple example):
id2word = ["PAD", "UNKNOWN", "the", "there", "year", "when"]
word2id = {word: id for id, word in enumerate(id2word)}
We will need to initialize the array that will store the embeddings. Since we cannot assume that all of the tokens in our pre-defined vocabulary will exist in the pretrained GloVe vectors, we will initialize the array to random values.
import numpy as np

# INITIALIZE EMBEDDINGS TO RANDOM VALUES
embed_size = 100
vocab_size = len(id2word)
sd = 1/np.sqrt(embed_size)  # Standard deviation to use
weights = np.random.normal(0, scale=sd, size=[vocab_size, embed_size])
weights = weights.astype(np.float32)
I initialized the weights using a variant of Xavier initialization. You could play around with different initialization strategies to see if you get better results.
Note: the embed_size variable should match the word vector size of the file you will use.
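If you do want to play around with other initialization strategies, here is a rough sketch of a couple of alternatives. The choice of a uniform distribution (or plain zeros) is just an assumption for illustration, not something the GloVe project prescribes.

# ALTERNATIVE INITIALIZATIONS (optional experiments)

# Uniform distribution over [-sd, sd], reusing the same sd as above
weights = np.random.uniform(low=-sd, high=sd, size=[vocab_size, embed_size]).astype(np.float32)

# Or simply zeros for any token not found in GloVe
# weights = np.zeros([vocab_size, embed_size], dtype=np.float32)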
To override these random values with the pretrained word vectors from the GloVe text file, we can run the following.
from io import open

file = "/path/to/glove.6B.100d.txt"

# EXTRACT DESIRED GLOVE WORD VECTORS FROM TEXT FILE
with open(file, encoding="utf-8", mode="r") as textFile:
    for line in textFile:
        # Separate the values from the word
        line = line.split()
        word = line[0]

        # If word is in our vocab, then update the corresponding weights
        id = word2id.get(word, None)
        if id is not None:
            weights[id] = np.array(line[1:], dtype=np.float32)
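It can also be useful to know how much of your vocabulary was actually covered by the GloVe file. A small second pass like the one below (with a hypothetical found_ids set that is not part of the code above) collects the ids that appear in the file and reports the coverage.

# OPTIONAL: check how much of the vocabulary was covered by GloVe
found_ids = set()
with open(file, encoding="utf-8", mode="r") as textFile:
    for line in textFile:
        word = line.split()[0]
        id = word2id.get(word, None)
        if id is not None:
            found_ids.add(id)

print("{} of {} vocab tokens found in GloVe".format(len(found_ids), vocab_size))

Any token that is not covered simply keeps its random initialization.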
In my case, I am using PyTorch, and I have a model that is created as a
subclass of torch.nn.Module
as follows:
class Model(nn.Module):
    def __init__(self, n_vocab, embed_size, ...):
        super(Model, self).__init__()
        ...
        self.embeddings = nn.Embedding(n_vocab, embed_size)
        ...

...
model = Model(n_vocab, embed_size, ...)
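For concreteness, a minimal stand-in for such a model might look like the sketch below. The LSTM-plus-linear architecture, and the hidden_size and n_classes values, are just illustrative assumptions, not the actual model from my project.

import torch
import torch.nn as nn

class SentimentModel(nn.Module):
    def __init__(self, n_vocab, embed_size, hidden_size=128, n_classes=2):
        super(SentimentModel, self).__init__()
        self.embeddings = nn.Embedding(n_vocab, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        # x is a [batch, sequence_length] tensor of token ids
        embedded = self.embeddings(x)       # [batch, seq_len, embed_size]
        _, (h_n, _) = self.lstm(embedded)   # h_n is [1, batch, hidden_size]
        return self.classifier(h_n[-1])     # [batch, n_classes]

model = SentimentModel(n_vocab=vocab_size, embed_size=embed_size)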
In order to update the embeddings in my model, I run the following:
# UPDATING PYTORCH EMBEDDINGS
model.embeddings.weight.data = torch.Tensor(weights)
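As a quick check that the pretrained vectors actually made it into the model, you can compare one row of the embedding layer against the corresponding row of weights. I am using "when" here only because it appears in the toy vocabulary above.

# VERIFY THAT THE EMBEDDINGS WERE UPDATED
token_id = word2id["when"]
print(np.allclose(model.embeddings.weight.data[token_id].numpy(), weights[token_id]))  # should print True

Another option you sometimes see is model.embeddings.weight.data.copy_(torch.from_numpy(weights)), which copies the values in place instead of replacing the tensor.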