Word2Vec in Gensim¶
The idea behind Word2Vec is pretty simple. We are making an assumption that you can tell the meaning of a word by the company it keeps. This is analogous to the saying, "show me your friends, and I'll tell you who you are." So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning, or are at least highly related. For example, the words shocked, appalled and astonished are typically used in a similar context.
In this tutorial, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work! I have heard a lot of complaints about poor performance, etc., but it's really a combination of two things: (1) your input data and (2) your parameter settings. Note that the training algorithms in this package were ported from the original Word2Vec implementation by Google and extended with additional functionality.
Imports and logging¶
First, we start with our imports and get logging established:
# imports needed and set up logging
import gzip
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
Dataset¶
Next is our dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data. In this case I am going to use data from the OpinRank dataset, which has full user reviews of cars and hotels. I have specifically concatenated all of the hotel reviews into one big file, which is about 97MB compressed and 229MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review. You can download the OpinRank Word2Vec dataset here.
To avoid confusion: while gensim's word2vec tutorial says that you need to pass it a sequence of sentences as its input, you can always pass it a whole review as a sentence (i.e. a much larger unit of text), and it should not make much of a difference.
Now, let's take a closer look at this data below by printing the first line. You can see that this is a pretty hefty review.
data_file = "reviews_data.txt.gz"

with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break
b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Beijing, then you will be ok.I chose to have some breakfast in the hotel, which was really tasty and there was a good selection of dishes. There are a couple of computers to use in the communal area, as well as a pool table. There is also a small swimming pool and a gym area.I would definitely stay in this hotel again, but only if I did not plan to travel to central Beijing, as it can take a long time. The location is ok if you plan to do a lot of shopping, as there is a big shopping centre just few minutes away from the hotel and there are plenty of eating options around, including restaurants that serve a dog meat!\t\r\n"
Read files into a list¶
Now that we've had a sneak peek of our dataset, we can read it into a list so that we can pass this on to the Word2Vec model. Notice in the code below that I am directly reading the compressed file. I'm also doing mild pre-processing of the reviews using gensim.utils.simple_preprocess(line). This does some basic pre-processing such as tokenization, lowercasing, etc., and returns a list of tokens (words). Documentation of this pre-processing method can be found on the official Gensim documentation site.
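If you are curious about what simple_preprocess actually does, here is a quick standalone sketch; the sample string below is made up:

# simple_preprocess tokenizes, lowercases, and strips punctuation
print(gensim.utils.simple_preprocess("Nice, trendy hotel - location not TOO bad!"))
# should print something like: ['nice', 'trendy', 'hotel', 'location', 'not', 'too', 'bad']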
def read_input(input_file):
    """This method reads the input file, which is in gzip format, and yields a list of tokens for each review."""
    logging.info("reading file {0}...this may take a while".format(input_file))
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
            if (i % 10000 == 0):
                logging.info("read {0} reviews".format(i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess(line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list(read_input(data_file))
logging.info("Done reading data file")
2024-10-21 13:36:31,245 : INFO : reading file reviews_data.txt.gz...this may take a while
2024-10-21 13:36:31,247 : INFO : read 0 reviews
2024-10-21 13:36:33,317 : INFO : read 10000 reviews
2024-10-21 13:36:35,116 : INFO : read 20000 reviews
2024-10-21 13:36:37,160 : INFO : read 30000 reviews
2024-10-21 13:36:39,089 : INFO : read 40000 reviews
2024-10-21 13:36:41,227 : INFO : read 50000 reviews
2024-10-21 13:36:43,293 : INFO : read 60000 reviews
2024-10-21 13:36:45,311 : INFO : read 70000 reviews
2024-10-21 13:36:46,908 : INFO : read 80000 reviews
2024-10-21 13:36:48,579 : INFO : read 90000 reviews
2024-10-21 13:36:50,217 : INFO : read 100000 reviews
2024-10-21 13:36:52,039 : INFO : read 110000 reviews
2024-10-21 13:36:53,690 : INFO : read 120000 reviews
2024-10-21 13:36:55,468 : INFO : read 130000 reviews
2024-10-21 13:36:57,400 : INFO : read 140000 reviews
2024-10-21 13:36:59,165 : INFO : read 150000 reviews
2024-10-21 13:37:01,509 : INFO : read 160000 reviews
2024-10-21 13:37:03,356 : INFO : read 170000 reviews
2024-10-21 13:37:05,260 : INFO : read 180000 reviews
2024-10-21 13:37:07,151 : INFO : read 190000 reviews
2024-10-21 13:37:09,260 : INFO : read 200000 reviews
2024-10-21 13:37:11,216 : INFO : read 210000 reviews
2024-10-21 13:37:13,130 : INFO : read 220000 reviews
2024-10-21 13:37:14,943 : INFO : read 230000 reviews
2024-10-21 13:37:16,711 : INFO : read 240000 reviews
2024-10-21 13:37:19,267 : INFO : read 250000 reviews
2024-10-21 13:37:20,230 : INFO : Done reading data file
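As a quick sanity check, we can look at how many reviews we loaded and peek at the first few tokens of the first review (the exact output will depend on your copy of the dataset):

# number of reviews and the first 10 tokens of the first review
print(len(documents))
print(documents[0][:10])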
Training the Word2Vec model¶
Training the model is fairly straightforward. You just instantiate Word2Vec and pass it the reviews that we read in the previous step (the documents list). So we are essentially passing a list of lists, where each inner list contains a set of tokens from a user review. Word2Vec uses all of these tokens to internally create a vocabulary, and by vocabulary I mean a set of unique words.
Note that because we pass the corpus to the constructor, Word2Vec builds the vocabulary and already trains the model (for the default 5 epochs). The explicit call to train(...) afterwards then continues training for two more epochs on the same corpus, which is why the logs below show two training runs (and a warning about the effective alpha). Training on the OpinRank dataset takes about 10 minutes, so please be patient while running your code on this dataset.
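If you prefer to keep the two stages explicit, the same training can be sketched as a separate build_vocab and train call; this is shown for illustration only and is not what we run below:

# equivalent two-step form: build the vocabulary first,
# then train explicitly without passing the corpus to the constructor
model = gensim.models.Word2Vec(vector_size=150, window=10, min_count=2, workers=10)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)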
Behind the scenes, we are actually training a simple neural network with a single hidden layer. However, we are not going to use that neural network after training. Instead, the goal is to learn the weights of the hidden layer; these weights are essentially the word vectors that we're trying to learn.
# passing documents to the constructor builds the vocabulary and trains the model (default 5 epochs)
model = gensim.models.Word2Vec(documents, vector_size=150, window=10, min_count=2, workers=10)
# continue training for 2 more epochs on the same corpus
model.train(documents, total_examples=len(documents), epochs=2)
2024-10-21 13:37:20,264 : INFO : collecting all words and their counts
2024-10-21 13:37:27,208 : INFO : collected 150061 word types from a corpus of 41519360 raw words and 255404 sentences
2024-10-21 13:37:27,209 : INFO : Creating a fresh vocabulary
2024-10-21 13:37:27,452 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 retains 70537 unique words (47.01% of original 150061, drops 79524)', 'datetime': '2024-10-21T13:37:27.451995', 'gensim': '4.3.0', 'python': '3.11.7 (main, Dec 15 2023, 12:09:04) [Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'prepare_vocab'}
2024-10-21 13:37:27,452 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 leaves 41439836 word corpus (99.81% of original 41519360, drops 79524)', 'datetime': '2024-10-21T13:37:27.452764', 'gensim': '4.3.0', 'python': '3.11.7 (main, Dec 15 2023, 12:09:04) [Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'prepare_vocab'}
2024-10-21 13:37:27,766 : INFO : deleting the raw counts dictionary of 150061 items
2024-10-21 13:37:27,770 : INFO : sample=0.001 downsamples 55 most-common words
2024-10-21 13:37:27,771 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 30349251.36700416 word corpus (73.2%% of prior 41439836)', 'datetime': '2024-10-21T13:37:27.771279', 'gensim': '4.3.0', 'python': '3.11.7 (main, Dec 15 2023, 12:09:04) [Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'prepare_vocab'}
2024-10-21 13:37:28,281 : INFO : estimated required memory for 70537 words and 150 dimensions: 119912900 bytes
2024-10-21 13:37:28,282 : INFO : resetting layer weights
2024-10-21 13:37:28,338 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2024-10-21T13:37:28.338877', 'gensim': '4.3.0', 'python': '3.11.7 (main, Dec 15 2023, 12:09:04) [Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'build_vocab'}
2024-10-21 13:37:28,339 : INFO : Word2Vec lifecycle event {'msg': 'training model with 10 workers on 70537 vocabulary and 150 features, using sg=0 hs=0 sample=0.001 negative=5 window=10 shrink_windows=True', 'datetime': '2024-10-21T13:37:28.339566', 'gensim': '4.3.0', 'python': '3.11.7 (main, Dec 15 2023, 12:09:04) [Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}
2024-10-21 13:37:58,184 : INFO : EPOCH 0: training on 41519360 raw words (30350992 effective words) took 29.8s, 1017076 effective words/s
2024-10-21 13:38:26,321 : INFO : EPOCH 1: training on 41519360 raw words (30349555 effective words) took 28.1s, 1078859 effective words/s
2024-10-21 13:38:56,011 : INFO : EPOCH 2: training on 41519360 raw words (30349166 effective words) took 29.7s, 1022392 effective words/s
2024-10-21 13:39:23,388 : INFO : EPOCH 3: training on 41519360 raw words (30350567 effective words) took 27.4s, 1108861 effective words/s
2024-10-21 13:39:51,143 : INFO : EPOCH 4: training on 41519360 raw words (30350245 effective words) took 27.7s, 1093756 effective words/s
2024-10-21 13:39:51,144 : INFO : Word2Vec lifecycle event {'msg': 'training on 207596800 raw words (151750525 effective words) took 142.8s, 1062632 effective words/s', 'datetime': '2024-10-21T13:39:51.144242', 'gensim': '4.3.0', 'python': '3.11.7 (main, Dec 15 2023, 12:09:04) [Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}
2024-10-21 13:39:51,144 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec<vocab=70537, vector_size=150, alpha=0.025>', 'datetime': '2024-10-21T13:39:51.144842', 'gensim': '4.3.0', 'python': '3.11.7 (main, Dec 15 2023, 12:09:04) [Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}
2024-10-21 13:39:51,145 : WARNING : Effective 'alpha' higher than previous training cycles
2024-10-21 13:39:51,146 : INFO : Word2Vec lifecycle event {'msg': 'training model with 10 workers on 70537 vocabulary and 150 features, using sg=0 hs=0 sample=0.001 negative=5 window=10 shrink_windows=True', 'datetime': '2024-10-21T13:39:51.146283', 'gensim': '4.3.0', 'python': '3.11.7 (main, Dec 15 2023, 12:09:04) [Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}
2024-10-21 13:40:18,475 : INFO : EPOCH 0: training on 41519360 raw words (30347454 effective words) took 27.3s, 1110768 effective words/s
2024-10-21 13:40:44,334 : INFO : EPOCH 1: training on 41519360 raw words (30351434 effective words) took 25.9s, 1173964 effective words/s
2024-10-21 13:40:44,334 : INFO : Word2Vec lifecycle event {'msg': 'training on 83038720 raw words (60698888 effective words) took 53.2s, 1141227 effective words/s', 'datetime': '2024-10-21T13:40:44.334874', 'gensim': '4.3.0', 'python': '3.11.7 (main, Dec 15 2023, 12:09:04) [Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}
(60698888, 83038720)
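The tuple returned by train(...) is the number of effective words trained and the number of raw words seen, matching the last lifecycle event in the log above. And since the word vectors are just the learned hidden-layer weights, we can inspect them directly through model.wv; a quick peek:

# the learned vectors live in model.wv; each word maps to a 150-dim vector
print(len(model.wv.index_to_key))  # vocabulary size (70537 here)
print(model.wv["hotel"].shape)     # (150,)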
Now, let's look at some output¶
This first example shows a simple case of looking up words similar to the word dirty. All we need to do here is call the most_similar method and provide the word dirty as the positive example. This returns the top 10 similar words.
w1 = "dirty"
model.wv.most_similar(positive=w1)
[('filthy', 0.8475331664085388), ('unclean', 0.8054439425468445), ('smelly', 0.7647028565406799), ('stained', 0.7615184187889099), ('dusty', 0.7610194087028503), ('grubby', 0.7402293682098389), ('grimy', 0.7398985624313354), ('soiled', 0.7357913255691528), ('mouldy', 0.7194872498512268), ('moldy', 0.7120541334152222)]
That looks pretty good, right? Let's look at a few more. Let's look at the words most similar to polite, france and shocked.
# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar(positive=w1, topn=6)
[('courteous', 0.9040976762771606), ('friendly', 0.8224669098854065), ('cordial', 0.8160861134529114), ('curteous', 0.8024160265922546), ('freindly', 0.7767253518104553), ('curtious', 0.7668147087097168)]
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar(positive=w1, topn=6)
[('canada', 0.6998118162155151), ('barcelona', 0.6668195724487305), ('spain', 0.666758120059967), ('germany', 0.662314772605896), ('austria', 0.6565243005752563), ('england', 0.6532105207443237)]
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
model.wv.most_similar(positive=w1, topn=6)
[('astonished', 0.7996638417243958), ('amazed', 0.7919517159461975), ('horrified', 0.7822665572166443), ('appalled', 0.7650083899497986), ('astounded', 0.7546373009681702), ('suprised', 0.73783940076828)]
That's nice. You can even specify several positive examples to get things that are related in the provided context, and provide negative examples to say what should not be considered as related. In the example below, we are asking for all items that relate to bed only:
# get everything related to stuff on the bed
w1 = ["bed",'sheet','pillow']
w2 = ['couch']
model.wv.most_similar(positive=w1, negative=w2, topn=10)
[('duvet', 0.7354281544685364), ('quilt', 0.7208324670791626), ('matress', 0.7173582315444946), ('mattress', 0.7141720056533813), ('blanket', 0.7111974954605103), ('foam', 0.6904359459877014), ('pillowcase', 0.66414475440979), ('pillows', 0.6405133008956909), ('sheets', 0.6377283930778503), ('pillowcases', 0.6346890926361084)]
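For the curious, here is a rough numpy sketch of what most_similar does with the positive and negative examples. This is a simplification of gensim's actual implementation; for instance, gensim also filters the input words out of the results, which this sketch does not:

import numpy as np

# combine the query: average of the (unit-normalized) positive vectors
# minus the average of the negative ones, then re-normalize
pos = np.mean([model.wv.get_vector(w, norm=True) for w in ["bed", "sheet", "pillow"]], axis=0)
neg = np.mean([model.wv.get_vector(w, norm=True) for w in ["couch"]], axis=0)
query = pos - neg
query /= np.linalg.norm(query)

# rank the whole vocabulary by cosine similarity to the combined query
sims = model.wv.get_normed_vectors() @ query
top = np.argsort(-sims)[:10]
print([model.wv.index_to_key[i] for i in top])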
Similarity between two words in the vocabulary¶
You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary.
# similarity between two different words
model.wv.similarity(w1="dirty",w2="smelly")
0.7647028
# similarity between two identical words
model.wv.similarity(w1="dirty",w2="dirty")
0.99999994
# similarity between two unrelated words
model.wv.similarity(w1="dirty",w2="clean")
0.27562493
Under the hood, the three snippets above compute the cosine similarity between the two specified words, using the word vectors of each. From the scores, it makes sense that dirty is highly similar to smelly, but dirty is dissimilar to clean. If you compute the similarity between two identical words, the score will be 1.0, since identical vectors point in exactly the same direction; cosine similarity scores always fall in the range [-1.0, 1.0]. You can read more about cosine similarity scoring here.
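As a sanity check, we can reproduce one of these numbers by hand with numpy; this should closely match model.wv.similarity(w1="dirty", w2="smelly") above:

import numpy as np

# cosine similarity = dot product of the two vectors
# divided by the product of their norms
v1, v2 = model.wv["dirty"], model.wv["smelly"]
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))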
Find the odd one out¶
You can even use Word2Vec to find the odd one out, given a list of items.
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])
'france'
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed","pillow","duvet","shower"])
'shower'
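Roughly speaking, doesnt_match computes a mean vector for the group and picks the word that is least similar to it. A simplified numpy sketch of that idea (gensim's implementation differs in its details):

import numpy as np

# compare each (normalized) word vector to the group's mean vector
# and pick the least similar word
words = ["bed", "pillow", "duvet", "shower"]
vecs = np.array([model.wv.get_vector(w, norm=True) for w in words])
mean = vecs.mean(axis=0)
sims = vecs @ (mean / np.linalg.norm(mean))
print(words[int(np.argmin(sims))])  # expected: 'shower'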
Understanding some of the parameters¶
To train the model earlier, we had to set some parameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to train the model.
model = gensim.models.Word2Vec(documents, vector_size=150, window=10, min_count=2, workers=10)
vector_size¶
The size of the dense vector that represents each token or word (this parameter was named size in gensim versions before 4.0). If you have very limited data, then vector_size should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me.
window¶
The maximum distance between the target word and its neighboring word. If a neighbor's position is farther than the maximum window width to the left or the right, then it is not considered as being related to the target word. In theory, a smaller window should give you terms that are more related. If you have lots of data, then the window size should not matter too much, as long as it's a reasonably sized window.
min_count¶
The minimum frequency count of words. The model ignores words that do not satisfy the min_count. Extremely infrequent words are usually unimportant, so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.
workers¶
How many worker threads to use behind the scenes. More workers speed up training on multicore machines.
When should you use Word2Vec?¶
There are many application scenarios for Word2Vec. Imagine you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that: you end up with a lexicon not just for sentiment words, but for most words in the vocabulary.
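One way to bootstrap such a lexicon is to start from a handful of seed words and expand them with most_similar; a minimal sketch, where the seed words and the 0.7 similarity cutoff are arbitrary choices:

# expand seed sentiment words with their nearest neighbors
seeds = {"positive": ["good", "clean", "friendly"],
         "negative": ["bad", "dirty", "rude"]}

lexicon = {}
for label, words in seeds.items():
    for w in words:
        lexicon[w] = label
        for neighbor, score in model.wv.most_similar(w, topn=10):
            if score > 0.7:  # only accept reasonably close neighbors
                lexicon.setdefault(neighbor, label)

print(list(lexicon.items())[:10])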
Beyond raw, unstructured text data, you could also use Word2Vec on more structured data. For example, if you had tags for a million Stack Overflow questions and answers, you could find tags that are related to a given tag and recommend the related ones for exploration. You can do this by treating each set of co-occurring tags as a "sentence" and training a Word2Vec model on this data, as sketched below. Granted, you still need a large number of examples to make it work.
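A minimal sketch of that idea, assuming you have already collected the tag sets as lists (the tags below are made up):

# each question's set of co-occurring tags acts as one "sentence"
tag_sets = [
    ["python", "pandas", "dataframe"],
    ["python", "numpy", "array"],
    ["javascript", "react", "hooks"],
    # ... in practice you would want hundreds of thousands of these
]

tag_model = gensim.models.Word2Vec(tag_sets, vector_size=50, window=10, min_count=1, workers=4)

# recommend tags related to a given tag
print(tag_model.wv.most_similar("python", topn=3))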