Graph Learning Reference Material: Word Vectors with word2vec (Part 5)

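    import math
    import random

    def subsampling(corpus, word2id_freq):
        # discard() decides whether a word is dropped. It is random,
        # so each call may return a different result.
        # The more frequent a word is, the more likely it is to be discarded.
        def discard(word_id):
            return random.uniform(0, 1) < 1 - math.sqrt(
                1e-4 / word2id_freq[word_id] * len(corpus))

        corpus = [word for word in corpus if not discard(word)]
        return corpus

    corpus = subsampling(corpus, word2id_freq)
    print("%d tokens in the corpus" % len(corpus))
    print(corpus[:50])

Once the corpus has been preprocessed, the next step is to build the training data. As described earlier, a sliding window scans the corpus from left to right. Within each window, the center word is used to predict its context words, and each such pairing becomes a training example.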
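To make the sliding-window scan concrete, here is a minimal sketch on a toy sentence. It assumes a fixed window size rather than the randomly sampled one used later, and the helper `context_pairs` and the toy tokens are hypothetical names introduced only for illustration.

```python
def context_pairs(tokens, window_size):
    """Return (center, context) pairs for a fixed window size."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window to the corpus boundaries.
        lo = max(0, i - window_size)
        hi = min(len(tokens) - 1, i + window_size)
        for j in range(lo, hi + 1):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "quick", "brown", "fox"]
pairs = context_pairs(tokens, window_size=1)
print(pairs)
```

With a window size of 1, the four tokens yield six pairs: words at the two ends contribute one pair each, and interior words contribute two.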
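In practice the vocabulary is often very large (50,000 or 100,000 words, for example), and matrix operations over the full vocabulary, such as softmax, consume a great deal of compute and memory. Negative sampling is therefore used to approximate the softmax result.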

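Given a center word and a context word to be predicted, treat that context word as the positive sample.
Select several negative samples by sampling at random from the vocabulary.
This turns one large multi-class classification problem into a binary classification problem, which speeds up computation.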
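The difference in cost can be sketched with NumPy. The example below is only an illustration under stated assumptions: the embedding tables are random, and the sizes, word ids, and variable names are hypothetical rather than taken from the tutorial's model. It compares a full-softmax loss, which scores every word in the vocabulary, with a negative-sampling loss, which scores one positive word and four negatives.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 16  # illustrative sizes (assumption)
center_emb = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
context_emb = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))

center_id, positive_id = 3, 7  # hypothetical word ids

# Draw 4 negatives, rejecting any collision with the positive word.
negative_ids = []
while len(negative_ids) < 4:
    candidate = int(rng.integers(0, vocab_size))
    if candidate != positive_id:
        negative_ids.append(candidate)

# Full softmax: one score per vocabulary word (vocab_size dot products).
all_scores = context_emb @ center_emb[center_id]
m = all_scores.max()  # subtract the max for numerical stability
log_z = m + np.log(np.exp(all_scores - m).sum())
softmax_loss = log_z - all_scores[positive_id]

# Negative sampling: only 1 + 4 scores, each treated as a binary decision.
target_ids = np.concatenate(([positive_id], negative_ids))
labels = np.array([1.0] + [0.0] * len(negative_ids))
scores = context_emb[target_ids] @ center_emb[center_id]
probs = sigmoid(scores)
ns_loss = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print("scores computed by softmax:", all_scores.shape[0])
print("scores computed by negative sampling:", scores.shape[0])
```

The softmax branch touches all 1000 context vectors for a single training pair, while the negative-sampling branch touches five, which is where the speed-up comes from.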

    import random

    # max_window_size is the upper bound on the window size; the program scans
    # the whole corpus from left to right with windows of at most this size.
    # negative_sample_num is the number of negative samples drawn at random
    # for each positive sample.
    # In general, a larger negative_sample_num makes training more stable
    # but slower.
    def build_data(corpus, word2id_dict, word2id_freq,
                   max_window_size=3, negative_sample_num=4):
        # vocab_size is not defined in this snippet; this assumes word ids
        # run from 0 to len(word2id_dict) - 1.
        vocab_size = len(word2id_dict)

        # A list that stores the processed samples.
        dataset = []

        # Enumerate every center position from left to right.
        for center_word_idx in range(len(corpus)):
            # Sample a window_size at random, capped at max_window_size,
            # which makes training more stable.
            window_size = random.randint(1, max_window_size)
            # The current center word is the one at center_word_idx.
            center_word = corpus[center_word_idx]

            # Every word within window_size on either side of the center
            # word counts as a positive sample.
            positive_word_range = (
                max(0, center_word_idx - window_size),
                min(len(corpus) - 1, center_word_idx + window_size))
            positive_word_candidates = [
                corpus[idx]
                for idx in range(positive_word_range[0], positive_word_range[1] + 1)
                if idx != center_word_idx]

            # For each positive sample, draw negative_sample_num negatives.
            for positive_word in positive_word_candidates:
                # Store the triple (center word, positive sample, label=1);
                # label=1 marks a positive sample.
                dataset.append((center_word, positive_word, 1))

                # Start negative sampling.
                i = 0
                while i < negative_sample_num:
                    negative_word_candidate = random.randint(0, vocab_size - 1)

                    if negative_word_candidate not in positive_word_candidates:
                        # Store the triple (center word, negative sample, label=0);
                        # label=0 marks a negative sample.
                        dataset.append((center_word, negative_word_candidate, 0))
                        i += 1

        return dataset

    dataset = build_data(corpus, word2id_dict, word2id_freq)
    for _, (center_word, target_word, label) in zip(range(50), dataset):
        print("center_word %s, target %s, label %d" % (
            id2word_dict[center_word], id2word_dict[target_word], label))

Once the training data is ready, assemble it into mini-batches to feed into the network for training. The code is as follows:

    import random

    import numpy as np

    # Each kind of data goes into its own tensor so the network can process it easily.
    # NumPy's array function builds these tensors, which are then fed to the
    # network for training.
    def build_batch(dataset, batch_size, epoch_num):
        # center_word_batch buffers batch_size center words.
        center_word_batch = []
        # target_word_batch buffers batch_size target words
        # (each may be a positive or a negative sample).
        target_word_batch = []
        # label_batch buffers batch_size labels (0 or 1) used for training.
        label_batch = []

        for epoch in range(epoch_num):
            # Shuffle the data before each new epoch to improve training.
            random.shuffle(dataset)

            for center_word, target_word, label in dataset:
                # Walk through every sample and route its fields
                # to the matching buffer.
                center_word_batch.append([center_word])
                target_word_batch.append([target_word])
                label_batch.append(label)

                # Once batch_size samples have accumulated, return them.
                # np.array wraps each list as a tensor, and yield hands the
                # batch back through Python's generator mechanism,
                # which saves memory.
                if len(center_word_batch) == batch_size:
                    yield (np.array(center_word_batch).astype("int64"),
                           np.array(target_word_batch).astype("int64"),
                           np.array(label_batch).astype("float32"))
                    center_word_batch = []
                    target_word_batch = []
                    label_batch = []

        # Yield whatever is left over after the final epoch.
        if len(center_word_batch) > 0:
            yield (np.array(center_word_batch).astype("int64"),
                   np.array(target_word_batch).astype("int64"),
                   np.array(label_batch).astype("float32"))

    for _, batch in zip(range(10), build_batch(dataset, 128, 3)):
        print(batch)
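To check what shapes such a batching generator hands to the network, a small stand-alone sketch can help. It is a simplified, single-pass variant without shuffling, and `iter_batches` and `toy_dataset` are hypothetical names used only for this illustration.

```python
import numpy as np

def iter_batches(samples, batch_size):
    """Yield (centers, targets, labels) arrays, one pass, no shuffling."""
    centers, targets, labels = [], [], []
    for center, target, label in samples:
        # Wrapping each id in a list gives the arrays shape (batch, 1).
        centers.append([center])
        targets.append([target])
        labels.append(label)
        if len(centers) == batch_size:
            yield (np.array(centers, dtype="int64"),
                   np.array(targets, dtype="int64"),
                   np.array(labels, dtype="float32"))
            centers, targets, labels = [], [], []
    if centers:  # leftover samples form a final, smaller batch
        yield (np.array(centers, dtype="int64"),
               np.array(targets, dtype="int64"),
               np.array(labels, dtype="float32"))

# Five (center id, target id, label) triples with made-up ids.
toy_dataset = [(0, 1, 1), (0, 5, 0), (1, 0, 1), (1, 2, 1), (1, 9, 0)]
batches = list(iter_batches(toy_dataset, batch_size=2))
for centers, targets, labels in batches:
    print(centers.shape, targets.shape, labels.shape)
```

Five samples with a batch size of 2 give three batches; the last one holds the single leftover sample. The word-id arrays are two-dimensional, while the label array is one-dimensional.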