基于遗传算法特征选择及单层感知机模型的IMDB电影评论文本分类案例

本文主要是介绍基于遗传算法特征选择及单层感知机模型的IMDB电影评论文本分类案例,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

基于遗传算法特征选择及单层感知机模型的IMDB电影评论文本分类案例

  • 1.数据载入及处理
  • 2.感知机模型建立
  • 3.模型训练
  • 4.遗传算法进行特征选择
    • 注意
  • 5.联系我们

1.数据载入及处理

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from keras.datasets import imdb
from keras.preprocessing import sequence
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as pltmax_features = 10000
maxlen = 200
batch_size = 32# 加载IMDB数据集
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')# 限定评论长度,并进行填充
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)[:2000]
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)[:2000]
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)# 将整数序列转换为文本
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in input_train[0]])# 使用词袋模型表示文本
vectorizer = CountVectorizer(max_features=max_features)
X_train = vectorizer.fit_transform([' '.join([reverse_word_index.get(i - 3, '?') for i in sequence]) for sequence in input_train])
X_test = vectorizer.transform([' '.join([reverse_word_index.get(i - 3, '?') for i in sequence]) for sequence in input_test])# 转换数据为PyTorch张量
X_train_tensor = torch.tensor(X_train.toarray(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.toarray(), dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)batch_size = 2000
train_iter = DataLoader(TensorDataset(X_train_tensor, y_train_tensor), batch_size)
test_iter = DataLoader(TensorDataset(X_test_tensor, y_test_tensor), batch_size)

2.感知机模型建立

# 定义感知机网络
class Perceptron(nn.Module):def __init__(self, input_size):super(Perceptron, self).__init__()self.fc = nn.Linear(input_size, 1)self.sigmoid = nn.Sigmoid()def forward(self, x):x = self.fc(x)x = self.sigmoid(x)return x# 训练感知机模型
def train(model, iterator, optimizer, criterion):model.train()for batch in iterator:optimizer.zero_grad()text, label = batchpredictions = model(text).squeeze(1)loss = criterion(predictions, label)loss.backward()optimizer.step()# 测试感知机模型
def evaluate(model, iterator, criterion):model.eval()total_loss = 0total_correct = 0with torch.no_grad():for batch in iterator:text, label = batchpredictions = model(text).squeeze(1)loss = criterion(predictions, label)total_loss += loss.item()rounded_preds = torch.round(predictions)total_correct += (rounded_preds == label).sum().item()return total_loss / len(iterator), total_correct / len(iterator.dataset)# 初始化感知机模型
input_size = X_train_tensor.shape[1]
model = Perceptron(input_size)

3.模型训练

# # 定义损失函数和优化器
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)N_EPOCHS = 10
eval_acc_list = []
for epoch in range(N_EPOCHS):train(model, train_iter, optimizer, criterion)eval_loss, eval_acc = evaluate(model, test_iter, criterion)eval_acc_list.append(eval_acc)print(f'Epoch: {epoch+1}, Test Loss: {eval_loss:.3f}, Test Acc: {eval_acc*100:.2f}%')plt.plot(range(N_EPOCHS), eval_acc_list)
plt.title('Test Accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.show()

在这里插入图片描述

4.遗传算法进行特征选择

# 随机初始化染色体
def initialize_population(population_size, num_genes):# # Option 1:# p=np.array([0.05,0.95])# return np.random.choice([0, 1], size=(population_size, num_genes), p=p.ravel())# Option 2:return np.random.choice([0, 1], size=(population_size, num_genes))# 计算适应值,以分类器的准确度
def calculate_fitness(population, model, criterion):fitness = []for chromosome in population: # population: a 0-1 sequence selected_features = np.where(chromosome == 1)[0] # 更新模型输入维度input_dim = len(selected_features)model.fc = nn.Linear(input_dim, 1)optimizer = optim.Adam(model.parameters(), lr=0.001)idx = torch.tensor(selected_features)        train_iter = DataLoader(TensorDataset(X_train_tensor[:, idx], y_train_tensor), batch_size)test_iter = DataLoader(TensorDataset(X_test_tensor[:, idx], y_test_tensor), batch_size)# 训练并获取准确度N_EPOCHS = 10for epoch in range(N_EPOCHS):train(model, train_iter, optimizer, criterion)test_loss, test_acc = evaluate(model, test_iter, criterion)model.train() fitness.append(test_acc)return np.array(fitness)# 选择
def selection(population, fitness): # input populations and their accuracyprobabilities = fitness / sum(fitness) # the accuracy-based probability of selection# # Option 1: no random in selection, choose the top 2 as parents# probabilities_copy = probabilities.copy()# probabilities_copy.sort()# max_1 = probabilities_copy[-1]# max_2 = probabilities_copy[-2]# max_1_index = np.where(probabilities == max_1)# max_2_index = np.where(probabilities == max_2)# selected_indices = [max_1_index[0].tolist()[0], max_2_index[0].tolist()[0]] * 25# Option 2: random selected_indices = np.random.choice(range(len(population)), size=len(population), p=probabilities)return population[selected_indices]# 交叉
def crossover(parents, crossover_rate):children = []for i in range(0, len(parents), 2):parent1, parent2 = parents[i], parents[i + 1]if np.random.rand() < crossover_rate:crossover_point = np.random.randint(1, len(parent1))child1 = np.concatenate((parent1[:crossover_point], parent2[crossover_point:]))child2 = np.concatenate((parent2[:crossover_point], parent1[crossover_point:]))else:child1, child2 = parent1, parent2children.extend([child1, child2])return np.array(children)# 变异
def mutation(children, mutation_rate):for i in range(len(children)):mutation_points = np.where(np.random.rand(len(children[i])) < mutation_rate)[0]children[i][mutation_points] = 1 - children[i][mutation_points]  # keyreturn children# 定义遗传算法的主函数
def genetic_algorithm(population_size, num_genes, generations, crossover_rate, mutation_rate, model, criterion):# 初始化染色体population = initialize_population(population_size, num_genes)fitness_list = []for generation in range(generations):print('Generation', generation+1, ":")fitness = calculate_fitness(population, model, criterion) # return a list (1, population_size) with history test acc# 选择selected_population = selection(population, fitness) # return a list, (population_size, num_genes / input_size / sentence_length), each adjacent are parents# 交叉children = crossover(selected_population, crossover_rate)# 变异mutated_children = mutation(children, mutation_rate)# 形成新种群population = mutated_children# 输出当前最优解best_individual = population[np.argmax(fitness)]fitness_list.append(fitness.max())print(f"Generation {generation + 1}, Best Individual: {best_individual}, Fitness: {fitness.max()}")plt.plot(range(generations), fitness_list)plt.title('Test Accuracy with feature selection via genetic algorithm')plt.xlabel('epoch')plt.ylabel('accuracy')plt.show()# 返回最优解best_individual = population[np.argmax(fitness)]return best_individual# 调用遗传算法
model = Perceptron(input_size)
best_solution = genetic_algorithm(population_size=50, num_genes=input_size, generations=10, crossover_rate=0.8, mutation_rate=0.1, model=model, criterion=criterion)
print(f"Final Best Solution: {best_solution}")# 解释最优解
selected_features = np.where(best_solution == 1)[0]
print(f"Selected Features: {selected_features}")
print("Shape of Selected Features = ",selected_features.shape)

在这里插入图片描述

注意

  1. 在本任务中,selection函数中第一个option 1仅选择效果最好的两个染色体作为父母比option 2在population中随机选择的效率更高(10轮次后,验证集精度74%>71%);
  2. 在本任务中,初始化initialize_population函数中指定选择更多的特征(95%, Option 1)比随机选择特征(50%, Option 2)的效率更高;
  3. 每一次基于筛选输入特征的维度修改模型结构参数后,需要注意重申一下 optimizer变量,因为optimizer的声明中涉及model.parameters()

5.联系我们

Email: oceannedlg@outlook.com
在这里插入图片描述

这篇关于基于遗传算法特征选择及单层感知机模型的IMDB电影评论文本分类案例的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/529850

相关文章

Python中图片与PDF识别文本(OCR)的全面指南

《Python中图片与PDF识别文本(OCR)的全面指南》在数据爆炸时代,80%的企业数据以非结构化形式存在,其中PDF和图像是最主要的载体,本文将深入探索Python中OCR技术如何将这些数字纸张转... 目录一、OCR技术核心原理二、python图像识别四大工具库1. Pytesseract - 经典O

苹果macOS 26 Tahoe主题功能大升级:可定制图标/高亮文本/文件夹颜色

《苹果macOS26Tahoe主题功能大升级:可定制图标/高亮文本/文件夹颜色》在整体系统设计方面,macOS26采用了全新的玻璃质感视觉风格,应用于Dock栏、应用图标以及桌面小部件等多个界面... 科技媒体 MACRumors 昨日(6 月 13 日)发布博文,报道称在 macOS 26 Tahoe 中

Python实现精准提取 PDF中的文本,表格与图片

《Python实现精准提取PDF中的文本,表格与图片》在实际的系统开发中,处理PDF文件不仅限于读取整页文本,还有提取文档中的表格数据,图片或特定区域的内容,下面我们来看看如何使用Python实... 目录安装 python 库提取 PDF 文本内容:获取整页文本与指定区域内容获取页面上的所有文本内容获取

六个案例搞懂mysql间隙锁

《六个案例搞懂mysql间隙锁》MySQL中的间隙是指索引中两个索引键之间的空间,间隙锁用于防止范围查询期间的幻读,本文主要介绍了六个案例搞懂mysql间隙锁,具有一定的参考价值,感兴趣的可以了解一下... 目录概念解释间隙锁详解间隙锁触发条件间隙锁加锁规则案例演示案例一:唯一索引等值锁定存在的数据案例二:

MySQL 表的内外连接案例详解

《MySQL表的内外连接案例详解》本文给大家介绍MySQL表的内外连接,结合实例代码给大家介绍的非常详细,对大家的学习或工作具有一定的参考借鉴价值,需要的朋友参考下吧... 目录表的内外连接(重点)内连接外连接表的内外连接(重点)内连接内连接实际上就是利用where子句对两种表形成的笛卡儿积进行筛选,我

详解如何使用Python从零开始构建文本统计模型

《详解如何使用Python从零开始构建文本统计模型》在自然语言处理领域,词汇表构建是文本预处理的关键环节,本文通过Python代码实践,演示如何从原始文本中提取多尺度特征,并通过动态调整机制构建更精确... 目录一、项目背景与核心思想二、核心代码解析1. 数据加载与预处理2. 多尺度字符统计3. 统计结果可

SpringBoot整合Sa-Token实现RBAC权限模型的过程解析

《SpringBoot整合Sa-Token实现RBAC权限模型的过程解析》:本文主要介绍SpringBoot整合Sa-Token实现RBAC权限模型的过程解析,本文给大家介绍的非常详细,对大家的学... 目录前言一、基础概念1.1 RBAC模型核心概念1.2 Sa-Token核心功能1.3 环境准备二、表结

Java Stream.reduce()方法操作实际案例讲解

《JavaStream.reduce()方法操作实际案例讲解》reduce是JavaStreamAPI中的一个核心操作,用于将流中的元素组合起来产生单个结果,:本文主要介绍JavaStream.... 目录一、reduce的基本概念1. 什么是reduce操作2. reduce方法的三种形式二、reduce

Spring Boot 整合 Redis 实现数据缓存案例详解

《SpringBoot整合Redis实现数据缓存案例详解》Springboot缓存,默认使用的是ConcurrentMap的方式来实现的,然而我们在项目中并不会这么使用,本文介绍SpringB... 目录1.添加 Maven 依赖2.配置Redis属性3.创建 redisCacheManager4.使用Sp

springboot项目redis缓存异常实战案例详解(提供解决方案)

《springboot项目redis缓存异常实战案例详解(提供解决方案)》redis基本上是高并发场景上会用到的一个高性能的key-value数据库,属于nosql类型,一般用作于缓存,一般是结合数据... 目录缓存异常实践案例缓存穿透问题缓存击穿问题(其中也解决了穿透问题)完整代码缓存异常实践案例Red