IMDB Movie Review Text Classification with Genetic-Algorithm Feature Selection and a Single-Layer Perceptron

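This article walks through a text classification example on IMDB movie reviews: a single-layer perceptron is trained on bag-of-words features, and a genetic algorithm is then used to select a subset of those features.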


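  • 1. Data loading and preprocessing
  • 2. Building the perceptron model
  • 3. Model training
  • 4. Feature selection with a genetic algorithm
    • Notes
  • 5. Contact us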

1. Data loading and preprocessing

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from keras.datasets import imdb
from keras.preprocessing import sequence
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

max_features = 10000
maxlen = 200

# Load the IMDB dataset
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')

# Limit the review length, pad the sequences, and keep the first 2000 reviews of each split
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)[:2000]
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)[:2000]
# Slice the labels the same way so that features and labels stay aligned;
# without this, TensorDataset fails with a size mismatch (2000 rows vs. 25000 labels)
y_train = y_train[:2000]
y_test = y_test[:2000]
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)

# Convert the integer sequences back to text
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# Example: the first training review decoded to words (not used further below)
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in input_train[0]])

# Represent the text with a bag-of-words model
# (the loop variable is named `seq` to avoid confusion with the imported `sequence` module)
vectorizer = CountVectorizer(max_features=max_features)
X_train = vectorizer.fit_transform([' '.join([reverse_word_index.get(i - 3, '?') for i in seq]) for seq in input_train])
X_test = vectorizer.transform([' '.join([reverse_word_index.get(i - 3, '?') for i in seq]) for seq in input_test])

# Convert the data to PyTorch tensors
X_train_tensor = torch.tensor(X_train.toarray(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.toarray(), dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

# Full-batch training: one batch holds all 2000 samples of a split
# (the original also set batch_size = 32 earlier, but that value was overwritten here)
batch_size = 2000
train_iter = DataLoader(TensorDataset(X_train_tensor, y_train_tensor), batch_size)
test_iter = DataLoader(TensorDataset(X_test_tensor, y_test_tensor), batch_size)
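
The `i - 3` offset used when decoding reviews above comes from how `imdb.load_data` encodes them by default: index 0 is padding, 1 marks the start of a review, 2 stands for an out-of-vocabulary word, and real words start at their rank plus 3. The following is a minimal sketch of that decoding on a made-up vocabulary; `toy_word_index`, `INDEX_FROM`, and `decode` are hypothetical names used only for illustration.

```python
# Hypothetical toy vocabulary in the same shape as imdb.get_word_index():
# word -> frequency rank, starting at 1.
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Assumption: the default offset applied by imdb.load_data (index_from=3).
INDEX_FROM = 3

# Invert the mapping: rank -> word.
reverse_toy_index = {rank: word for word, rank in toy_word_index.items()}


def decode(encoded):
    """Map encoded indices back to words; reserved indices become '?'."""
    return " ".join(reverse_toy_index.get(i - INDEX_FROM, "?") for i in encoded)


# 0 = padding, 1 = start marker, 2 = out-of-vocabulary; real words start at 4.
encoded_review = [0, 0, 1, 4, 5, 2, 6]
decoded = decode(encoded_review)
print(decoded)  # ? ? ? the movie ? great
```

The `?` placeholders do not end up as features: with its default token pattern, `CountVectorizer` only keeps tokens of two or more word characters.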

2. Building the perceptron model

# Define the perceptron network
class Perceptron(nn.Module):
    def __init__(self, input_size):
        super(Perceptron, self).__init__()
        self.fc = nn.Linear(input_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.fc(x)
        x = self.sigmoid(x)
        return x


# Train the perceptron model for one pass over the data
def train(model, iterator, optimizer, criterion):
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text, label = batch
        predictions = model(text).squeeze(1)
        loss = criterion(predictions, label)
        loss.backward()
        optimizer.step()


# Evaluate the perceptron model: returns (mean loss per batch, accuracy)
def evaluate(model, iterator, criterion):
    model.eval()
    total_loss = 0
    total_correct = 0
    with torch.no_grad():
        for batch in iterator:
            text, label = batch
            predictions = model(text).squeeze(1)
            loss = criterion(predictions, label)
            total_loss += loss.item()
            # Round the sigmoid output to 0 or 1 to get the predicted class
            rounded_preds = torch.round(predictions)
            total_correct += (rounded_preds == label).sum().item()
    return total_loss / len(iterator), total_correct / len(iterator.dataset)


# Initialize the perceptron model
input_size = X_train_tensor.shape[1]
model = Perceptron(input_size)

3. Model training

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

N_EPOCHS = 10
eval_acc_list = []
for epoch in range(N_EPOCHS):
    train(model, train_iter, optimizer, criterion)
    eval_loss, eval_acc = evaluate(model, test_iter, criterion)
    eval_acc_list.append(eval_acc)
    print(f'Epoch: {epoch+1}, Test Loss: {eval_loss:.3f}, Test Acc: {eval_acc*100:.2f}%')

plt.plot(range(N_EPOCHS), eval_acc_list)
plt.title('Test Accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.show()


4. Feature selection with a genetic algorithm

# Randomly initialize the chromosomes (one 0/1 gene per feature)
def initialize_population(population_size, num_genes):
    # # Option 1: keep each feature with probability 0.95
    # p = np.array([0.05, 0.95])
    # return np.random.choice([0, 1], size=(population_size, num_genes), p=p.ravel())
    # Option 2: keep each feature with probability 0.5
    return np.random.choice([0, 1], size=(population_size, num_genes))


# Compute the fitness of each chromosome as the classifier's test accuracy
def calculate_fitness(population, model, criterion):
    fitness = []
    for chromosome in population:  # chromosome: a 0-1 sequence
        selected_features = np.where(chromosome == 1)[0]
        # Update the model's input dimension to match the selected features
        input_dim = len(selected_features)
        model.fc = nn.Linear(input_dim, 1)
        # Re-create the optimizer so that it tracks the new layer's parameters
        optimizer = optim.Adam(model.parameters(), lr=0.001)
        idx = torch.tensor(selected_features)
        train_iter = DataLoader(TensorDataset(X_train_tensor[:, idx], y_train_tensor), batch_size)
        test_iter = DataLoader(TensorDataset(X_test_tensor[:, idx], y_test_tensor), batch_size)
        # Train, then measure the accuracy
        N_EPOCHS = 10
        for epoch in range(N_EPOCHS):
            train(model, train_iter, optimizer, criterion)
        test_loss, test_acc = evaluate(model, test_iter, criterion)
        fitness.append(test_acc)
    return np.array(fitness)


# Selection
def selection(population, fitness):  # input: the population and its accuracies
    probabilities = fitness / sum(fitness)  # accuracy-based selection probability
    # # Option 1: no randomness in selection, use the top 2 as parents
    # # (the factor 25 assumes population_size = 50)
    # probabilities_copy = probabilities.copy()
    # probabilities_copy.sort()
    # max_1 = probabilities_copy[-1]
    # max_2 = probabilities_copy[-2]
    # max_1_index = np.where(probabilities == max_1)
    # max_2_index = np.where(probabilities == max_2)
    # selected_indices = [max_1_index[0].tolist()[0], max_2_index[0].tolist()[0]] * 25
    # Option 2: random, fitness-proportional sampling
    selected_indices = np.random.choice(range(len(population)), size=len(population), p=probabilities)
    return population[selected_indices]


# Crossover (adjacent rows are treated as a pair of parents; assumes an even population size)
def crossover(parents, crossover_rate):
    children = []
    for i in range(0, len(parents), 2):
        parent1, parent2 = parents[i], parents[i + 1]
        if np.random.rand() < crossover_rate:
            crossover_point = np.random.randint(1, len(parent1))
            child1 = np.concatenate((parent1[:crossover_point], parent2[crossover_point:]))
            child2 = np.concatenate((parent2[:crossover_point], parent1[crossover_point:]))
        else:
            child1, child2 = parent1, parent2
        children.extend([child1, child2])
    return np.array(children)


# Mutation
def mutation(children, mutation_rate):
    for i in range(len(children)):
        mutation_points = np.where(np.random.rand(len(children[i])) < mutation_rate)[0]
        children[i][mutation_points] = 1 - children[i][mutation_points]  # key step: flip the selected bits
    return children


# Main function of the genetic algorithm
def genetic_algorithm(population_size, num_genes, generations, crossover_rate, mutation_rate, model, criterion):
    # Initialize the chromosomes
    population = initialize_population(population_size, num_genes)
    fitness_list = []
    best_individual, best_fitness = None, -np.inf
    for generation in range(generations):
        print('Generation', generation + 1, ":")
        # Array of shape (population_size,) with each chromosome's test accuracy
        fitness = calculate_fitness(population, model, criterion)
        # Record the best chromosome now, while `fitness` still matches `population`.
        # (The original indexed the new population with the old fitness values.)
        generation_best = population[np.argmax(fitness)].copy()
        if fitness.max() > best_fitness:
            best_fitness = fitness.max()
            best_individual = generation_best
        fitness_list.append(fitness.max())
        # Selection: array of shape (population_size, num_genes); adjacent rows are parents
        selected_population = selection(population, fitness)
        # Crossover
        children = crossover(selected_population, crossover_rate)
        # Mutation
        mutated_children = mutation(children, mutation_rate)
        # Form the new population
        population = mutated_children
        # Print the best solution of the current generation
        print(f"Generation {generation + 1}, Best Individual: {generation_best}, Fitness: {fitness.max()}")
    plt.plot(range(generations), fitness_list)
    plt.title('Test Accuracy with feature selection via genetic algorithm')
    plt.xlabel('generation')
    plt.ylabel('accuracy')
    plt.show()
    # Return the best solution seen across all generations
    return best_individual


# Run the genetic algorithm
model = Perceptron(input_size)
best_solution = genetic_algorithm(population_size=50, num_genes=input_size, generations=10, crossover_rate=0.8, mutation_rate=0.1, model=model, criterion=criterion)
print(f"Final Best Solution: {best_solution}")

# Interpret the best solution
selected_features = np.where(best_solution == 1)[0]
print(f"Selected Features: {selected_features}")
print("Shape of Selected Features = ", selected_features.shape)
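
The commented-out Option 1 branches of `initialize_population` and `selection` above can also be written as standalone functions. The sketch below is one possible way to do that, not the author's exact code: the function names, the `keep_prob` parameter, and the fitness values are hypothetical, and ties are resolved by `np.argsort` rather than by `np.where` as in the original comments. It avoids the hardcoded factor of 25 by repeating the two best indices to the population length.

```python
import numpy as np

rng = np.random.default_rng(0)


def initialize_population_biased(population_size, num_genes, keep_prob=0.95):
    """Option 1 initialization: each gene is 1 (feature kept) with probability keep_prob."""
    return rng.choice([0, 1], size=(population_size, num_genes),
                      p=[1 - keep_prob, keep_prob])


def select_top_two(population, fitness):
    """Option 1 selection: only the two fittest chromosomes become parents."""
    # Indices of the two fittest chromosomes, best first.
    top_two = np.argsort(fitness)[::-1][:2]
    # Repeat [best, second, best, second, ...] so each adjacent pair is one set of parents.
    selected_indices = np.resize(top_two, len(population))
    return population[selected_indices]


def select_roulette(population, fitness):
    """Option 2 selection: sample parents with probability proportional to fitness."""
    probabilities = fitness / fitness.sum()
    selected_indices = rng.choice(len(population), size=len(population), p=probabilities)
    return population[selected_indices]


# Made-up example: 6 chromosomes over 8 features, with made-up accuracies.
population = initialize_population_biased(6, 8)
fitness = np.array([0.60, 0.74, 0.55, 0.71, 0.50, 0.65])

parents = select_top_two(population, fitness)
print(parents.shape)  # (6, 8)
```

With top-two selection every pair passed to `crossover` consists of the same two parents, so diversity in the next generation comes only from the crossover point and from mutation.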


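Notes

  1. In this task, Option 1 in the selection function (using only the two best-performing chromosomes as parents) works better than Option 2 (sampling parents at random from the population): after 10 generations, validation accuracy is 74% versus 71%.
  2. In this task, initializing the population in initialize_population so that most features are selected (95%, Option 1) works better than selecting features uniformly at random (50%, Option 2).
  3. Each time the model structure is changed to match the dimension of the selected input features, the optimizer variable must be re-created, because the optimizer is constructed from model.parameters() and would otherwise keep pointing at the old layer's parameters.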
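
The third note can be checked with a small experiment. The following is a minimal sketch, assuming a toy model and random data (`TinyPerceptron`, `step_once`, and the tensor sizes are hypothetical, and SGD is used instead of Adam only to keep the example short): an optimizer created before `model.fc` is replaced leaves the new layer untouched, while a freshly created one updates it.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)


class TinyPerceptron(nn.Module):
    """Hypothetical stand-in with the same structure as the Perceptron class."""

    def __init__(self, input_size):
        super().__init__()
        self.fc = nn.Linear(input_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        return self.sigmoid(self.fc(x))


def step_once(model, optimizer, x, y):
    """Run a single optimization step with binary cross-entropy loss."""
    criterion = nn.BCELoss()
    optimizer.zero_grad()
    loss = criterion(model(x).squeeze(1), y)
    loss.backward()
    optimizer.step()


model = TinyPerceptron(4)
stale_optimizer = optim.SGD(model.parameters(), lr=0.1)

# Replace the layer, as calculate_fitness does for each chromosome.
model.fc = nn.Linear(3, 1)
x = torch.randn(8, 3)
y = torch.randint(0, 2, (8,)).float()

before = model.fc.weight.detach().clone()

# The stale optimizer still holds the old layer's parameters: the new weights do not move.
step_once(model, stale_optimizer, x, y)
stale_unchanged = torch.equal(before, model.fc.weight.detach())

# A freshly created optimizer sees the new parameters and updates them.
fresh_optimizer = optim.SGD(model.parameters(), lr=0.1)
step_once(model, fresh_optimizer, x, y)
fresh_changed = not torch.equal(before, model.fc.weight.detach())

print(stale_unchanged, fresh_changed)
```

No error is raised in the stale case, which is why this mistake is easy to miss: training simply has no effect on the replaced layer.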

5. Contact us

Email: oceannedlg@outlook.com




http://www.chinasem.cn/article/529850
