Datacamp 笔记代码 Supervised Learning with scikit-learn 第四章 Preprocessing and pipelines

本文主要是介绍Datacamp 笔记代码 Supervised Learning with scikit-learn 第四章 Preprocessing and pipelines,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

更多原始数据文档和JupyterNotebook
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python

Datacamp track: Data Scientist with Python - Course 21 (4)

Exercise

Exploring categorical features

The Gapminder dataset that you worked with in previous chapters also contained a categorical 'Region' feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now however, you do, so we have added it back in!

Your job in this exercise is to explore this feature. Boxplots are particularly useful for visualizing categorical features such as this.

Instruction

  • Import pandas as pd.
  • Read the CSV file 'gapminder.csv' into a DataFrame called df.
  • Use pandas to create a boxplot showing the variation of life expectancy ('life') by region ('Region'). To do so, pass the column names in to df.boxplot() (in that order).
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from urllib.request import urlretrieve
fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/gm_2008_region.csv'
urlretrieve(fn, 'gapminder.csv')
('gapminder.csv', <http.client.HTTPMessage at 0x11515f0b8>)
# Import pandas
import pandas as pd# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)# Show the plot
plt.show()

[外链图片转存失败(img-TzFIQdvn-1564048295763)(output_2_0.png)]

Exercise

Creating dummy variables

As Andy discussed in the video, scikit-learn does not accept non-numerical features. You saw in the previous exercise that the 'Region' feature contains very useful information that can predict life expectancy. For example, Sub-Saharan Africa has a lower life expectancy compared to Europe and Central Asia. Therefore, if you are trying to predict life expectancy, it would be preferable to retain the 'Region' feature. To do this, you need to binarize it by creating dummy variables, which is what you will do in this exercise.

Instruction

  • Use the pandas get_dummies() function to create dummy variables from the df DataFrame. Store the result as df_region.
  • Print the columns of df_region. This has been done for you.
  • Use the get_dummies() function again, this time specifying drop_first=True to drop the unneeded dummy variable (in this case, 'Region_America').
  • Hit 'Submit Answer to print the new columns of df_region and take note of how one column was dropped!
# Create dummy variables: df_region
df_region = pd.get_dummies(df)# Print the columns of df_region
print(df_region.columns)# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first=True)# Print the new columns of df_region
print(df_region.columns)
Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP','BMI_female', 'life', 'child_mortality', 'Region_America','Region_East Asia & Pacific', 'Region_Europe & Central Asia','Region_Middle East & North Africa', 'Region_South Asia','Region_Sub-Saharan Africa'],dtype='object')
Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP','BMI_female', 'life', 'child_mortality', 'Region_East Asia & Pacific','Region_Europe & Central Asia', 'Region_Middle East & North Africa','Region_South Asia', 'Region_Sub-Saharan Africa'],dtype='object')

Exercise

Regression with categorical features

Having created the dummy variables from the 'Region'feature, you can build regression models as you did before. Here, you’ll use ridge regression to perform 5-fold cross-validation.

The feature array X and target variable array y have been pre-loaded.

Instruction

  • Import Ridge from sklearn.linear_model and cross_val_score from sklearn.model_selection.
  • Instantiate a ridge regressor called ridge with alpha=0.5 and normalize=True.
  • Perform 5-fold cross-validation on X and y using the cross_val_score() function.
  • Print the cross-validated scores.
# modified/added by Jinny
import pandas as pd
import numpy as npfrom urllib.request import urlretrievefn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1939/datasets/gm_2008_region.csv'
urlretrieve(fn, 'gapminder.csv')df = pd.read_csv('gapminder.csv')df_region = pd.get_dummies(df)df_region = df_region.drop('Region_America', axis=1)X = df_region.drop('life', axis=1).valuesy = df_region['life'].values
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha=0.5, normalize=True)# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv=5)# Print the cross-validated scores
print(ridge_cv)
[0.86808336 0.80623545 0.84004203 0.7754344  0.87503712]

Exercise

Dropping missing data

The voting dataset from Chapter 1 contained a bunch of missing values that we dealt with for you behind the scenes. Now, it’s time for you to take care of these yourself!

The unprocessed dataset has been loaded into a DataFrame df. Explore it in the IPython Shell with the .head()method. You will see that there are certain data points labeled with a '?'. These denote missing values. As you saw in the video, different datasets encode missing values in different ways. Sometimes it may be a '9999', other times a 0 - real-world data can be very messy! If you’re lucky, the missing values will already be encoded as NaN. We use NaNbecause it is an efficient and simplified way of internally representing missing data, and it lets us take advantage of pandas methods such as .dropna() and .fillna(), as well as scikit-learn’s Imputation transformer Imputer

这篇关于Datacamp 笔记代码 Supervised Learning with scikit-learn 第四章 Preprocessing and pipelines的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/1028848

相关文章

Java集合之Iterator迭代器实现代码解析

《Java集合之Iterator迭代器实现代码解析》迭代器Iterator是Java集合框架中的一个核心接口,位于java.util包下,它定义了一种标准的元素访问机制,为各种集合类型提供了一种统一的... 目录一、什么是Iterator二、Iterator的核心方法三、基本使用示例四、Iterator的工

Java 线程池+分布式实现代码

《Java线程池+分布式实现代码》在Java开发中,池通过预先创建并管理一定数量的资源,避免频繁创建和销毁资源带来的性能开销,从而提高系统效率,:本文主要介绍Java线程池+分布式实现代码,需要... 目录1. 线程池1.1 自定义线程池实现1.1.1 线程池核心1.1.2 代码示例1.2 总结流程2. J

JS纯前端实现浏览器语音播报、朗读功能的完整代码

《JS纯前端实现浏览器语音播报、朗读功能的完整代码》在现代互联网的发展中,语音技术正逐渐成为改变用户体验的重要一环,下面:本文主要介绍JS纯前端实现浏览器语音播报、朗读功能的相关资料,文中通过代码... 目录一、朗读单条文本:① 语音自选参数,按钮控制语音:② 效果图:二、朗读多条文本:① 语音有默认值:②

Vue实现路由守卫的示例代码

《Vue实现路由守卫的示例代码》Vue路由守卫是控制页面导航的钩子函数,主要用于鉴权、数据预加载等场景,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价值,需要的朋友们下面随着... 目录一、概念二、类型三、实战一、概念路由守卫(Navigation Guards)本质上就是 在路

uni-app小程序项目中实现前端图片压缩实现方式(附详细代码)

《uni-app小程序项目中实现前端图片压缩实现方式(附详细代码)》在uni-app开发中,文件上传和图片处理是很常见的需求,但也经常会遇到各种问题,下面:本文主要介绍uni-app小程序项目中实... 目录方式一:使用<canvas>实现图片压缩(推荐,兼容性好)示例代码(小程序平台):方式二:使用uni

JAVA实现Token自动续期机制的示例代码

《JAVA实现Token自动续期机制的示例代码》本文主要介绍了JAVA实现Token自动续期机制的示例代码,通过动态调整会话生命周期平衡安全性与用户体验,解决固定有效期Token带来的风险与不便,感兴... 目录1. 固定有效期Token的内在局限性2. 自动续期机制:兼顾安全与体验的解决方案3. 总结PS

C#中通过Response.Headers设置自定义参数的代码示例

《C#中通过Response.Headers设置自定义参数的代码示例》:本文主要介绍C#中通过Response.Headers设置自定义响应头的方法,涵盖基础添加、安全校验、生产实践及调试技巧,强... 目录一、基础设置方法1. 直接添加自定义头2. 批量设置模式二、高级配置技巧1. 安全校验机制2. 类型

Python屏幕抓取和录制的详细代码示例

《Python屏幕抓取和录制的详细代码示例》随着现代计算机性能的提高和网络速度的加快,越来越多的用户需要对他们的屏幕进行录制,:本文主要介绍Python屏幕抓取和录制的相关资料,需要的朋友可以参考... 目录一、常用 python 屏幕抓取库二、pyautogui 截屏示例三、mss 高性能截图四、Pill

使用MapStruct实现Java对象映射的示例代码

《使用MapStruct实现Java对象映射的示例代码》本文主要介绍了使用MapStruct实现Java对象映射的示例代码,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价值,... 目录一、什么是 MapStruct?二、实战演练:三步集成 MapStruct第一步:添加 Mave

Java抽象类Abstract Class示例代码详解

《Java抽象类AbstractClass示例代码详解》Java中的抽象类(AbstractClass)是面向对象编程中的重要概念,它通过abstract关键字声明,用于定义一组相关类的公共行为和属... 目录一、抽象类的定义1. 语法格式2. 核心特征二、抽象类的核心用途1. 定义公共接口2. 提供默认实