数据分析实例1(英文报告)--预测未来收入--SAS 逻辑回归--1994年美国人口普查数据

本文主要是介绍数据分析实例1(英文报告)--预测未来收入--SAS 逻辑回归--1994年美国人口普查数据,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

Prediction of Future Income by Using Logistic Regression

Matthew LaFrance, Yu Zhang

 

1 Introduction

Many factors could influence a person’s annual income, for example, age, gender, race, level of education, marriage status, nationality, etc. The authors tried to fit four models of influential factors that based on a census dataset and to find a precise prediction of annual income.

The dataset was found at the UCI Machine Learning Repository. The data consists of 48,844 observations from a 1994 U.S. Census. The target variable, “Salary”, has two levels: >50k and <=50k. There are 8 categorical features and 5
numeric features consisting of demographic, educational, and occupational information. Table 1 is the variables in SAS version.
 

Table 1 Variables in the Income Dataset
在这里插入图片描述
The target variable “Salary” is notably unbalanced (in table 2). As a result, we noted that using raw accuracy as a success metric could potentially be misleading because a model that classifies all observations as <=50k would achieve roughly 76% accuracy.
 
Table 2 Salary Overview
在这里插入图片描述

2 Data Preprocessing

2.1 Missing Values

3,622 observations contained missing values primarily in the “occupation” column. Because the occupation column had many factor levels, we decided many imputation methods wouldn’t retain enough information to justify the increase in bias, so we decided to take only complete cases. As a result, our conclusions may be biased, and we assume that the deleted observations had missing values at random. For future work, it may be worthwhile to explore other methods of handling the missing data. After deletion, the final dataset had 45,222 observations.
 

2.2 Multicollinearity Checks

None of our numeric features showed strong correlations between each other (see table 3). As a result, we were not particularly concerned about multicollinearity.
 

Table 3 The Result of Correlation Checks
在这里插入图片描述

2.3 Exploratory Data Analysis and Feature Engineering

In our initial looks at the data, we were able to make several noteworthy observations which will be detailed below by variables and summarized at the end.

2.3.1 Capital Gains and Capital Losses
Figure 1 Overview of Capital Gains
Figure 1 Overview of Capital Gains
在这里插入图片描述
Figure 2 Overview of Capital Losses
 

Regarding the capital gains and loss variables, it is worth noting that most individuals in our dataset do not have any investments (see Figure 1 and Figure 2).

2.3.2 Native Country
As expected, most observations are U.S. natives (see Figure 3). The native country variable consists of many factor levels. In order to avoid having too many dummy variables later on, we decided it would be necessary to rebin this feature into a “0, and 1 ” indicator variable of being a native-born citizen.
在这里插入图片描述
Figure 3 Overview of Native Countries
 

2.3.3 Marital Status
In looking into the marital status feature, we noticed several different levels all representing married (see Figure 4). We decided to combine all these levels into one level, “married”, for more convenient interpretation. Additionally, it was worth noting that married individuals appear to more consistently make greater than 50k.

这篇关于数据分析实例1(英文报告)--预测未来收入--SAS 逻辑回归--1994年美国人口普查数据的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/700653

相关文章

SQL Server修改数据库名及物理数据文件名操作步骤

《SQLServer修改数据库名及物理数据文件名操作步骤》在SQLServer中重命名数据库是一个常见的操作,但需要确保用户具有足够的权限来执行此操作,:本文主要介绍SQLServer修改数据... 目录一、背景介绍二、操作步骤2.1 设置为单用户模式(断开连接)2.2 修改数据库名称2.3 查找逻辑文件名

Python实例题之pygame开发打飞机游戏实例代码

《Python实例题之pygame开发打飞机游戏实例代码》对于python的学习者,能够写出一个飞机大战的程序代码,是不是感觉到非常的开心,:本文主要介绍Python实例题之pygame开发打飞机... 目录题目pygame-aircraft-game使用 Pygame 开发的打飞机游戏脚本代码解释初始化部

canal实现mysql数据同步的详细过程

《canal实现mysql数据同步的详细过程》:本文主要介绍canal实现mysql数据同步的详细过程,本文通过实例图文相结合给大家介绍的非常详细,对大家的学习或工作具有一定的参考借鉴价值,需要的... 目录1、canal下载2、mysql同步用户创建和授权3、canal admin安装和启动4、canal

使用SpringBoot整合Sharding Sphere实现数据脱敏的示例

《使用SpringBoot整合ShardingSphere实现数据脱敏的示例》ApacheShardingSphere数据脱敏模块,通过SQL拦截与改写实现敏感信息加密存储,解决手动处理繁琐及系统改... 目录痛点一:痛点二:脱敏配置Quick Start——Spring 显示配置:1.引入依赖2.创建脱敏

Spring组件实例化扩展点之InstantiationAwareBeanPostProcessor使用场景解析

《Spring组件实例化扩展点之InstantiationAwareBeanPostProcessor使用场景解析》InstantiationAwareBeanPostProcessor是Spring... 目录一、什么是InstantiationAwareBeanPostProcessor?二、核心方法解

详解如何使用Python构建从数据到文档的自动化工作流

《详解如何使用Python构建从数据到文档的自动化工作流》这篇文章将通过真实工作场景拆解,为大家展示如何用Python构建自动化工作流,让工具代替人力完成这些数字苦力活,感兴趣的小伙伴可以跟随小编一起... 目录一、Excel处理:从数据搬运工到智能分析师二、PDF处理:文档工厂的智能生产线三、邮件自动化:

java String.join()方法实例详解

《javaString.join()方法实例详解》String.join()是Java提供的一个实用方法,用于将多个字符串按照指定的分隔符连接成一个字符串,这一方法是Java8中引入的,极大地简化了... 目录bVARxMJava String.join() 方法详解1. 方法定义2. 基本用法2.1 拼接

Python数据分析与可视化的全面指南(从数据清洗到图表呈现)

《Python数据分析与可视化的全面指南(从数据清洗到图表呈现)》Python是数据分析与可视化领域中最受欢迎的编程语言之一,凭借其丰富的库和工具,Python能够帮助我们快速处理、分析数据并生成高质... 目录一、数据采集与初步探索二、数据清洗的七种武器1. 缺失值处理策略2. 异常值检测与修正3. 数据

pandas实现数据concat拼接的示例代码

《pandas实现数据concat拼接的示例代码》pandas.concat用于合并DataFrame或Series,本文主要介绍了pandas实现数据concat拼接的示例代码,具有一定的参考价值,... 目录语法示例:使用pandas.concat合并数据默认的concat:参数axis=0,join=

C#代码实现解析WTGPS和BD数据

《C#代码实现解析WTGPS和BD数据》在现代的导航与定位应用中,准确解析GPS和北斗(BD)等卫星定位数据至关重要,本文将使用C#语言实现解析WTGPS和BD数据,需要的可以了解下... 目录一、代码结构概览1. 核心解析方法2. 位置信息解析3. 经纬度转换方法4. 日期和时间戳解析5. 辅助方法二、L