数据分析实例1(英文报告)--预测未来收入--SAS 逻辑回归--1994年美国人口普查数据

本文主要是介绍数据分析实例1(英文报告)--预测未来收入--SAS 逻辑回归--1994年美国人口普查数据,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

Prediction of Future Income by Using Logistic Regression

Matthew LaFrance, Yu Zhang

 

1 Introduction

Many factors could influence a person’s annual income, for example, age, gender, race, level of education, marriage status, nationality, etc. The authors tried to fit four models of influential factors that based on a census dataset and to find a precise prediction of annual income.

The dataset was found at the UCI Machine Learning Repository. The data consists of 48,844 observations from a 1994 U.S. Census. The target variable, “Salary”, has two levels: >50k and <=50k. There are 8 categorical features and 5
numeric features consisting of demographic, educational, and occupational information. Table 1 is the variables in SAS version.
 

Table 1 Variables in the Income Dataset
在这里插入图片描述
The target variable “Salary” is notably unbalanced (in table 2). As a result, we noted that using raw accuracy as a success metric could potentially be misleading because a model that classifies all observations as <=50k would achieve roughly 76% accuracy.
 
Table 2 Salary Overview
在这里插入图片描述

2 Data Preprocessing

2.1 Missing Values

3,622 observations contained missing values primarily in the “occupation” column. Because the occupation column had many factor levels, we decided many imputation methods wouldn’t retain enough information to justify the increase in bias, so we decided to take only complete cases. As a result, our conclusions may be biased, and we assume that the deleted observations had missing values at random. For future work, it may be worthwhile to explore other methods of handling the missing data. After deletion, the final dataset had 45,222 observations.
 

2.2 Multicollinearity Checks

None of our numeric features showed strong correlations between each other (see table 3). As a result, we were not particularly concerned about multicollinearity.
 

Table 3 The Result of Correlation Checks
在这里插入图片描述

2.3 Exploratory Data Analysis and Feature Engineering

In our initial looks at the data, we were able to make several noteworthy observations which will be detailed below by variables and summarized at the end.

2.3.1 Capital Gains and Capital Losses
Figure 1 Overview of Capital Gains
Figure 1 Overview of Capital Gains
在这里插入图片描述
Figure 2 Overview of Capital Losses
 

Regarding the capital gains and loss variables, it is worth noting that most individuals in our dataset do not have any investments (see Figure 1 and Figure 2).

2.3.2 Native Country
As expected, most observations are U.S. natives (see Figure 3). The native country variable consists of many factor levels. In order to avoid having too many dummy variables later on, we decided it would be necessary to rebin this feature into a “0, and 1 ” indicator variable of being a native-born citizen.
在这里插入图片描述
Figure 3 Overview of Native Countries
 

2.3.3 Marital Status
In looking into the marital status feature, we noticed several different levels all representing married (see Figure 4). We decided to combine all these levels into one level, “married”, for more convenient interpretation. Additionally, it was worth noting that married individuals appear to more consistently make greater than 50k.

这篇关于数据分析实例1(英文报告)--预测未来收入--SAS 逻辑回归--1994年美国人口普查数据的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/700653

相关文章

SpringBoot多环境配置数据读取方式

《SpringBoot多环境配置数据读取方式》SpringBoot通过环境隔离机制,支持properties/yaml/yml多格式配置,结合@Value、Environment和@Configura... 目录一、多环境配置的核心思路二、3种配置文件格式详解2.1 properties格式(传统格式)1.

解决pandas无法读取csv文件数据的问题

《解决pandas无法读取csv文件数据的问题》本文讲述作者用Pandas读取CSV文件时因参数设置不当导致数据错位,通过调整delimiter和on_bad_lines参数最终解决问题,并强调正确参... 目录一、前言二、问题复现1. 问题2. 通过 on_bad_lines=‘warn’ 跳过异常数据3

MySQL多实例管理如何在一台主机上运行多个mysql

《MySQL多实例管理如何在一台主机上运行多个mysql》文章详解了在Linux主机上通过二进制方式安装MySQL多实例的步骤,涵盖端口配置、数据目录准备、初始化与启动流程,以及排错方法,适用于构建读... 目录一、什么是mysql多实例二、二进制方式安装MySQL1.获取二进制代码包2.安装基础依赖3.清

C#监听txt文档获取新数据方式

《C#监听txt文档获取新数据方式》文章介绍通过监听txt文件获取最新数据,并实现开机自启动、禁用窗口关闭按钮、阻止Ctrl+C中断及防止程序退出等功能,代码整合于主函数中,供参考学习... 目录前言一、监听txt文档增加数据二、其他功能1. 设置开机自启动2. 禁止控制台窗口关闭按钮3. 阻止Ctrl +

java如何实现高并发场景下三级缓存的数据一致性

《java如何实现高并发场景下三级缓存的数据一致性》这篇文章主要为大家详细介绍了java如何实现高并发场景下三级缓存的数据一致性,文中的示例代码讲解详细,感兴趣的小伙伴可以跟随小编一起学习一下... 下面代码是一个使用Java和Redisson实现的三级缓存服务,主要功能包括:1.缓存结构:本地缓存:使

SpringBoot 异常处理/自定义格式校验的问题实例详解

《SpringBoot异常处理/自定义格式校验的问题实例详解》文章探讨SpringBoot中自定义注解校验问题,区分参数级与类级约束触发的异常类型,建议通过@RestControllerAdvice... 目录1. 问题简要描述2. 异常触发1) 参数级别约束2) 类级别约束3. 异常处理1) 字段级别约束

Apache Ignite缓存基本操作实例详解

《ApacheIgnite缓存基本操作实例详解》文章介绍了ApacheIgnite中IgniteCache的基本操作,涵盖缓存获取、动态创建、销毁、原子及条件更新、异步执行,强调线程池注意事项,避免... 目录一、获取缓存实例(Getting an Instance of a Cache)示例代码:二、动态

在MySQL中实现冷热数据分离的方法及使用场景底层原理解析

《在MySQL中实现冷热数据分离的方法及使用场景底层原理解析》MySQL冷热数据分离通过分表/分区策略、数据归档和索引优化,将频繁访问的热数据与冷数据分开存储,提升查询效率并降低存储成本,适用于高并发... 目录实现冷热数据分离1. 分表策略2. 使用分区表3. 数据归档与迁移在mysql中实现冷热数据分

C#解析JSON数据全攻略指南

《C#解析JSON数据全攻略指南》这篇文章主要为大家详细介绍了使用C#解析JSON数据全攻略指南,文中的示例代码讲解详细,感兴趣的小伙伴可以跟随小编一起学习一下... 目录一、为什么jsON是C#开发必修课?二、四步搞定网络JSON数据1. 获取数据 - HttpClient最佳实践2. 动态解析 - 快速

MySQL逻辑删除与唯一索引冲突解决方案

《MySQL逻辑删除与唯一索引冲突解决方案》本文探讨MySQL逻辑删除与唯一索引冲突问题,提出四种解决方案:复合索引+时间戳、修改唯一字段、历史表、业务层校验,推荐方案1和方案3,适用于不同场景,感兴... 目录问题背景问题复现解决方案解决方案1.复合唯一索引 + 时间戳删除字段解决方案2:删除后修改唯一字