Data Mining: 5. Association Analysis

2023-12-10 11:01


6. Association Analysis

Given a set of records, each of which contains some number of items from a given collection,
produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
Application areas: marketing and sales promotion, content-based recommendation, customer loyalty programs

Association analysis was initially used for market basket analysis, to find how items purchased by customers are related. It was later extended to more complex data structures such as sequential patterns and subgraph patterns.

6.1 Simple Approach: Pearson’s correlation coefficient

Pearson's correlation coefficient in Association Analysis

Note that correlation does not imply causation.
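
As a rough illustration of this simple approach, the sketch below computes Pearson's correlation coefficient between two items encoded as 0/1 columns, with one entry per transaction. The toy baskets and the helper name `pearson` are made up for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: 1 = item is in the basket, 0 = it is not.
bread  = [1, 1, 0, 1, 0, 1, 0, 0]
butter = [1, 1, 0, 1, 0, 0, 0, 1]

r = pearson(bread, butter)
print(f"corr(bread, butter) = {r:.2f}")
```

A positive value only says that the two items tend to appear together; it says nothing about which purchase, if any, triggers the other.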

6.2 Definitions

6.2.1 Frequent Itemset

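An itemset is a collection of one or more items. A frequent itemset is an itemset whose support is greater than or equal to the minsup threshold.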

6.2.2 Association Rule

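An association rule is an implication of the form X → Y, where X and Y are disjoint, non-empty itemsets. Each rule is a binary partitioning of an itemset.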

6.2.3 Evaluation Metrics

Evaluation Metrics
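
The two standard metrics are support (the fraction of transactions that contain an itemset) and confidence (how often the consequent appears in transactions that contain the antecedent). A minimal sketch of both follows; the five toy transactions and the function names are assumptions chosen for illustration.

```python
# Hypothetical market-basket data, one set of items per transaction.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)

s = support({"Milk", "Diaper", "Beer"}, transactions)
c = confidence({"Milk", "Diaper"}, {"Beer"}, transactions)
print(f"support = {s:.2f}, confidence = {c:.2f}")
```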

6.3 Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
minsup and minconf are provided by the user.
Brute-force approach
Step 1: List all possible association rules
Step 2: Compute the support and confidence for each rule
Step 3: Remove rules that fail the minsup or minconf threshold

However, this is computationally prohibitive because the number of candidate rules is very large.
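
To get a feeling for how quickly the number of candidates grows, the sketch below counts all possible rules X → Y (X and Y non-empty and disjoint) over d items, once by enumeration and once with the closed form 3^d - 2^(d+1) + 1. The function names are illustrative only.

```python
from itertools import combinations

def count_rules_brute_force(d):
    """Count rules X -> Y with X, Y non-empty and disjoint over d items."""
    total = 0
    for k in range(2, d + 1):                      # k = size of X ∪ Y
        for _itemset in combinations(range(d), k):
            total += 2 ** k - 2                    # non-empty proper subsets as X
    return total

def count_rules_closed_form(d):
    return 3 ** d - 2 ** (d + 1) + 1

for d in (3, 6, 10):
    print(d, count_rules_brute_force(d), count_rules_closed_form(d))
```

Already for d = 6 items there are 602 candidate rules, and the count roughly triples with every additional item.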

Brute-force Approach

Mining Association Rules

6.4 Apriori Algorithm

Two-step approach
Step1: Frequent Itemset Generation (Generate all itemsets whose support ≥ minsup)
Step2: Rule Generation (Generate high confidence rules from each frequent itemset; where each rule is a binary partitioning of a frequent itemset)

However, frequent itemset generation is still computationally expensive… Given d items, there are 2^d candidate itemsets!

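Anti-Monotonicity of Support
The support of an itemset never exceeds the support of any of its subsets. Hence every subset of a frequent itemset is frequent, and every superset of an infrequent itemset is infrequent (Apriori principle).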

Steps

  1. Start at k=1
  2. Generate frequent itemsets of length k=1
  3. Repeat until no new frequent itemsets are identified
    1. Generate length (k+1) candidate itemsets from length k frequent itemsets; increase k
    2. Prune candidate itemsets that cannot be frequent because they contain subsets of length k that are infrequent (Apriori Principle)
    3. Count the support of each remaining candidate by scanning the DB
    4. Eliminate candidates that are infrequent, leaving only those that are frequent
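
The steps above can be sketched as follows. This is a simplified illustration, not a tuned implementation: candidates are generated by taking unions of frequent k-itemsets (rather than a prefix-based join), and the toy transactions and the name `apriori` are assumptions.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # k = 1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    result = dict(frequent)
    k = 1
    while frequent:
        prev = list(frequent)
        # Generate length-(k+1) candidates from frequent k-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count the support of the remaining candidates by scanning the DB
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Keep only the frequent candidates
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        result.update(frequent)
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent_itemsets = apriori(transactions, minsup_count=3)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```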

Illustrating the Apriori Principle

From Frequent Itemsets to Rules

Challenge: Combinatorial Explosion
A frequent itemset with k items yields 2^k - 2 candidate rules, one for every split into a non-empty antecedent and a non-empty consequent.

Rule Generation

Rule Generation for Apriori Algorithm
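
A minimal sketch of rule generation, assuming the support counts of all frequent itemsets are already available (the values below are hypothetical toy counts). For brevity it tries every binary partitioning of each itemset and does not apply the confidence-based pruning that Apriori-style rule generation normally uses.

```python
from itertools import combinations

# Hypothetical support counts of the frequent itemsets.
support_counts = {
    frozenset({"Bread"}): 4,
    frozenset({"Milk"}): 4,
    frozenset({"Diaper"}): 4,
    frozenset({"Beer"}): 3,
    frozenset({"Bread", "Milk"}): 3,
    frozenset({"Bread", "Diaper"}): 3,
    frozenset({"Milk", "Diaper"}): 3,
    frozenset({"Diaper", "Beer"}): 3,
}

def generate_rules(support_counts, minconf):
    """Return (antecedent, consequent, confidence) for all rules >= minconf."""
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                # Every subset of a frequent itemset is frequent,
                # so the antecedent's count is always in the table.
                conf = count / support_counts[antecedent]
                if conf >= minconf:
                    rules.append((antecedent, consequent, conf))
    return rules

rules = generate_rules(support_counts, minconf=0.8)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), f"conf = {conf:.2f}")
```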

Complexity of Apriori Algorithm

6.5 FP-growth Algorithm

FP-growth is usually faster than Apriori and requires at most two passes over the database.
It uses a compressed representation of the database, the FP-tree.
Once the FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets.
FP-Tree Construction
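
The two database passes can be sketched as follows: the first pass counts item frequencies, the second inserts each transaction into a prefix tree with its frequent items sorted by descending frequency. This is a simplified illustration; the header table and node links that FP-growth uses for mining are omitted, and the class and function names are assumptions.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, minsup_count):
    # Pass 1: count how many transactions contain each item
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= minsup_count}
    root = FPNode(None)
    # Pass 2: insert transactions, frequent items only, most frequent first
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))   # ties broken alphabetically
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1         # shared prefixes are compressed into counts
    return root, freq

def print_tree(node, depth=0):
    for child in sorted(node.children.values(), key=lambda n: n.item):
        print("  " * depth + f"{child.item}:{child.count}")
        print_tree(child, depth + 1)

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
root, freq = build_fp_tree(transactions, minsup_count=3)
print_tree(root)
```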


FP-Growth(Summary)

6.6 Interestingness Measures

Interestingness measures can be used to prune or rank the derived rules.
In the original formulation of association rules, support and confidence are the only interestingness measures used.
Various other measures have been proposed.

Drawback of Confidence

6.6.1 Correlation

Correlation takes into account all data at once.
In our scenario: corr(tea,coffee) = -0.25
i.e., the correlation is negative
Interpretation: people who drink tea are less likely to drink coffee
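
For two binary variables, Pearson's correlation can be computed directly from the 2x2 contingency table (the phi coefficient). The counts below are an assumption: they are the widely used textbook tea/coffee table, chosen because they reproduce the value -0.25 quoted above; the original figure is not reproduced here.

```python
from math import sqrt

def phi_coefficient(n11, n10, n01, n00):
    """Correlation of two binary variables from their contingency counts.

    n11: both present, n10: only the first, n01: only the second, n00: neither.
    """
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    return (n11 * n00 - n10 * n01) / sqrt(row1 * row0 * col1 * col0)

# Assumed counts for 100 people: tea & coffee, tea only, coffee only, neither.
corr = phi_coefficient(n11=15, n10=5, n01=75, n00=5)
print(f"corr(tea, coffee) = {corr:.2f}")
```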

6.6.2 Lift

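Lift compares the confidence of a rule with the support of its consequent: lift(X → Y) = c(X → Y) / s(Y) = s(X ∪ Y) / (s(X) · s(Y)).
A lift of 1 indicates that X and Y are independent; values above 1 indicate a positive association and values below 1 a negative one.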

Example: Lift

Lift and correlation are symmetric [lift(tea → coffee) = lift(coffee → tea)].
Confidence is asymmetric.
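
The symmetry of lift and the asymmetry of confidence can be checked with a few lines. The counts are an assumption (the common textbook tea/coffee table with 100 people, 20 tea drinkers, 90 coffee drinkers and 15 who drink both), and the function names are illustrative.

```python
def confidence(n_xy, n_x):
    """c(X -> Y) = s(X ∪ Y) / s(X), computed from counts."""
    return n_xy / n_x

def lift(n_xy, n_x, n_y, n):
    """lift(X -> Y) = s(X ∪ Y) / (s(X) * s(Y)), computed from counts."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))

# Assumed counts, not taken from the original figure.
n, n_tea, n_coffee, n_both = 100, 20, 90, 15

lift_tc = lift(n_both, n_tea, n_coffee, n)
lift_ct = lift(n_both, n_coffee, n_tea, n)
conf_tc = confidence(n_both, n_tea)
conf_ct = confidence(n_both, n_coffee)

print(f"lift(tea -> coffee) = {lift_tc:.3f}, lift(coffee -> tea) = {lift_ct:.3f}")
print(f"conf(tea -> coffee) = {conf_tc:.3f}, conf(coffee -> tea) = {conf_ct:.3f}")
```

With these counts the rule tea → coffee has a high confidence although its lift is below 1, which is the drawback of confidence discussed above.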

6.6.3 Others

6.7 Handling Continuous and Categorical Attributes

6.7.1 Handling Categorical Attributes

Transform each categorical attribute into asymmetric binary variables by introducing a new “item” for each distinct attribute-value pair (one-hot encoding).
Potential issues
(1) Many attribute values
Many of the attribute values may have very low support.
Potential solution: aggregate the low-support attribute values into a bin for “other”.
(2) Highly skewed attribute values
Example: 95% of the visitors have Buy = No.
Most items will then be associated with the (Buy=No) item.
Potential solution: drop the highly frequent items.
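
A small sketch of this encoding, including the “other” bin for rare values. The records, the threshold `min_count` and the function name `to_items` are hypothetical.

```python
from collections import Counter

def to_items(records, min_count):
    """Turn attribute-value records into transactions of 'attr=value' items.

    Values that occur fewer than min_count times are merged into 'attr=other'.
    """
    counts = Counter((attr, value) for r in records for attr, value in r.items())
    transactions = []
    for r in records:
        items = set()
        for attr, value in r.items():
            kept = value if counts[(attr, value)] >= min_count else "other"
            items.add(f"{attr}={kept}")
        transactions.append(items)
    return transactions

# Hypothetical web-shop visits.
records = [
    {"Browser": "Chrome",  "Buy": "No"},
    {"Browser": "Chrome",  "Buy": "No"},
    {"Browser": "Firefox", "Buy": "Yes"},
    {"Browser": "Lynx",    "Buy": "No"},
    {"Browser": "Firefox", "Buy": "Yes"},
    {"Browser": "Opera",   "Buy": "No"},
]
transactions = to_items(records, min_count=2)
for t in transactions:
    print(sorted(t))
```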

6.7.2 Handling Continuous Attributes

Transform each continuous attribute into binary variables using discretization, e.g. equal-width binning or equal-frequency binning.
Issue: the size of the intervals affects support and confidence. If the intervals are too small, the resulting items do not have enough support; if they are too large, the resulting rules do not have enough confidence.
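
The difference between the two binning strategies shows up on skewed data. The sketch below assigns a bin index to each value; the sample ages and function names are made up, and the equal-width variant assumes that not all values are identical.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bins by rank so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

ages = [18, 19, 20, 21, 22, 23, 40, 60, 78]   # hypothetical, skewed towards young
print(equal_width_bins(ages, 3))
print(equal_frequency_bins(ages, 3))
```

With equal-width binning most values fall into the first interval, so the items for the other intervals have little support; equal-frequency binning balances the support at the cost of intervals with very different widths.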

6.8 Effect of Support Distribution

Many real data sets have a skewed support distribution
How to set the appropriate minsup threshold?
If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
If minsup is set too low, it is computationally expensive and the number of itemsets is very large
Using a single minimum support threshold may not be effective
Multiple Minimum Support
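
One common way to realise multiple minimum supports, shown here as an assumption rather than as the exact scheme from the original slide, is to give every item its own minimum support and to require an itemset to reach the lowest threshold among its items. The thresholds below are hypothetical.

```python
# Hypothetical per-item minimum supports: rare items get lower thresholds.
min_item_support = {
    "Milk": 0.05,
    "Coke": 0.03,
    "Broccoli": 0.001,
    "Salmon": 0.005,
}

def itemset_minsup(itemset, min_item_support):
    """Threshold of an itemset = lowest minimum support among its items."""
    return min(min_item_support[item] for item in itemset)

def is_frequent(itemset, support, min_item_support):
    return support >= itemset_minsup(itemset, min_item_support)

print(itemset_minsup({"Milk", "Broccoli"}, min_item_support))
print(is_frequent({"Milk", "Coke"}, 0.02, min_item_support))
print(is_frequent({"Milk", "Broccoli"}, 0.002, min_item_support))
```

Under such a scheme a superset can be frequent while one of its subsets is not, so support-based pruning as in plain Apriori no longer applies directly.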

6.9 Association Rules with Temporal Components

Association Rules with Temporal Components

6.10 Subgroup Discovery

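Association rule mining: find all patterns in the data
Classification: identify the best patterns that can predict a target variable
Subgroup discovery: find all patterns that can explain a target variable
Subgroup discovery finds subgroups or subsets of the data set that have specific attributes and characteristics. The goal is to identify subgroups related to an attribute or behavior of interest, in order to understand the data better, make predictions, or take appropriate action. In some cases, subgroup discovery can be used to generate new features, which are then used in a classification task.
Subgroup discovery aims to find subgroups in the data, whereas classification aims to assign data to known classes. Subgroup discovery is usually more exploratory, classification more predictive.
We may already have strong predictor variables, but we are also interested in the weaker ones.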

Algorithms
Early algorithms: learn an unpruned decision tree; extract rules; compute measures for the rules, then rate and rank them
Newer algorithms: based on association rule mining; based on evolutionary algorithms

Rating Rules
Goals: rules should cover many examples and be accurate.
Rules with both high coverage and high accuracy are interesting.

Subgroup Discovery – Rating Rules

Subgroup Discovery – Metrics

WRAcc (Weighted Relative Accuracy)
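
Weighted relative accuracy is commonly defined as WRAcc(Cond → Class) = P(Cond) · (P(Class | Cond) - P(Class)), i.e. the coverage of a rule multiplied by its gain in accuracy over the base rate of the class. The sketch below uses this standard definition; the counts are hypothetical.

```python
def wracc(n_total, n_cond, n_class, n_cond_and_class):
    """Weighted relative accuracy of the rule Cond -> Class, from counts."""
    coverage = n_cond / n_total               # P(Cond)
    precision = n_cond_and_class / n_cond     # P(Class | Cond)
    base_rate = n_class / n_total             # P(Class)
    return coverage * (precision - base_rate)

# Hypothetical data set: 100 examples, 40 of them belong to the target class.
specific_rule = wracc(n_total=100, n_cond=20, n_class=40, n_cond_and_class=16)
trivial_rule = wracc(n_total=100, n_cond=100, n_class=40, n_cond_and_class=40)

print(f"rule covering 20 examples, 16 of them positive: {specific_rule:.3f}")
print(f"rule covering everything:                       {trivial_rule:.3f}")
```

A rule that covers every example has maximal coverage but no gain over the base rate, so its WRAcc is zero; the measure rewards rules that are both general and accurate.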

Subgroup Discovery – Summary

6.11 Summary

Association Analysis: discovering patterns in data; patterns are described by rules
Apriori & FP-Growth: find rules with minimum support (i.e., number of transactions) and minimum confidence (i.e., strength of the implication)
Subgroup Discovery: learn rules for a particular target variable; create a comprehensive model of a class




http://www.chinasem.cn/article/476920
