FuzzyKmeans的Mahout实现

2024-06-18 18:08
文章标签 实现 mahout fuzzykmeans

本文主要是介绍FuzzyKmeans的Mahout实现,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

不得不说,google更靠谱,比google更更靠谱的是官网!!!

so要好好利用google and official website!!!

https://mahout.apache.org/users/clustering/fuzzy-k-means.html

Fuzzy K-Means

Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means , the popular simple clustering technique. While K-Means discovers hard clusters (a point belong to only one cluster), Fuzzy K-Means is a more statistically formalized method and discovers soft clusters where a particular point can belong to more than one cluster with certain probability.

Algorithm

Like K-Means, Fuzzy K-Means works on those objects which can be represented in n-dimensional vector space and a distance measure is defined. The algorithm is similar to k-means.

  • Initialize k clusters
  • Until converged
    • Compute the probability of a point belong to a cluster for every pair
    • Recompute the cluster centers using above probability membership values of points to clusters

Design Implementation

The design is similar to K-Means present in Mahout. It accepts an input file containing vector points. User can either provide the cluster centers as input or can allow canopy algorithm to run and create initial clusters.

Similar to K-Means, the program doesn't modify the input directories. And for every iteration, the cluster output is stored in a directory cluster-N. The code has set number of reduce tasks equal to number of map tasks. So, those many part-0

Files are created in clusterN directory. The code uses driver/mapper/combiner/reducer as follows:

FuzzyKMeansDriver - This is similar to  KMeansDriver. It iterates over input points and cluster points for specified number of iterations or until it is converged.During every iteration i, a new cluster-i directory is created which contains the modified cluster centers obtained during FuzzyKMeans iteration. This will be feeded as input clusters in the next iteration.  Once Fuzzy KMeans is run for specified number of iterations or until it is converged, a map task is run to output "the point and the cluster membership to each cluster" pair as final output to a directory named "points".

FuzzyKMeansMapper - reads the input cluster during its configure() method, then  computes cluster membership probability of a point to each cluster.Cluster membership is inversely propotional to the distance. Distance is computed using  user supplied distance measure. Output key is encoded clusterId. Output values are ClusterObservations containing observation statistics.

FuzzyKMeansCombiner - receives all key:value pairs from the mapper and produces partial sums of the cluster membership probability times input vectors for each cluster. Output key is: encoded cluster identifier. Output values are ClusterObservations containing observation statistics.

FuzzyKMeansReducer - Multiple reducers receives certain keys and all values associated with those keys. The reducer sums the values to produce a new centroid for the cluster which is output. Output key is: encoded cluster identifier (e.g. "C14". Output value is: formatted cluster identifier (e.g. "C14"). The reducer encodes unconverged clusters with a 'Cn' cluster Id and converged clusters with 'Vn' clusterId.

Running Fuzzy k-Means Clustering

The Fuzzy k-Means clustering algorithm may be run using a command-line invocation on FuzzyKMeansDriver.main or by making a Java call to FuzzyKMeansDriver.run().

Invocation using the command line takes the form:

bin/mahout fkmeans \-i <input vectors directory> \-c <input clusters directory> \-o <output working directory> \-dm <DistanceMeasure> \-m <fuzziness argument >1> \-x <maximum number of iterations> \-k <optional number of initial clusters to sample from input vectors> \-cd <optional convergence delta. Default is 0.5> \-ow <overwrite output directory if present>-cl <run input vector clustering after computing Clusters>-e <emit vectors to most likely cluster during clustering>-t <threshold to use for clustering if -e is false>-xm <execution method: sequential or mapreduce>

Note: if the -k argument is supplied, any clusters in the -c directory will be overwritten and -k random points will be sampled from the input vectors to become the initial cluster centers.

Invocation using Java involves supplying the following arguments:

  1. input: a file path string to a directory containing the input data set a SequenceFile(WritableComparable, VectorWritable). The sequence file key is not used.
  2. clustersIn: a file path string to a directory containing the initial clusters, a SequenceFile(key, SoftCluster | Cluster | Canopy). Fuzzy k-Means SoftClusters, k-Means Clusters and Canopy Canopies may be used for the initial clusters.
  3. output: a file path string to an empty directory which is used for all output from the algorithm.
  4. measure: the fully-qualified class name of an instance of DistanceMeasure which will be used for the clustering.
  5. convergence: a double value used to determine if the algorithm has converged (clusters have not moved more than the value in the last iteration)
  6. max-iterations: the maximum number of iterations to run, independent of the convergence specified
  7. m: the "fuzzyness" argument, a double > 1. For m equal to 2, this is equivalent to normalising the coefficient linearly to make their sum 1. When m is close to 1, then the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.
  8. runClustering: a boolean indicating, if true, that the clustering step is to be executed after clusters have been determined.
  9. emitMostLikely: a boolean indicating, if true, that the clustering step should only emit the most likely cluster for each clustered point.
  10. threshold: a double indicating, if emitMostLikely is false, the cluster probability threshold used for emitting multiple clusters for each point. A value of 0 will emit all clusters with their associated probabilities for each vector.
  11. runSequential: a boolean indicating, if true, that the algorithm is to use the sequential reference implementation running in memory.

After running the algorithm, the output directory will contain: 1. clusters-N: directories containing SequenceFiles(Text, SoftCluster) produced by the algorithm for each iteration. The Text key is a cluster identifier string. 1. clusteredPoints: (if runClustering enabled) a directory containing SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable key is the clusterId. The WeightedVectorWritable value is a bean containing a double weight and a VectorWritable vector where the weights are computed as 1/(1+distance) where the distance is between the cluster center and the vector using the chosen DistanceMeasure.

Examples

The following images illustrate Fuzzy k-Means clustering applied to a set of randomly-generated 2-d data points. The points are generated using a normal distribution centered at a mean location and with a constant standard deviation. See the README file in the /examples/src/main/java/org/apache/mahout/clustering/display/README.txt for details on running similar examples.

The points are generated as follows:

  • 500 samples m=[1.0, 1.0](1.0,-1.0.html) sd=3.0
  • 300 samples m=[1.0, 0.0](1.0,-0.0.html) sd=0.5
  • 300 samples m=[0.0, 2.0](0.0,-2.0.html) sd=0.1

In the first image, the points are plotted and the 3-sigma boundaries of their generator are superimposed.

fuzzy

In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As Fuzzy k-Means is an iterative algorithm, the centers of the clusters in each recent iteration are shown using different colors. Bold red is the final clustering and previous iterations are shown in [orange, yellow, green, blue, violet and gray](orange,-yellow,-green,-blue,-violet-and-gray.html) . Although it misses a lot of the points and cannot capture the original, superimposed cluster centers, it does a decent job of clustering this data.

fuzzy

The third image shows the results of running Fuzzy k-Means on a different data set which is generated using asymmetrical standard deviations. Fuzzy k-Means does a fair job handling this data set as well.

fuzzy


这篇关于FuzzyKmeans的Mahout实现的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/1072722

相关文章

使用Python实现IP地址和端口状态检测与监控

《使用Python实现IP地址和端口状态检测与监控》在网络运维和服务器管理中,IP地址和端口的可用性监控是保障业务连续性的基础需求,本文将带你用Python从零打造一个高可用IP监控系统,感兴趣的小伙... 目录概述:为什么需要IP监控系统使用步骤说明1. 环境准备2. 系统部署3. 核心功能配置系统效果展

Python实现微信自动锁定工具

《Python实现微信自动锁定工具》在数字化办公时代,微信已成为职场沟通的重要工具,但临时离开时忘记锁屏可能导致敏感信息泄露,下面我们就来看看如何使用Python打造一个微信自动锁定工具吧... 目录引言:当微信隐私遇到自动化守护效果展示核心功能全景图技术亮点深度解析1. 无操作检测引擎2. 微信路径智能获

Python中pywin32 常用窗口操作的实现

《Python中pywin32常用窗口操作的实现》本文主要介绍了Python中pywin32常用窗口操作的实现,pywin32主要的作用是供Python开发者快速调用WindowsAPI的一个... 目录获取窗口句柄获取最前端窗口句柄获取指定坐标处的窗口根据窗口的完整标题匹配获取句柄根据窗口的类别匹配获取句

在 Spring Boot 中实现异常处理最佳实践

《在SpringBoot中实现异常处理最佳实践》本文介绍如何在SpringBoot中实现异常处理,涵盖核心概念、实现方法、与先前查询的集成、性能分析、常见问题和最佳实践,感兴趣的朋友一起看看吧... 目录一、Spring Boot 异常处理的背景与核心概念1.1 为什么需要异常处理?1.2 Spring B

Python位移操作和位运算的实现示例

《Python位移操作和位运算的实现示例》本文主要介绍了Python位移操作和位运算的实现示例,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价值,需要的朋友们下面随着小编来一... 目录1. 位移操作1.1 左移操作 (<<)1.2 右移操作 (>>)注意事项:2. 位运算2.1

如何在 Spring Boot 中实现 FreeMarker 模板

《如何在SpringBoot中实现FreeMarker模板》FreeMarker是一种功能强大、轻量级的模板引擎,用于在Java应用中生成动态文本输出(如HTML、XML、邮件内容等),本文... 目录什么是 FreeMarker 模板?在 Spring Boot 中实现 FreeMarker 模板1. 环

Qt实现网络数据解析的方法总结

《Qt实现网络数据解析的方法总结》在Qt中解析网络数据通常涉及接收原始字节流,并将其转换为有意义的应用层数据,这篇文章为大家介绍了详细步骤和示例,感兴趣的小伙伴可以了解下... 目录1. 网络数据接收2. 缓冲区管理(处理粘包/拆包)3. 常见数据格式解析3.1 jsON解析3.2 XML解析3.3 自定义

SpringMVC 通过ajax 前后端数据交互的实现方法

《SpringMVC通过ajax前后端数据交互的实现方法》:本文主要介绍SpringMVC通过ajax前后端数据交互的实现方法,本文给大家介绍的非常详细,对大家的学习或工作具有一定的参考借鉴价... 在前端的开发过程中,经常在html页面通过AJAX进行前后端数据的交互,SpringMVC的controll

Spring Security自定义身份认证的实现方法

《SpringSecurity自定义身份认证的实现方法》:本文主要介绍SpringSecurity自定义身份认证的实现方法,下面对SpringSecurity的这三种自定义身份认证进行详细讲解,... 目录1.内存身份认证(1)创建配置类(2)验证内存身份认证2.JDBC身份认证(1)数据准备 (2)配置依

利用python实现对excel文件进行加密

《利用python实现对excel文件进行加密》由于文件内容的私密性,需要对Excel文件进行加密,保护文件以免给第三方看到,本文将以Python语言为例,和大家讲讲如何对Excel文件进行加密,感兴... 目录前言方法一:使用pywin32库(仅限Windows)方法二:使用msoffcrypto-too