Hive，order by ,distribute by ,sort by ,cluster by 作用与区别（转载）

本文主要是介绍Hive，order by ,distribute by ,sort by ,cluster by 作用与区别（转载），希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

具有相同 Distribute By 列的所有行将进入相同的 reducer
https://www.docs4dev.com/docs/zh/apache-hive/3.1.1/reference/LanguageManual_SortBy.html

---------------

1、order by

hive中的order by 会对查询结果集执行一个全局排序，这也就是说所有的数据都通过一个reduce进行处理的过程，对于大数据集，这个过程将消耗很大的时间来执行。

2、sort by

hive的sort by 也就是执行一个局部排序过程。这可以保证每个reduce的输出数据都是有序的(但并非全局有效)。这样就可以提高后面进行的全局排序的效率了。对于这两种情况，语法区别仅仅是，一个关键字是order，另一个关键字是sort。用户可以指定任意期望进行排序的字段，并可以在字段后面加上asc关键字(默认)表示升序，desc关键字是降序排序。

在使用sort by之前，需要先设置Reduce的数量>1，才会做局部排序，如果Reduce数量是1，作用与order by一样，全局排序。

3、distribute by

distribute by 控制 map的输出在reduer中是如何划分的，mapreduce job 中传输的所有数据都是按照键-值对的方式进行组织的，因此hive在将用户的查询语句转换成mapreduce job时，其必须在内部使用这个功能。默认情况下，MapReduce计算框架会依据map输入的键计算相应的哈希值，然后按照得到的哈希值将键-值对均匀分发到多个reducer中去，不过不幸的是，这也是意味着当我们使用sort by 时，不同reducer的输出内容会有明显的重叠，至少对于排序顺序而已只这样，即使每个reducer的输出的数据都有序的。如果我们想让同一年的数据一起处理，那么就可以使用distribute by 来保证具有相同年份的数据分发到同一个reducer中进行处理，然后使用sort by 来安装我们的期望对数据进行排序:

4、cluster by

cluster by 除了distribute by 的功能外，还会对该字段进行排序，所以cluster by = distribute by +sort by 。

eg：select * from table cluster by year;

等价于：select * from table distribute by year sort by year;

https://zhuanlan.zhihu.com/p/93747613

----------------

hive中的distribute by是控制在map端如何拆分数据给reduce端的。
hive会根据distribute by后面列，根据reduce的个数进行数据分发，默认是采用hash算法。

对于distribute by进行测试，一定要分配多reduce进行处理，否则无法看到distribute by的效果。

hive> select * from test09;
OK
100 tom
200 mary
300 kate
400 tim
Time taken: 0.061 seconds

hive> insert overwrite local directory ‘/home/hjl/sunwg/ooo’ select * from test09 distribute by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201105020924_0070, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0070
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0070
2011-05-03 06:12:36,644 Stage-1 map = 0%, reduce = 0%
2011-05-03 06:12:37,656 Stage-1 map = 50%, reduce = 0%
2011-05-03 06:12:39,673 Stage-1 map = 100%, reduce = 0%
2011-05-03 06:12:44,713 Stage-1 map = 100%, reduce = 50%
2011-05-03 06:12:46,733 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0070
Copying data to local directory /home/hjl/sunwg/ooo
Copying data to local directory /home/hjl/sunwg/ooo
4 Rows loaded to /home/hjl/sunwg/ooo
OK
Time taken: 17.663 seconds

第一次执行根据id字段来做分发，结果如下：

[hjl@sunwg src]$ cat /home/hjl/sunwg/ooo/attempt_201105020924_0070_r_000000_0
400tim
200mary
[hjl@sunwg src]$ cat /home/hjl/sunwg/ooo/attempt_201105020924_0070_r_000001_0
300kate
100tom

这次我们换个分发的方式，采用length(id)的结果，因为这几条记录的id字段的长度都相同，所以应该会被分布到同一个reduce中。

hive> insert overwrite local directory ‘/home/hjl/sunwg/lll’ select * from test09 distribute by length(id);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201105020924_0071, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0071
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0071
2011-05-03 06:15:21,430 Stage-1 map = 0%, reduce = 0%
2011-05-03 06:15:24,454 Stage-1 map = 100%, reduce = 0%
2011-05-03 06:15:31,509 Stage-1 map = 100%, reduce = 50%
2011-05-03 06:15:34,539 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0071
Copying data to local directory /home/hjl/sunwg/lll
Copying data to local directory /home/hjl/sunwg/lll
4 Rows loaded to /home/hjl/sunwg/lll
OK
Time taken: 20.632 seconds

在查看下结果是否和我们的预期相同：
[hjl@sunwg src]$ cat /home/hjl/sunwg/lll/attempt_201105020924_0071_r_000000_0
[hjl@sunwg src]$ cat /home/hjl/sunwg/lll/attempt_201105020924_0071_r_000001_0
100tom
200mary
300kate
400tim

文件attempt_201105020924_0071_r_000000_0中没有记录，而全部的记录都在attempt_201105020924_0071_r_000001_0中。

https://blog.csdn.net/cjlion/article/details/80879469

这篇关于Hive，order by ,distribute by ,sort by ,cluster by 作用与区别（转载）的文章就介绍到这儿，希望我们推荐的文章对编程师们有所帮助！