Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (Doc ID 1546004.1)

本文主要是介绍Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (Doc ID 1546004.1),希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

In this Document

 Purpose
 Scope
 Details
 What does "split brain" mean?
 Why is this a problem?
 How does the clusterware resolve a "split brain" situation?
 Identifying a split-brain eviction
 Finding the cohort
 Understanding the cohort message
 Using the cohort message to identify interconnect network issues
 Follow-up Action
 Community Discussions

 References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

PURPOSE

The purpose of this note is to explain split-brain node evictions in Oracle Clusterware release 11.2

SCOPE

The intended audience of this note is Oracle Clusterware 11.2 administrators at any level of expertise. As written, this note applies only to 11.2.

DETAILS

Missed network heartbeat (NHB) evictions happen when ocssd of the surviving node loses contact with the evicted node over the interconnect. The nodes must be able to communicate over the interconnect to avoid a "split brain" situation. In the case of a "split brain" node eviction, one node aborted itself to avoid "split brain" when communication over the interconnect was compromised.

What does "split brain" mean?

"Split brain" means that there are 2 or more distinct sets of nodes, or "cohorts", with no communication between the two cohorts.

For example:
Suppose there are 4 nodes named A, B, C, D, in the following situation
* Nodes A,B can talk to each other; nodes C,D can talk to each other
* But A and B cannot talk to C or D, and vice versa
Then there are two cohorts: {A, B} and {C, D}.

Why is this a problem?

In a split-brain situation, there are in a sense two (or more) separate clusters working on the same shared storage. This has the potential for data corruption. So the split-brain must be resolved.

How does the clusterware resolve a "split brain" situation?

Oracle Clusterware handles the split-brain by terminating all the nodes in the SMALLER cohort.
If both of the cohorts are the same size, the cohort with the lowest numbered node in it survives.

The clusterware identifies the LARGEST cohort, and aborts all the nodes which do NOT belong to that cohort.

Identifying a split-brain eviction

In a split-brain node eviction, the following message is present in the ocssd log ($GRID_HOME/log/<hostname>/ocssd/ocssd.log) of the evicted node:

clssnmCheckDskInfo: Aborting local node to avoid splitbrain.

And earlier in the same log, within 10 minutes prior to "clssnmCheckDskInfo: Aborting local node" message:

clssnmPollingThread: node %s (%n) at <X>% heartbeat fatal, removal in...

 

Finding the cohort

The split-brain message in the ocssd.log will show "cohort" information. For example:

2012-12-28 20:26:25.803: [    CSSD][1111296320]clssnmCheckDskInfo: My cohort: 1
2012-12-28 20:26:25.803: [    CSSD][1111296320]clssnmCheckDskInfo: Surviving cohort: 2,3,4
2012-12-28 20:26:25.803: [    CSSD][1111296320](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 1, sprora01, is smaller than cohort of 3 nodes led by node 2, sprora02, based on map type 2

Understanding the cohort message

In a split-brain situation, ocssd on each node records on the voting disk the set of nodes it can communicate with. Each set is known as a "cohort". When there are two (or more) mutually non-intersecting sets, we have a "split-brain" situation. It means that there are two (or more) separate sets of nodes which cannot talk to each other over the interconnect. 

For example, in the above quote

My cohort: 1
Surviving cohort: 2,3,4


The meaning of these messages is

* "My cohort: 1" => The list of nodes I can communicate with: 1
* "Surviving cohort: 2,3,4" => From the voting disk, I know that nodes 2,3,4 can all communicate with each other.
* "Cohort of 1 nodes with leader 1, sprora01, is smaller than cohort of 3 nodes led by node 2, sprora02"
=> Oracle Clusterware has identified that the cohort {1} is smaller than the cohort {2,3,4}.

Oracle Clusterware handles the split-brain by terminating all the nodes in the SMALLER cohort. In this case, the smaller cohort is {1}. Therefore, ocssd on node {1} aborts the node.

Using the cohort message to identify interconnect network issues

The cohort message describes which nodes can communicate with each other.

Each cohort is a set of nodes that can talk to each other, and cannot talk to the nodes NOT in the cohort.

In the above example, the cohort message tells us that nodes {2,3,4} are all in communication; node 1 is not in communication with any of them.

 

Follow-up Action


The private network between node 1 and the other 3 nodes should be checked.

Please refer to the following note to check private interconnect network:  Document 1534949.1 - Oracle Grid Infrastructure: How to Troubleshoot Missed Network Heartbeat Evictions

这篇关于Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (Doc ID 1546004.1)的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/632532

相关文章

Oracle 数据库数据操作如何精通 INSERT, UPDATE, DELETE

《Oracle数据库数据操作如何精通INSERT,UPDATE,DELETE》在Oracle数据库中,对表内数据进行增加、修改和删除操作是通过数据操作语言来完成的,下面给大家介绍Oracle数... 目录思维导图一、插入数据 (INSERT)1.1 插入单行数据,指定所有列的值语法:1.2 插入单行数据,指

Springboot3+将ID转为JSON字符串的详细配置方案

《Springboot3+将ID转为JSON字符串的详细配置方案》:本文主要介绍纯后端实现Long/BigIntegerID转为JSON字符串的详细配置方案,s基于SpringBoot3+和Spr... 目录1. 添加依赖2. 全局 Jackson 配置3. 精准控制(可选)4. OpenAPI (Spri

Oracle修改端口号之后无法启动的解决方案

《Oracle修改端口号之后无法启动的解决方案》Oracle数据库更改端口后出现监听器无法启动的问题确实较为常见,但并非必然发生,这一问题通常源于​​配置错误或环境冲突​​,而非端口修改本身,以下是系... 目录一、问题根源分析​​​二、保姆级解决方案​​​​步骤1:修正监听器配置文件 (listener.

MySQL查看表的最后一个ID的常见方法

《MySQL查看表的最后一个ID的常见方法》在使用MySQL数据库时,我们经常会遇到需要查看表中最后一个id值的场景,无论是为了调试、数据分析还是其他用途,了解如何快速获取最后一个id都是非常实用的技... 目录背景介绍方法一:使用MAX()函数示例代码解释适用场景方法二:按id降序排序并取第一条示例代码解

使用雪花算法产生id导致前端精度缺失问题解决方案

《使用雪花算法产生id导致前端精度缺失问题解决方案》雪花算法由Twitter提出,设计目的是生成唯一的、递增的ID,下面:本文主要介绍使用雪花算法产生id导致前端精度缺失问题的解决方案,文中通过代... 目录一、问题根源二、解决方案1. 全局配置Jackson序列化规则2. 实体类必须使用Long封装类3.

Oracle 通过 ROWID 批量更新表的方法

《Oracle通过ROWID批量更新表的方法》在Oracle数据库中,使用ROWID进行批量更新是一种高效的更新方法,因为它直接定位到物理行位置,避免了通过索引查找的开销,下面给大家介绍Orac... 目录oracle 通过 ROWID 批量更新表ROWID 基本概念性能优化建议性能UoTrFPH优化建议注

PostgreSQL 序列(Sequence) 与 Oracle 序列对比差异分析

《PostgreSQL序列(Sequence)与Oracle序列对比差异分析》PostgreSQL和Oracle都提供了序列(Sequence)功能,但在实现细节和使用方式上存在一些重要差异,... 目录PostgreSQL 序列(Sequence) 与 oracle 序列对比一 基本语法对比1.1 创建序

VSCode中配置node.js的实现示例

《VSCode中配置node.js的实现示例》本文主要介绍了VSCode中配置node.js的实现示例,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价值,需要的朋友们下面随着... 目录一.node.js下载安装教程二.配置npm三.配置环境变量四.VSCode配置五.心得一.no

全解析CSS Grid 的 auto-fill 和 auto-fit 内容自适应

《全解析CSSGrid的auto-fill和auto-fit内容自适应》:本文主要介绍了全解析CSSGrid的auto-fill和auto-fit内容自适应的相关资料,详细内容请阅读本文,希望能对你有所帮助... css  Grid 的 auto-fill 和 auto-fit/* 父元素 */.gri

Java注解之超越Javadoc的元数据利器详解

《Java注解之超越Javadoc的元数据利器详解》本文将深入探讨Java注解的定义、类型、内置注解、自定义注解、保留策略、实际应用场景及最佳实践,无论是初学者还是资深开发者,都能通过本文了解如何利用... 目录什么是注解?注解的类型内置注编程解自定义注解注解的保留策略实际用例最佳实践总结在 Java 编程