YI-MapReduce

Enhanced Reliability

2024-11-05 07:06:50

YI-MapReduce, a comprehensive open-source big data platform product offered by eSurfing Cloud, includes a base platform for big data storage and computing and a platform for big data O&M. It focuses on improving and tuning the reliability and performance of big data components.

System Reliability

HA Implementation for Master Nodes

In the open-source versions of the components, data and computing nodes are designed as a distributed system, so the failure of a single node does not affect overall operation. Master nodes, however, run in a centralized mode, and a single point of failure there is the weak point in overall system reliability.

YI-MapReduce, an eSurfing Cloud big data platform, provides a dual-node-style high-availability mechanism for the master nodes of all components, including Doris FE, Doris BE, and Elasticsearch Nodedata. All of them run in active/standby or load-sharing configurations, which prevents a single-point failure from affecting system reliability.
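The active/standby idea can be illustrated with a minimal Python sketch. It is not the mechanism YI-MapReduce or any of its components uses; real components rely on their own election or coordination protocols. The `MasterNode` type, the `elect_active` function, and the 10-second timeout are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MasterNode:
    name: str
    last_heartbeat: float  # timestamp (seconds) of the last heartbeat received


def elect_active(nodes: List[MasterNode], current_active: Optional[str],
                 now: float, timeout: float = 10.0) -> Optional[str]:
    """Keep the current active node while it is healthy; otherwise promote a standby.

    A node counts as healthy if its last heartbeat is at most `timeout` seconds old.
    Returns None when no node is healthy.
    """
    healthy = [n.name for n in nodes if now - n.last_heartbeat <= timeout]
    if current_active in healthy:
        return current_active  # no failover needed
    return healthy[0] if healthy else None  # promote the first healthy standby
```

With two master nodes, the standby takes over only once the active node's heartbeat is older than the timeout, so a brief delay does not trigger an unnecessary switch.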

Reliability Assurance in Exception Scenarios

Reliability analysis is used to identify software and hardware exception scenarios and to define handling measures for each, which improves system reliability.

         Data reliability is guaranteed in the event of an unexpected power outage. Whether a single node or the entire cluster loses power, the system restores service normally once power returns. Key data is not lost unless the hard drive medium itself is damaged.

         Hard drive sub-health detection and fault handling do not affect the service.

         The system will automatically handle file system failures and recover affected services.

         The system will automatically handle process and node failures and recover affected services.

         The system will automatically handle network failures and recover affected services.

Node Reliability

Monitoring of Operating System Health Status

YI-MapReduce periodically collects the utilization of operating system hardware resources, including CPU, memory, hard drive, and network usage.
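As a rough illustration of what such periodic collection involves, the sketch below samples a few resource metrics using only the Python standard library. It is not the platform's collector: the function names are hypothetical, memory is read from `/proc/meminfo` (Linux only), and network counters are omitted for brevity.

```python
import os
import shutil
from typing import Dict, Optional


def read_mem_available_ratio(meminfo_path: str = "/proc/meminfo") -> Optional[float]:
    """Return MemAvailable / MemTotal on Linux, or None if it cannot be determined."""
    fields: Dict[str, int] = {}
    try:
        with open(meminfo_path) as f:
            for line in f:
                key, _, rest = line.partition(":")
                parts = rest.split()
                if parts:
                    fields[key] = int(parts[0])  # values are reported in kB
    except (OSError, ValueError):
        return None
    total = fields.get("MemTotal")
    available = fields.get("MemAvailable")
    if not total or available is None:
        return None
    return available / total


def sample_resources(path: str = "/") -> Dict[str, Optional[float]]:
    """Take one sample of CPU load, memory, and disk usage for the given mount path."""
    usage = shutil.disk_usage(path)
    sample: Dict[str, Optional[float]] = {
        "cpu_count": os.cpu_count(),
        "disk_used_ratio": usage.used / usage.total if usage.total else 0.0,
        "mem_available_ratio": read_mem_available_ratio(),
        "load_1m": None,
    }
    try:
        sample["load_1m"] = os.getloadavg()[0]  # Unix-like systems only
    except (AttributeError, OSError):
        pass
    return sample
```

A monitoring agent would call `sample_resources()` on a fixed interval and forward each sample to the management platform.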

Process Health Status Monitoring

YI-MapReduce checks the status of business instances and the health indicators of the processes within them, so that users can detect process health problems promptly.

Automatic Hard Drive Troubleshooting

YI-MapReduce, an eSurfing Cloud big data platform, enhances the open-source version by monitoring the status of hard drives and file systems on each node. If an exception occurs, it immediately removes the affected partitions from the storage pool. When a hard drive returns to normal operation (typically because the user has replaced the faulty drive with a new one), the drive is automatically brought back into service. Because failed drives can be replaced online, this significantly reduces the load on O&M personnel. Users can also configure hot standby disks, which greatly shortens the repair time for faulty hard drives and further improves system reliability.
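The remove-and-rejoin behavior described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation: `StoragePool` and `probe_by_write` are hypothetical names, and a simple write probe stands in for whatever disk and file system checks the platform actually performs.

```python
import tempfile
from typing import Callable, Iterable, Set


def probe_by_write(mount: str) -> bool:
    """Assumed health check: the mount is healthy if a small temp file can be written."""
    try:
        with tempfile.TemporaryFile(dir=mount) as f:
            f.write(b"ok")
            f.flush()
        return True
    except OSError:
        return False


class StoragePool:
    """Tracks which mount points are currently eligible to store data."""

    def __init__(self, mounts: Iterable[str],
                 is_healthy: Callable[[str], bool] = probe_by_write) -> None:
        self.mounts = list(mounts)
        self.is_healthy = is_healthy
        self.active: Set[str] = set()

    def refresh(self) -> Set[str]:
        """Re-check every mount: drop failed ones, re-admit recovered ones."""
        for mount in self.mounts:
            if self.is_healthy(mount):
                self.active.add(mount)      # new or recovered disk joins the pool
            else:
                self.active.discard(mount)  # faulty disk leaves the pool
        return set(self.active)
```

Calling `refresh()` periodically means a replaced drive rejoins the pool on the next check without a restart, which is the property that makes online replacement possible.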

Node Disk LVM Configuration

YI-MapReduce, an eSurfing Cloud big data platform, supports configuring multiple disks with LVM (Logical Volume Manager), organizing them into a single logical volume group. LVM avoids uneven disk usage by keeping usage balanced across all disks, which is particularly important for components such as HDFS and Kafka that can take advantage of multiple disks. LVM also supports disk expansion without remounting, which prevents service disruptions.

Data Reliability

YI-MapReduce, a big data platform product of eSurfing Cloud, utilizes the anti-affinity node group and placement group features provided by the Elastic Cloud Server (ECS). In conjunction with the rack awareness capability of Hadoop, it replicates data across multiple physical hosts, effectively preventing data loss due to physical hardware failures.
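Hadoop rack awareness is commonly driven by a topology script configured through the `net.topology.script.file.name` property: Hadoop passes host names or IP addresses as arguments, and the script prints one rack path per argument. The Python sketch below shows the shape of such a script. The host-to-rack mapping is hypothetical, and how YI-MapReduce derives rack information from ECS placement groups is not described here.

```python
import sys
from typing import Dict, List

# Hypothetical mapping from host or IP to rack path; replace with real topology data.
RACK_BY_HOST: Dict[str, str] = {
    "10.0.1.11": "/group-a/host-1",
    "10.0.1.12": "/group-a/host-2",
    "10.0.2.21": "/group-b/host-3",
}

# Hadoop's conventional rack name for hosts with no known location.
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts: List[str]) -> List[str]:
    """Return one rack path per input host, in the same order."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop may pass several hosts in a single invocation.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

When replicas are placed on different rack paths, and those paths correspond to distinct physical hosts, the loss of one physical host does not take out every replica of a block.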

