Abnormal Server
1. Address conflict. "Address already in use" error occurs when the broker is started.
Solution: Change the value of listenPort in the configuration file and then restart the broker.
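As a sketch, the change is a single line in the broker's properties configuration file (10915 here is only an example; pick any free port, and remember that ports listenPort through listenPort+2 must then be open, as noted in problem 9 below):

```properties
# Hypothetical example: move the broker off the conflicting port.
listenPort=10915
```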
2. Mismatching brokerName. "broker-b does not match the expected group name: broker-a" error occurs when the broker is started.
Cause: The store and bdb directories of an existing broker were copied directly to another master broker for startup, with only the brokerName modified.
Solution: In versions later than 2.0, the brokerName cannot be changed once created. To change the brokerName, you must delete the store directory.
3. service not available
When the number of messages sent reaches a certain amount, "create maped file failed, please make sure OS and JDK both 64bit" error occurs. When the number of topic queues reaches 1024, "service not available now, maybe disk full, maybe your broker machine memory too small" error occurs.
Solution: Run the ulimit -a command to check the system limits: open files should be at least 655350 and max memory size should be unlimited. If not, adjust the system parameters as described in the installation manual.
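The check above can be sketched as a small script; the 655350 threshold is the one stated in this document's installation manual reference:

```shell
# Sketch: check the two limits that the "service not available" error points at.
open_files=$(ulimit -n)
max_mem=$(ulimit -m 2>/dev/null || echo unknown)
echo "open files:      $open_files"
echo "max memory size: $max_mem"
if [ "$open_files" != "unlimited" ] && [ "$open_files" -lt 655350 ]; then
  echo "WARN: open files limit is below 655350; raise it (e.g. in /etc/security/limits.conf)"
fi
if [ "$max_mem" != "unlimited" ]; then
  echo "WARN: max memory size is not unlimited"
fi
```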
4. Insufficient disk space
When the disk space usage is greater than 85%, "CODE: 14 DESC: service not available now, maybe disk full, CL: 0.87 CQ: 0.87 INDEX: 0.87, maybe your broker machine memory too small." error occurs.
Solution: The message middleware provides two policies: high data security and high service reliability. The details are as follows:
If the disk usage is greater than 85% under the high data security policy and no files have expired, you can shorten the data retention time so that messages are deleted sooner and disk space is freed.
(1) Run the updateBrokerConfig command to modify the fileReservedTime property, which is the message retention time in hours. You can reduce the retention time based on your requirements to free up disk space.
(2) You need to modify both the master and slave brokers.
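A sketch of step (1), reusing the namesrv and broker addresses from the examples elsewhere in this document (substitute your own, and repeat against the slave broker's address per step 2); 48 hours is an arbitrary example value:

```shell
sh mqadmin updateBrokerConfig -b 10.142.90.33:10911 -n 10.142.90.33:9876 -k fileReservedTime -v 48
```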
5. An error reported when the broker is pulled up by the deamon
"Fail to queryBrokerMaxOffset" error is reported in the deamon.log.
Causes:
(1) The configuration file is incorrect.
(2) A master/slave switchover was performed and the cluster was then manually modified or restarted, so the address and role of the starting process differ from those stored in ZooKeeper. As a result, the startup fails.
(3) Residual data from the last startup failure was not cleaned up.
Solution:
(1) Delete the cluster information from the ZooKeeper.
(2) Check the configuration file to verify whether the port and path are valid.
(3) Delete the running directory.
(4) Restart the process through auto or manual deployment.
6. Stagnant consumption progress
If you query the consumption progress by running the consumerProgress command, the progress of some queues does not change even though the consumer is consuming messages properly.
Cause: Some messages are not acked in certain queues. As a result, the consumption progress has no change.
Solution: Run the consumerProgress command. In the output, find the consumer offset to locate the un-acked messages. Then:
(1) In BDB consumption mode, restart the application or call the API com.ctg.mq.api.IMQAckHandler.ackMessageSuccess(String msgID) to ack the stuck messages.
(2) If it is the native ordered consumption mode, restart the application.
(3) If it is the native unordered consumption mode, start an instance in the same consumer group to ack stuck messages.
7. Failing to delete a topic
When a topic is being deleted, "topic **** is consuming by consumer ****" or "topic *** is publishing by producer ***" error occurs.
Cause: To delete a topic, it must have no producer or consumer subscription and all producers and consumers associated with it must be offline. Otherwise, it fails to be deleted.
Solution:
(1) Go to Console > Topic Management > Details > Producer Group|Consumer Group > Connect Instances to check whether any client is connected to this topic.
(2) Run a command similar to sh mqadmin deleteTopic -n 10.142.90.33:9876 -c mq_cluster -t mytesttopic to delete an ordered queue.
(3) Run a command similar to sh mqadmin deleteTopic -n 10.142.90.33:9876 -b 10.142.90.33:10911 -t mytesttopic to delete an unordered queue.
8. A BDB error when a broker is started
Cause: You have migrated the store directory or changed the group name, address, or port of the broker.
Solution: Delete the consumeStore directory under the store directory and restart the broker.
9. The slave broker is started but cannot be found in clusterList
The slave broker is started but cannot be added to a cluster and cannot be found in clusterList.
Checks:
(1) Check the /etc/hosts file to verify that the mapping between the host name and IP address is correct.
(2) Check the firewall settings to see whether the required ports are open. The ports from listenPort to listenPort+2 must all be open; if the master broker has listenPort=10911, then ports 10911, 10912, and 10913 must be open.
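The port range in check (2) can be enumerated mechanically; this sketch only prints the ports to verify (the nc command shown in the output is a common way to test reachability from a peer machine):

```shell
# Sketch: given listenPort, list the ports that must be reachable from peers.
listenPort=10911
for p in $(seq "$listenPort" $((listenPort + 2))); do
  echo "port $p must be open (check with: nc -z <broker-ip> $p)"
done
```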
10. An ordered topic created in CLI is displayed as an unordered topic in the web console
You specified the -o true option when using the updateTopic command to create an ordered topic, but it is displayed as unordered when queried on the web console.
Cause: The cluster has multiple namesrvs, but you have entered only one namesrv when creating the topic.
Solution: Enter the namesrvs of the broker cluster and separate them by semicolons. Example: sh mqadmin updateTopic -n "10.142.90.30:9876;10.142.90.28:9876" -t crmTopic -o true
11. The consumer subscription does not exist
"The consumer's subscription not exist, group: consumerAepIdealLogGroup" occurs in broker.log
Cause: When the same subscription group is used to consume different topics at the same time, the subscription is overwritten.
Solution: Do not use consumers in the same subscription group to subscribe to different topics. To change the subscription, shut down the original consumer first.
12. When you use the clusterList for query, the master broker TPS is not 0, but the slave broker TPS is always 0
This is most likely the result of a synchronization error. Check store.log or storeerror.log for persistent error messages. To solve the problem, delete the store directory of the slave broker and re-synchronize the data.
Note: An HA module should be deployed and the master broker’s brokerRole=ASYNC_MASTER. Otherwise, an error is reported for production when stopping the slave broker.
Solution:
(1) Manually stop the slave broker (run kill <pid> without -9; if the auto pull broker parameter is set to true, stop the deamon of the slave broker first).
(2) Delete or back up the store directory of the slave broker. If the space allows, you are advised to back up the directory by running mv.
(3) Manually start the broker by running sh sh/broker_*.sh.
(4) After the slave broker is started, view it with the clusterList. The slave broker has a high TPS because it is being synchronized.
13. Fault recovery of the master broker
The fault recovery process is usually triggered when the consumequeue is wrong and messages cannot be pulled. (Note: the problem may not necessarily be caused by the consumequeue fault). You can query the fault information by offset.
Fault recovery process:
(1) Stop the master-slave deamons and master-slave brokers of the group to be recovered.
(2) Delete the checkpoint, consumequeue, consumeStore, and index entries in the store directory of the master broker (you can also back them up by moving and renaming them).
(3) Check whether the abort file exists in the store directory. If not, create one (touch abort).
(4) Start the master broker and view store.log. You can see the recovery process log. If no error is reported, the recovery succeeded.
(5) If there are many commitlog files, recovery may take a long time. You can check the recovery progress in store.log or by verifying whether the broker port is listening.
(6) After the master broker is started, you can verify whether the recovery is successful by consuming messages or based on the offset.
(7) After the master broker recovers, start the slave broker and the master/slave deamons.
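Steps (2) and (3) can be sketched as below. The demo operates on a throwaway directory so it is safe to run; in practice, point STORE at the master broker's real store directory (the layout created here only mimics the entries named in step 2):

```shell
# Demo of recovery steps (2)-(3) on a throwaway directory.
STORE=$(mktemp -d)
mkdir -p "$STORE/consumequeue" "$STORE/consumeStore" "$STORE/index"
touch "$STORE/checkpoint"

# Step (2): back up the metadata by renaming instead of deleting it outright.
for f in checkpoint consumequeue consumeStore index; do
  mv "$STORE/$f" "$STORE/$f.bak"
done

# Step (3): create the abort marker so the broker runs recovery on startup.
touch "$STORE/abort"
ls "$STORE"
```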
14. RPC errors, which may occur on all server components
Cause: A decode error occurs when non-component protocols, such as HTTP or Telnet, are used for access.
Solution: It does not affect the RPC requests for the application server and client and can be ignored.
Application Client
1. Connection failed.
Solution: If the network is normal, check whether the open files limit (ulimit -a) is only 1024, and if so, raise it to 65535.
2. RemotingTimeoutException.
"RemotingTimeoutException: wait response on the channel <10.4.246.198:10911> timeout, 3000ms" error occurs in the server log.
Solution: It is usually caused by faulty communication between the client and the server. Run ping <ip> and telnet <ip> <port> to troubleshoot the problem, and check the firewall settings.
3. No route info of this topic.
Causes:
(1) A topic is not created.
(2) You have entered the wrong name server.
(3) It is unable to obtain routes due to network issues.
Solution:
(1) Create a topic in the Console.
(2) Check whether the namesrv address of the client is configured incorrectly.
(3) Check whether the network runs as expected.
4. SLAVE_NOT_AVAILABLE
When the producer sends a message, "status:SLAVE_NOT_AVAILABLE" error occurs, indicating that the slave node is faulty.
Solution:
(1) The slave machine is faulty. Restart the slave node, and check the network connectivity.
(2) In case of multiple NICs, explicitly set the broker IP addresses in the broker's properties configuration file. Example:
brokerIP1=10.4.246.130
brokerIP2=10.4.246.130
(3) This prevents the wrong NIC IP address and incorrect slave node information from being read.
5. Oversized message body
"Fail to send message, for: message body size over max message size, max:" occurs in the client. 524288
Solution:
(1) Check the max message body size on the server, that is, the maxMessageSize parameter in the broker's configuration file. If the parameter is not configured, the default size is 512 KB.
(2) Check whether the current message body exceeds the max message body size set by the client (default: 128 KB).
Note: RocketMQ recommends a size of 50K or less (after compression) for a message body.
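The broker-side limit from check (1) is a single configuration entry; as a sketch, raising it to 1 MB (an arbitrary example value) would look like:

```properties
# Max accepted message body in bytes (per this document, 524288 = 512 KB if unset).
maxMessageSize=1048576
```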
6. The group has been created
"The producer/consumer group has been created" error occurs when the producer/consumer is running.
Cause: Only one instance of a producerGroupName can be loaded in the same JVM (the same applies to consumerGroupName). Otherwise, an error is reported.
Solution:
(1) To use the same producerGroupName, deploy multiple instances or start multiple processes.
(2) In a process, start multiple threads to share a producer object instance.
7. Subscription group not exist or %retry% topics have no routing info
Cause: You have not established a consumer relationship or created a subscription group.
Solution: Create a subscription group on the Console or CLI.
8. Message already acked, ackMessage failed
Solution: It indicates that the message has been acked and can be ignored.
9. Number of calls for ackMessageRetry
This clarifies whether ackMessageRetry must be called once or multiple times: if an ack fails and the user calls ackMessageRetry, and that retry also fails, does the user need to call ackMessageRetry again, or is the retry performed automatically?
In the current version, the API will not automatically retry ack. If the retry ack fails, the user needs to call the API again.
10. When an uncertain exception occurs during ack, such as timeout or network exception, the application needs to determine whether the message has been acked successfully.
Solution:
(1) In the Instant Query module of the Console, check whether the message has been acked successfully, and then handle the problem based on the query result.
(2) Retry the ack. If the message is already acked, an ack exception is thrown; how to handle it is up to the application.
11. Client registration failed
"No matched consumer for the PullRequest PullRequest" error occurs in the client log.
Cause: The client instance registration failed.
Solution: Check the client code and restart the client process.
12. the consumer message buffer is full, so do flow control
"The consumer message buffer is full, so do flow control" error occurs in the client log.
Cause: The push client consumes messages too slowly and the local cache queue is full, and messages are not pulled from the server temporarily. Slow consumption may be caused by network issues, excessive topic queues, insufficient consumers, or inadequate memory.
Solution:
(1) Check whether the network is abnormal and works slowly.
(2) Add consumer instances.
(3) For an unimportant message, you can reduce the number of topic queues if it is not convenient to add consumer instances.
13. system busy, start flow control for a while
"[REJECTREQUEST]system busy, start flow control for a while" or "[PCBUSY_CLEAN_QUEUE]broker busy, start flow control for a while, period in queue" error occurs in the client log.
Causes:
(1) The producer instance is used to send messages when it is being closed, and netty rejects the request when the connection is closed.
(2) There are few threads to send and process requests.
Solution:
(1) Optimize the usage flow and do not use a producer after it is closed.
(2) If the broker is a synchronous master node, change it to an asynchronous master node, or set sendMessageThreadPoolNums=32 and waitTimeMillsInSendQueue=1000.
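Solution (2) as a broker properties fragment, using exactly the values suggested above (apply either the role change or the thread-pool tuning, depending on which variant you choose):

```properties
# Variant A: switch the master from synchronous to asynchronous replication.
brokerRole=ASYNC_MASTER
# Variant B: widen the send pipeline instead.
sendMessageThreadPoolNums=32
waitTimeMillsInSendQueue=1000
```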
14. Consumers cannot consume messages
On the Console, go to the subscription management menu. Check whether the subscription group has an online consumer instance. If not, check whether the consumer client log has a connection error.
Also check the consumer client logic for inconsistent subscriptions.
15. Whether the restart of a consumer's machine due to downtime will cause message loss
RocketMQ stores message data and subscription information permanently. When a consumer goes offline and becomes online again, consumption will start from the offset persisted by the broker before the consumer is offline. Message loss will not occur.
16. Whether the message tag can be null when subscribing to a message
If the tag is null at the time of topic subscription, consumers cannot consume messages. If you do not want to filter messages by tags, you can set the tag to *, as shown in the following example:
consumer.subscribe(topic, "*");
17."Signature validate by dauth failed" error occurs during client Connection
The error is usually caused by an ACL authentication failure, such as incorrect AccessKey and SecretKey on the client. Check whether the AccessKey and SecretKey are incorrect.