You don’t need to be an expert to perform basic vSAN troubleshooting, and many problems can be resolved easily without opening a support case. All you need is a good process.
Here is a simple procedure that might be useful when you find out you have a problem.
First of all, you need to identify the problem, and ideally you want to provide as much accurate information as possible. To do that, you need to ask at least the following questions:
- What is the problem?
- What is the expectation?
- What was working and has stopped?
- When was the problem observed?
- Has anything changed on the platform?
- Who is affected? (The business impact)
- How is the problem being measured?
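One way to make sure none of these questions is skipped is to record the answers in a small structure and list what is still missing. The sketch below is only an illustration: the class name, field names, and sample answers are hypothetical and simply mirror the questions above.

```python
from dataclasses import dataclass, fields


@dataclass
class ProblemStatement:
    """One field per question from the list above (field names are hypothetical)."""
    problem: str = ""
    expectation: str = ""
    last_working: str = ""
    first_observed: str = ""
    recent_changes: str = ""
    who_is_affected: str = ""
    how_measured: str = ""

    def unanswered(self):
        """Return the names of the questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical example values, for illustration only.
report = ProblemStatement(
    problem="VMs on one host report high storage latency",
    first_observed="after last night's maintenance window",
)
print("Still to answer:", ", ".join(report.unanswered()))
```

An empty list from `unanswered()` means the problem statement is complete enough to share with a colleague or to attach to a support case.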
The vSphere Client UI provides information, such as warnings, errors, and notifications, that can help you troubleshoot.
The vSAN health service is a good starting point for diagnosing cluster health, as it checks many aspects of the vSAN cluster, such as the hardware, the network, connectivity, and storage device health.
Searching the logs is also very important, especially since events are recorded in the /var/log directory on each host that is part of the vSAN cluster. The most important logs are:
You can use the grep command to search a log file for keywords that indicate a problem (error, inaccessible, unregistered, absent, offline, and so on).
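The same keyword search can be sketched in Python, for example to scan a copy of a log on a workstation. This is a minimal sketch and not a VMware tool: the function name is hypothetical, the keywords are the ones listed above, and the sample lines are made up for illustration (real vSAN log entries look different).

```python
import re

# Keywords taken from the text above; extend the tuple as needed.
KEYWORDS = ("error", "inaccessible", "unregistered", "absent", "offline")


def find_suspect_lines(lines, keywords=KEYWORDS):
    """Return (line_number, text) pairs for lines containing any keyword.

    Matching is case-insensitive, like `grep -i`.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


# Made-up sample lines, for illustration only.
sample = [
    "component state changed to ABSENT\n",
    "periodic check completed\n",
    "Error: object is inaccessible\n",
]

hits = find_suspect_lines(sample)
for number, text in hits:
    print(f"{number}: {text}")
```

To scan a real file, pass an open file handle instead of the sample list, for example `with open(path) as f: find_suspect_lines(f)`, where `path` is whichever log you copied from /var/log. The equivalent one-liner with grep would be along the lines of `grep -iE 'error|inaccessible|unregistered|absent|offline'` followed by the log file name.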
If you find any error related to vSAN, always check the VMware knowledge base, which has many KB articles on the vSAN health checks. See VMware knowledge base article 2114803.
Try to identify issues without making assumptions. You can also use the PNOMA approach, which is VMware's equivalent of the OSI model for troubleshooting a vSAN environment.
Below are some of the useful commands with the PNOMA approach:
- vmware -vl: vSphere build version.
- esxcli vsan cluster get: Local UUID, Node type, Sub-Cluster UUID, Member count, Unicast enabled, and Maintenance mode.
- esxcli vsan storage list: Provides storage info. about the disks vSAN is using (disk identifiers, CMMDS status, dedup/compression, and encryption).
- esxcli vsan debug controller list: Provides info. about installed controllers, hardware ID, driver name, and driver version.
- esxcli vsan debug disk list: Lists info. about the status of the vSAN disks.
- vdq -iH: vSAN disk group mappings.
- esxcli vsan network list: vSAN vmk, Multicast settings, port info.
- esxcli network nic list: Print info. about installed physical NICs.
- esxcli network ip interface ipv4 get: Print info. about vmks, IPs, subnets, and gateways.
- vmkping -I vmk# x.x.x.x: Ping an IP address specifying a vmk.
- vmkping -I vmk# -s 8972 x.x.x.x -d: Ping an IP address specifying a vmk, with a jumbo-frame-sized payload and fragmentation disabled (-d).
- esxcli vsan cluster unicastagent list: Print vSAN unicast table.
- esxcli vsan debug object list: Print detailed information about object health and configuration.
- esxcli vsan debug object health summary get: Print the health summary status of all the vSAN objects.
- esxcli vsan debug object: Debug commands for vSAN objects.
- esxcli vsan policy: Commands for vSAN storage policy configuration.
- esxcli vsan debug disk list: Displays the overall health of individual disks.
- esxcli vsan debug limit get: Returns limit details for the disk groups provided by the host.
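The value 8972 in the jumbo-frame vmkping above is not arbitrary. Assuming IPv4 without header options and an MTU of 9000 bytes, it is the largest ICMP payload that still fits in a single unfragmented packet once the 20-byte IP header and the 8-byte ICMP header are subtracted. The short sketch below works through that arithmetic; the function name is hypothetical.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet for this MTU."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an IPv4 ICMP echo")
    return payload


# Standard and jumbo-frame MTUs.
for mtu in (1500, 9000):
    print(f"MTU {mtu}: vmkping -s {max_ping_payload(mtu)} -d")
```

For a standard 1500-byte MTU the same calculation gives 1472. If the ping with the don't-fragment option fails at the computed size but succeeds with a smaller one, some device on the path is probably not configured for the expected MTU.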
Useful Python scripts are included on every host. You can use them to introduce faults into the system for testing. Several of them are applicable to vSAN.
You can run these Python scripts with the --help option to display the options for each script.
One thing to be careful about: do not forget to document the problem and the root cause analysis (RCA). You will be glad you did if the issue ever comes up again.
Finally, everyone has their own preferences and approach to troubleshooting. However, keep in mind that the most important thing is to stay calm and methodical. There is no need to panic.
The ability to manage stress and maintain a good methodology even in the face of chaos is what makes an engineer a good engineer.