HomeLab – SuperMicro 5028D-TN4T Storage Driver Performance Issues and Fix

Ok, I’ll admit it…I’ve had serious lab withdrawals since having to give up the awesome Zettagrid Labs. Having a lab to tinker with goes hand in hand with being able to generate tech-related content…case in point, my new homelab got delivered on Monday and I have been working to get things set up so that I can deploy my new NestedESXi lab environment.

By way of a quick intro (longer first-impressions post to follow), I purchased a SuperMicro SYS-5028D-TN4T that I based on this TinkerTry Bundle, which has become a very popular system for vExpert homelabbers. It’s got an Intel Xeon D-1541 CPU and I loaded it up with 128GB of RAM. The system comes with an embedded Lynx Point AHCI controller that allows up to six SATA devices and is listed on the VMware Compatibility Guide for ESXi 6.5.

The issue I came across was to do with storage performance and the native driver that comes bundled with ESXi 6.5. With the release of vSphere 6.5 yesterday, the timing was perfect to install ESXi 6.5 and start building my management VMs. I first noticed something was wrong when uploading the Windows Server 2016 ISO to the datastore: the ISO took about 30 minutes to upload. From there I created a new VM and installed Windows…this took about two hours to complete, which was nowhere near what I expected…especially with the datastore being a decent-class SSD.

I created a new VM and kicked off another install, but this time I opened ESXTOP to see what was going on. As you can see from the screenshots below, the kernel and disk write latencies were off the charts, topping 2000ms and 700-1000ms respectively…in throughput terms I was getting about 10-20MB/s when I should have been getting 400-500MB/s.

ESXTOP was showing the VM with even worse write latency.
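For anyone wanting to check the same counters, the standard interactive ESXTOP views are enough to see it; a quick sketch, assuming SSH access to the host (column layouts may vary slightly between builds):

    # From an SSH session on the ESXi host
    esxtop
    # then press:
    #   d - disk adapter view (DAVG/cmd = device latency, KAVG/cmd = kernel latency)
    #   u - disk device view  (per-device latency plus MBREAD/s and MBWRTN/s throughput)
    #   v - disk VM view      (per-VM read/write latency)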

I wondered if I had bought a lemon of a storage controller and checked the queue depth of the card. It’s listed with a QD of 31, which isn’t horrible for a homelab, so my attention turned to the driver. Again referencing the VMware Compatibility Guide, the device driver listed for the controller is ahci version 3.0.22vmw.
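If you want to verify the queue depth yourself, it is visible from the command line; something along these lines should show it on a 6.5 host (the grep filter is just a convenience):

    # Each device reports its maximum queue depth in the device details
    esxcli storage core device list | grep -i "queue depth"

The adapter-level queue depth (AQLEN) also shows up in the ESXTOP disk adapter view mentioned above.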

I searched for the installed device driver modules and found that the one listed above was present; however, there was also a native VMware device driver.
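If you want to run the same check, commands along these lines will list the AHCI-related packages and modules on a 6.5 host (the exact VIB names can differ between builds, so treat the filter as a rough guide):

    # List installed driver VIBs and filter for AHCI
    # (both the legacy sata-ahci package and the native vmw-ahci package
    #  typically show up on an affected 6.5 install)
    esxcli software vib list | grep -i ahci

    # Show the matching kernel modules and whether they are loaded/enabled
    esxcli system module list | grep -i ahci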

I confirmed that the storage controller was using the native VMware driver and went about disabling it as per this VMware KB (thanks to @fbuechsel, who pointed me in the right direction in the vExpert Slack homelab channel), as shown below.
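The KB boils down to checking which driver has claimed the adapter and then disabling the native module so the legacy driver takes over at the next boot. A rough sketch of those steps, assuming vmw_ahci is the module claiming the controller:

    # Check which driver currently claims the SATA adapter(s)
    esxcfg-scsidevs -a

    # Disable the native AHCI module (per the VMware KB) so the legacy
    # ahci driver can claim the controller after a reboot
    esxcli system module set --enabled=false --module=vmw_ahci

    # Reboot the host for the change to take effect
    reboot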

After the host rebooted I checked to see if the storage controller was using the device driver listed in the Compatibility Guide. As you can see below, not only was it using that driver, but it was now showing six HBA ports as opposed to just the one seen in the first snippet above.
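Post-reboot, the quickest way to verify is to list the adapters again; something like the following should now show the SATA ports bound to the legacy ahci driver rather than vmw_ahci:

    # List storage adapters and the driver bound to each one
    esxcli storage core adapter list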

I once again created a new VM and installed Windows, and this time the install completed in a little under five minutes! Quite a difference! Upon running CrystalDiskMark I was now getting the expected speeds from the SSDs, and things are moving along quite nicely.

Hopefully this post saves anyone else who buys this or other SuperMicro SuperServers some time, and stops them getting caught out by poor storage performance caused by the native VMware driver packaged with ESXi 6.5.


References

http://www.supermicro.com/products/system/midtower/5028/SYS-5028D-TN4T.cfm

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2044993

18 comments

  • Hi, thanks for the info!! I had the same problem with an old NUC; after disabling the vmw_ahci I see the six HBA ports again and the speed is back to normal. The model is a “Panther Point AHCI Controller”.

  • Thank you! I had the same issue on a Supermicro 5018A-FTN4 (Atom C2758 CPU). I had been tearing my hair out all day wondering why the disk performance had deteriorated. I was also seeing lost datastore errors on my two local disks. I checked the compatibility guide and it is the same driver for this system as you have above. I disabled the VMware driver and I seem to be back in business.

  • Anthony, thank you so much for your careful documentation of this issue, which I’m carefully tracking as well. As you know, it turns out it doesn’t seem to affect everybody with Xeon D.

    While the exact systems affected haven’t yet been identified, nor the exact upgrade and install scenarios that can cause this, your fix seems to resolve the issue for everybody who has the need, which is great!

    Seems most of the right people are already looped in as well, see:
    https://twitter.com/ErikBussink/status/799662926449770496

  • Thanks for sharing. I had the same problem on my Intel NUC 5th gen.

  • Thanks for sharing indeed. The problem CAN go all the way back to Sandy Bridge and the Cougar Point SATA-AHCI controller, as I can attest. Couldn’t get more than 1MB/s out of an Intel 730 SSD 480GB, which had been speedy under 6.0U2.

  • Very good detective work!! Worked like a charm for me. Can’t thank you enough.

  • I believe this is my exact issue on my newly upgraded C602-based Patsburg SATA ports. I didn’t notice it on my main box because I only use LSI controllers. But one of my other boxes is just using the on-board SATA…so slow. Trying your fix now. Hoping this does the trick!

  • And that did the trick! You rock! Thank you so much!

  • And I spoke too soon. This works fine on one of my systems. But on my main box, when I disable that driver, all hell breaks loose. I can’t even reboot without a hard reset. What am I missing?

  • Anthony, I can’t thank you enough for this post! It fixed the issue on my Lenovo TS140 after an install of ESXi 6.5 and enabled me to go to bed at a sensible hour 🙂

  • It looks like the content between the less-than and greater-than characters was stripped.

    The post should have read –

    Thanks for your post. Went from less than 16Mbps to greater than 500Mbps. Wellsburg AHCI Controller.

    If you can fix it, that would be great. Sorry for the bother.

  • The driver released on 3/14/2017 does not seem to help either:
    1.0.0-34vmw.650.0.14.5146846

  • For those of you who were helped by this fix, what are you seeing in terms of SSD performance?

    After implementing the fix, I am getting 500+ MB/s read, but only 60 MB/s write. Using CrystalDiskMark to test. Verified SSD could get 500 MB/s read/write outside ESXi.

  • Thank you thank you. I was getting only 2Mbps upload speed and disabling the AHCI module helped a lot. Now I’m back to averaging around 900Mbps when uploading files to the datastore.

  • Although not the same issue I was experiencing, this article was hugely helpful! Thanks for the write-up!

    This past weekend, I decided to upgrade my ESXi instance from 6.0.0 to 6.5.0d. All went well during the upgrade, and post-upgrade I didn’t notice any major performance issues with my VMs…

    That is until I tried to access my 3TB WD drive that I pass through to a Windows guest as an RDM. Most of my files on the nearly 90% full drive would return an NTFS incorrect parameter error.

    Chkdsk reported errors but was also unable to correct them, as it reported no space available. I was scratching my head for the first couple of days: I knew the drive was physically OK (I regularly log the SMART output), but I hadn’t accessed it in a while and I did have a power outage earlier that week, so I couldn’t rule out corruption. Plus I forgot to remove the drive when doing the ESXi upgrade, which is something I like to do just to ensure everything “plays nice” and the pseudo-Linux ESXi runs on doesn’t step on my NTFS-formatted drive.

    So I fruitlessly chased after the “incorrect parameter” error, thinking first it was an issue with my USN journal (chkdsk first reported an inability to read the journal), then my MFT (after restarting the journal, chkdsk complained it couldn’t write the backup MFT) and then finally the partition itself. I was dreading having to rewrite the partition table and/or reformat and restore the data… Plus I was going to lose about 20-30 gigs of stuff I didn’t have backed up for one reason or another.

    After 3 days (chkdsk on a 3TB drive takes a while) I was getting ready to bite the bullet and use a disk recovery tool to rewrite the partition information, hopefully with the correct info to give me access again. That’s when I noticed they were all reporting the disk as 750 GB. That gave me a bit of pause, because I didn’t want to write a 750 GB partition header, I wanted 2700GB, and it made me question whether it was an MBR/GPT issue with the hypervisor.

    That’s when I started really investigating the hypervisor. My first thought was that VMware had pulled the plug on RDMs; they are, after all, unsupported, and even my disks say they don’t support it (at least according to VMware). I tried to create a “new” RDM and the vmkfstools command failed with an error about the disk not supporting the mode. A few tries later, I was able to successfully create a new RDM with slightly different info than the old drive, but still no luck; same NTFS errors.

    Finally I tried restoring back to 6.0, but since I was remote I’d have to do a downgrade; I couldn’t just reboot and select the previous boot parameters. The downgrade failed, as VMware will not let you go back to the 6.0.0 version I was on from 6.5.0d. While looking at the logs to find out why it wouldn’t let me downgrade, I saw a ton of messages about disks in the vmkwarning log.

    Every one of my disks. Both SSDs, my mechanical 1.5TB datastore, my USB thumb drive install and my 3TB drive were reporting WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “” state in doubt; requested fast path state update…

    It was logging the message once every 20-30 seconds per drive. I couldn’t find anything on the message specifically. It clearly has to do with multipathing, which my ICH10 SATA is not set up for, nor does it support it, but nobody commented on experiencing a similar message/error. Through a lot of searching I stumbled on your page; though not an exact match, I figured “new” drivers could explain it, so I figured it was worth a shot.

    After disabling the vmw_ahci VIB and rebooting, the messages went away; better yet, a full chkdsk of my RDM in my Windows VM returned with no errors and all of my files were available with no corruption!

    Now the only annoying message I get is when the hypervisor decides to query SMART on the USB thumb drive I use as a boot partition. For some reason it detects it as a SCSI device and tries to query it once every 40 minutes, logging the following message to the vmkernel log:

    2017-05-20T02:29:31.868Z cpu5:65574)ScsiDeviceIO: 2948: Cmd(0x439500ac7f80) 0x1a, CmdSN 0x69be from world 0 to dev “drive info” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

    Decoding it, the Host (H) is OK and the Plugin (P) is OK, but the disk/device (D) is returning a check condition with sense data meaning “Illegal request.” Since it’s just a generic USB thumb drive it doesn’t have SMART, nor does it think it’s a SCSI device, so that makes sense… Now if you could tell me how to make VMware stop logging that message WITHOUT disabling SMART for all my other drives (other bloggers seem to think disabling SMART on all drives is an acceptable response to an otherwise harmless nuisance log message about one drive), you would just be really awesome.
