Using TRIM and DISCARD with SSDs attached to RAID controllers

Learn to groom your SSDs to maintain the best performance.
Image: Maintaining disk performance (photo by Markus Spiske from Pexels)

SSDs are now commonplace and have been the default choice for performance-oriented storage in both enterprise and consumer environments for the past few years. SSDs are fast, but many people with high-end machines face this dilemma: My SSD sits behind a RAID controller that doesn't expose the device's DISCARD (TRIM) capabilities. How do I discard unused blocks to keep my SSD performing at its best? Here's a trick to do just that without having to disassemble your machine. Recent improvements in SSD firmware have made it less critical for applications writing to SSDs to issue DISCARD/TRIM requests.

There are, however, cases in which you may need the filesystem to inform the drive about the blocks it has discarded. Perhaps you have TLC (3 bits per cell) or QLC (4 bits per cell) drives instead of the usually more expensive enterprise-class SLC or MLC drives (the latter are less susceptible to a performance drop since they set aside more spare blocks to help with overwrites when the drive is at capacity). Or maybe you once filled your SSD to 100%, and now you cannot get the original performance/IOPS back.

On most systems, getting the performance back is usually a simple matter of issuing a filesystem trim (fstrim) command. Here's an example using a Red Hat Enterprise Linux (RHEL) system:

[root@System_A ~]# fstrim -av
/export/home: 130.5 GiB (140062863360 bytes) trimmed
/var: 26.1 GiB (28062511104 bytes) trimmed
/opt: 17.6 GiB (18832797696 bytes) trimmed
/export/shared: 31.6 GiB (33946275840 bytes) trimmed
/usr/local: 5.6 GiB (5959331840 bytes) trimmed
/boot: 678.6 MiB (711565312 bytes) trimmed
/usr: 36.2 GiB (38831017984 bytes) trimmed
/: 3 GiB (3197743104 bytes) trimmed
[root@System_A ~]#
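
As an aside: when a device does expose DISCARD, you usually don't need to run fstrim by hand. Recent RHEL releases and most modern distributions ship util-linux's fstrim.timer systemd unit, which runs fstrim periodically (weekly by default), and enabling it is a single command:

[root@System_A ~]# systemctl enable --now fstrim.timer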


There's one catch, though...

If your SSDs are behind a RAID volume attached to a RAID controller (HPE's SmartArray, Dell's PERC, or anything based on LSI/Avago's MegaRAID), here's what happens:

[root@System_B ~]# fstrim -av
[root@System_B ~]# 

Nothing. Nothing happens at all. At the end of the SCSI I/O chain, the capabilities of a device boil down to what the device itself and the RAID controller driver it sits behind choose to expose.

Let's take a closer look. Here's an SSD (a Samsung 860 EVO 2TB drive) attached to a SATA connector on a RHEL system (we will name that system System_A in the rest of this document):

[root@System_A ~]# lsscsi 
[3:0:0:0]    disk    ATA      Samsung SSD 860  3B6Q  /dev/sda 

Here's an identical drive (same model, same firmware) behind a RAID controller (a PERC H730P) on a different system (let's call that system System_B in the rest of this document):

[root@System_B ~]# lsscsi 
[0:2:0:0]    disk    DELL     PERC H730P Adp   4.30  /dev/sda 

How do I know it's the same drive? The RAID HBA can be queried with megaclisas-status, which shows this:

[root@System_B ~]# megaclisas-status
-- Controller information --
-- ID | H/W Model          | RAM    | Temp | BBU    | Firmware     
c0    | PERC H730P Adapter | 2048MB | 60C  | Good   | FW: 25.5.7.0005 

-- Array information --
-- ID | Type   |    Size |  Strpsz |   Flags | DskCache |   Status |  OS Path | CacheCade |InProgress   
c0u0  | RAID-0 |   1818G |  512 KB | ADRA,WB |  Enabled |  Optimal | /dev/sda | None      |None         

-- Disk information --
-- ID   | Type | Drive Model                                      | Size     | Status          | Speed    | Temp | Slot ID  | LSI ID  
c0u0p0  | SSD  | S3YUNB0KC09340D Samsung SSD 860 EVO 2TB RVT03B6Q | 1.818 TB | Online, Spun Up | 6.0Gb/s  | 23C  | [32:0]   | 0   

Yes, it's the same drive (a Samsung 860 EVO) and the same firmware (3B6Q).
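
Another way to confirm what sits behind a MegaRAID-based controller is to query the physical drive directly with smartctl, assuming smartmontools is installed. The number passed after -d megaraid, corresponds to the LSI ID reported above (0 in this case), and the command prints the drive's model and firmware among other details:

[root@System_B ~]# smartctl -i -d megaraid,0 /dev/sda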

Using lsblk, we can display the DISCARD capabilities of these two devices:

[root@System_A ~]# lsblk -dD
NAME     DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda             0      512B       2G         1
[root@System_B ~]# lsblk -dD
NAME      DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda              0        0B       0B         0

Here's the culprit. All of the values are zero. The SSD in a RAID 0 behind a PERC H730P on System_B does not expose any DISCARD capabilities. This is the reason why fstrim on System_B did not do or return anything.
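
You can cross-check this without lsblk by reading the block device's queue parameters from sysfs; these are the standard block-layer files that lsblk reports on:

[root@System_B ~]# cat /sys/block/sda/queue/discard_granularity /sys/block/sda/queue/discard_max_bytes
0
0

On System_A, the same files report non-zero values, matching the 512B granularity and 2G maximum seen above.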

HPE SmartArray systems are affected in a similar way. Here's an HPE DL360 Gen10 with a high-end SmartArray RAID card:

[root@dl360gen10 ~]# lsblk -dD
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda         0        0B       0B         0
sdc         0        0B       0B         0
sdd         0        0B       0B         0
sde         0        0B       0B         0
sdf         0        0B       0B         0
sdg         0        0B       0B         0
sdh         0        0B       0B         0

All LSI-based (megaraid_sas driver) and SmartArray-based (hpsa driver) systems suffer from this problem. If you want to TRIM your SSDs, you would have to shut down System_B, pull the drive out, connect it to a system with a plain SAS/SATA controller, and run fstrim there.

Fortunately for us, there's a small trick to temporarily expose the native capabilities of your device and TRIM it. This requires taking down the application that uses your RAID drive, but at least it does not require you to walk to a data center and pull hardware out of a system.

The trick is to stop using the RAID drive through the RAID driver, expose the SSD as a JBOD, re-mount the filesystem, and TRIM it there. Once the blocks have been discarded, simply put the drive back in RAID mode, re-mount the filesystem, and restart your applications.

There are a couple of caveats:

  • The RAID hardware you are using must allow devices to be put in JBOD mode (see the check after this list).
  • You cannot do this on your boot disk as it would require taking down the OS.
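
Regarding the first caveat: on MegaRAID-based controllers managed with MegaCli, you can query the adapter's JBOD property up front to see whether JBOD mode is available and currently enabled (assuming your controller firmware and MegaCli build expose this property):

[root@System_C ~]# MegaCli -AdpGetProp -EnableJBOD -a0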

Walking through the process

Here is a small walk-through created on a system with a Dell PERC H730P and a Samsung SSD. We'll call this system System_C.

1) The SSD is at [32:2] on HBA a0, and we'll create a single RAID 0 drive from it:

[root@System_C ~]# MegaCli -CfgLdAdd -r0 [32:2] WB RA CACHED -strpsz 512 -a0
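
If you have not noted the target ID assigned to the new logical drive (you will need it later, when the logical drive has to be deleted), you can list the controller's logical drives along with their RAID level, stripe size, and cache policy:

[root@System_C ~]# MegaCli -LDInfo -Lall -a0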

2) The new logical drive pops up as /dev/sdd and shows no DISCARD capabilities:

[root@System_C ~]# lsblk -dD
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
[....]
sdd         0        0B       0B         0

3) Next, partition the device, then create a volume group (VG), a 128G logical volume, and an ext4 filesystem on top of it (parted is used interactively here to create a single partition spanning the disk):

[root@System_C ~]# parted /dev/sdd
[root@System_C ~]# pvcreate /dev/sdd1
[root@System_C ~]# vgcreate testdg /dev/sdd1
[root@System_C ~]# lvcreate -L 128G -n lv_test testdg
[root@System_C ~]# mke2fs -t ext4 /dev/testdg/lv_test
[root@System_C ~]# mount /dev/testdg/lv_test /mnt

For the sake of this demonstration, we'll copy some data to /mnt.

4) Stop using the system and export the volume group:

[root@System_C ~]# umount /mnt
[root@System_C ~]# vgchange -a n testdg
  0 logical volume(s) in volume group "testdg" now active
[root@System_C ~]# vgexport testdg
  Volume group "testdg" successfully exported

5) Enable JBOD mode on the HBA:

[root@System_C ~]# MegaCli -AdpSetProp -EnableJBOD -1 -a0

Adapter 0: Set JBOD to Enable success.

Exit Code: 0x00

6) Delete the logical drive and turn the physical drive into a JBOD. On most RAID controllers, safety checks prevent you from making a JBOD out of a drive that is still part of a logical drive:

[root@System_C ~]# MegaCli -PDMakeJBOD -PhysDrv[32:2] -a0

Adapter: 0: Failed to change PD state at EnclId-32 SlotId-2.

Exit Code: 0x01

The solution here is to delete the logical drive. This only removes the controller's logical drive definition, and it will not touch our data. However, you must have written down the command used to create the RAID 0 array in the first place, so that you can re-create it identically later.

[root@System_C ~]# MegaCli -CfgLdDel -L3 -a0
                                     
Adapter 0: Deleted Virtual Drive-3(target id-3)

Exit Code: 0x00
[root@System_C ~]# MegaCli -PDMakeJBOD -PhysDrv[32:2] -a0
                                     
Adapter: 0: EnclId-32 SlotId-2 state changed to JBOD.

Exit Code: 0x00

7) Refresh the kernel's view of the disks and import your data:

[root@System_C ~]# partprobe
[root@System_C ~]# vgscan 
  Reading volume groups from cache.
  Found exported volume group "testdg" using metadata type lvm2
  Found volume group "rootdg" using metadata type lvm2

[root@System_C ~]# vgimport testdg
  Volume group "testdg" successfully imported

[root@System_C ~]# vgchange -a y testdg
  1 logical volume(s) in volume group "testdg" now active

[root@System_C ~]# mount /dev/testdg/lv_test /mnt

[root@System_C ~]# fstrim -v /mnt
/mnt: 125.5 GiB (134734139392 bytes) trimmed
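
You can also confirm that the device, now exposed as a JBOD, advertises its native DISCARD capabilities again. Note that the device name may have changed after the JBOD conversion, so identify it with lsscsi or lsblk first:

[root@System_C ~]# lsblk -dD

The JBOD device should show non-zero DISC-GRAN and DISC-MAX values, just like the directly-attached drive on System_A.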

We have discarded the empty blocks on our filesystem. Let's put it back in a RAID 0 logical drive.

8) Unmount the filesystem and export the volume group:

[root@System_C ~]# umount /mnt
[root@System_C ~]# vgchange -a n testdg
  0 logical volume(s) in volume group "testdg" now active
[root@System_C ~]# vgexport testdg
  Volume group "testdg" successfully exported

9) Disable JBOD mode on the RAID controller:

[root@System_C ~]# MegaCli -AdpSetProp -EnableJBOD -0 -a0

Adapter 0: Set JBOD to Disable success.

Exit Code: 0x00

10) Re-create your logical drive:

[root@System_C ~]# MegaCli -CfgLdAdd -r0 [32:2] WB RA CACHED -strpsz 512 -a0

11) Ask the kernel to probe the disks and re-mount your filesystem:

[root@System_C ~]# partprobe
[root@System_C ~]# vgscan 
  Reading volume groups from cache.
  Found exported volume group "testdg" using metadata type lvm2
  Found volume group "rootdg" using metadata type lvm2

[root@System_C ~]# vgimport testdg
  Volume group "testdg" successfully imported

[root@System_C ~]# vgchange -a y testdg
  1 logical volume(s) in volume group "testdg" now active

[root@System_C ~]# mount /dev/testdg/lv_test /mnt

Your data should be there, and the performance of your SSD should be back to its original figures.
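
If you want to quantify the recovery, you can run the same benchmark before and after the procedure and compare the numbers. Here is one possible fio invocation (fio must be installed; the target file and parameters are only an example, adjust them to your environment):

[root@System_C ~]# fio --name=randwrite --filename=/mnt/fio.test --rw=randwrite --bs=4k --size=4G --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based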


Wrapping up

Here are a few additional notes:

  • This procedure should be taken with a grain of salt and with a large warning: DO NOT perform this unless you are confident you can identify logical drives and JBODs on a Linux system.
  • I have only tested this procedure using RAID 0 logical drives. It seems unlikely that it would work for other RAID levels (5, 6, 1+0, etc.), because with those layouts the data is spread across multiple drives, and no single drive exposed as a JBOD would contain a complete, mountable filesystem.
  • Please do not perform this procedure without verified backups.

Vincent Cojot

Vincent Cojot is a repentless geek. When he's not fixing someone else's OpenStack or RHEL environment, he can be seen trying to squeeze the most efficient use of resources out of server hardware.
