====== 4k HDD Partition Alignment Primer ======

Although HDD storage densities have increased dramatically over the years, one of the most elemental aspects of hard disk drive design, the logical block format size known as a sector, has remained constant. Beginning in late 2009, accelerating in 2010 and hitting the mainstream in 2011, all major manufacturers are migrating away from the legacy sector size of 512 bytes to a larger, more efficient sector size of 4096 bytes, generally referred to as 4k or AF (Advanced Format).

While researching the benefits and consequences of the 512 -> 4k transition, many reports of "partition misalignment issues" turned up, all claiming a severe performance impact. That prompted a closer investigation to verify the alleged problem and the proposed correct partition alignment.

The result is clear: misaligned partitions on 4k hard disks cause a severe performance penalty, in this test case by a factor of roughly 5.4 (aligned: 83MB/s vs. misaligned: 15.5MB/s).

===== Test Setup =====

The following test setup was used to verify the partition misalignment impact introduced by 4k and to evaluate the performance/usability of USB 3.0, bridging common off-the-shelf low-power (green) 4k SATA hard disks to serve as Apollo's primary/secondary data vault. It also serves well to debunk and demystify common FUD about USB 3.0 and "painfully slow" external storage on cheap 5400rpm eco-friendly hard disk drives.

  * MS-Tech LP-06U USB 3.0 PCIe (1x) Controller (NEC Chip)[(As of Jan 2011 there is only one USB 3.0 controller available, the NEC D720200.)]
  * Sharkoon SATA QuickPort Duo USB 3.0 V2[(The Sharkoon USB 3.0 to SATA II bridge (JMicron chip) has two slots for two drives. So far it is impossible to use two disks at the same time: as soon as two disks are inserted, Linux sees either a random one or none at all. For all tests only one drive was inserted at a time; the other slot was left empty.)]
  * Western Digital Caviar Green 2TB Desktop (WD20EARS-00MVWB0)[(There are currently two versions of this model out there: this version with 3x 667GB platters and another (older) one with 4x 500GB platters. The 4-platter disk is supposed to be much slower. To identify it, look at the bottom of the drive: its casing is not indented, which leaves more room for the fourth platter.)]
    * Platters: 3x 667GB
    * Cache: 64MB
    * Speed: Dynamic 5400 - 7200 RPM (IntelliPower)
    * Max Power Rating: 5V@0.7A / 12V@0.55A
    * Weight: 0.730kg
    * Production Date: Sep 2010
    * Made in: Malaysia
  * Samsung SpinPoint EcoGreen F4 2TB (HD204UI)[(The default firmware of Samsung's HD204UI suffers from a very nasty bug: if the drive receives an "IDENTIFY DEVICE" command while writing data, it sometimes discards the cache contents even though the data has not yet been written to the disk, and leaves the operating system completely unaware of that. [[http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks|SamsungF4EGBadBlocks]])]
    * Platters: 3x 667GB
    * Cache: 32MB
    * Speed: 5400 RPM
    * Max Power Rating: 5V@0.85A / 12V@0.5A
    * Weight: 0.650kg
    * Production Date: Dec 2010
    * Made in: China
  * Gentoo-Kernel 2.6.37 (non-genkernel)[(USB 3.0 performs pretty well with 2.6.37, but it is a pain with anything older due to power management issues that can freeze the entire system when the USB device is woken up.)]

~~REFNOTES~~

===== Partition Alignment =====

So why is a 4k sector size such a bad thing that we suddenly have to take care of partition alignment on rotational disks too? In fact it is not. The culprit is the 512-byte emulation the industry was forced to implement so that some operating systems, such as Microsoft Windows, can handle the disks at all.

Both disk drives in this test have a physical sector size of 4k but report 512-byte sectors to the OS. Performance degrades whenever the drive's (hidden) internal 4k sector boundaries do not coincide with the 4k logical blocks, clusters and virtual memory pages common in many operating systems and file systems. A single misaligned 4k write then straddles two physical sectors, so the drive is forced to perform two read-modify-write operations to satisfy it. Recent kernels and user-land tools support disk alignment and would work like a charm if the disk presented itself in native 4k mode.

The key to misalignment lies in the partition table, which occupies either 512 bytes (LBA 0) for a legacy msdos-type MBR or LBA 0-33 for the primary GUID Partition Table (GPT). **Misalignment will occur by default if the first partition is placed immediately after the partition table, as the next block is LBA 1 for msdos-type MBRs and LBA 34 for GPT**. To align the 4k logical blocks with the physical 4k sectors on the platter, the sectors following the partition table have to be left empty until a sector number divisible by 8 is reached (sector 8 for msdos and sector 40 for GPT).

Until the industry finally lets go of 512-byte compatibility in favor of native 4k, people need to be aware of this issue; otherwise they may suffer a heavy performance penalty by partitioning disks the way they are used to, without proper alignment.

If you want to test your own hardware to verify these results, go ahead with the next section:

==== Benchmark Code ====

The following code was offered by no.op on gentoo-forums [[http://forums.gentoo.org/viewtopic-t-848978.html|Samsung F4 HD204UI 2TB & best parted alignment]]; only the size was changed from 512M to 8G, in order to get a better perspective on the timescale and to avoid hidden caching interference.
<code c>
#define _FILE_OFFSET_BITS 64

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* One zero-filled 4 KiB block that is written over and over. */
unsigned char buffer[4096];

int main(int argc, char **argv)
{
    int fd;
    int opt;
    off_t off;
    off_t base = 0;                   /* start offset in bytes */
    off_t stride = sizeof buffer;     /* distance between writes in bytes */
    size_t size = 1024 * 1024 * 1024; /* total number of bytes to write */
    const char *device = NULL;
    int do_sync = 0;

    while ((opt = getopt(argc, argv, "d:b:i:s:S")) != -1) {
        switch (opt) {
        case 'd':
            device = optarg;
            break;
        case 'b':
            /* the base is given in 512-byte sectors */
            base = atoll(optarg) * 512;
            break;
        case 'i':
            stride = atoll(optarg);
            break;
        case 's':
            size = atoll(optarg);
            break;
        case 'S':
            do_sync = 1;
            break;
        default:
            fprintf(stderr, "Usage: part-align-bench "
                    "[-b <base sector>] [-i <stride>] "
                    "[-s <size>] [-S] -d <device>\n");
            return 1;
        }
    }

    if (device == NULL) {
        fprintf(stderr, "missing device name argument\n");
        return 1;
    }

    fd = open(device, O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off = base;
    printf("part-align-bench: %s base sector=%lld "
           "stride=%lld size=%lld %s\n",
           device, (long long)(base / 512), (long long)stride,
           (long long)size, do_sync ? "do_sync" : "no_sync");

    /* Write one block per iteration until less than a full block remains. */
    while (size >= sizeof buffer) {
        if (lseek(fd, off, SEEK_SET) == (off_t)-1) {
            perror("lseek");
            close(fd);
            return 1;
        }
        if (write(fd, buffer, sizeof buffer) != (ssize_t)sizeof buffer) {
            perror("write");
            close(fd);
            return 1;
        }
        if (do_sync && fdatasync(fd) < 0) {
            perror("fdatasync");
            close(fd);
            return 1;
        }
        off += stride;
        size -= sizeof buffer;
    }

    close(fd);
    return 0;
}
</code>

==== Compile ====

<code>
gcc -march=native -O2 -pipe -fomit-frame-pointer -o part-align-bench part-align-bench.c
</code>

==== Run benchmark ====

The following tests **will overwrite the beginning of the disk, including the partition table, so treat all data on it as lost**. Only continue when you feel confident that you know what you are doing. Take special care to check that your HDD really is /dev/sda, or change the -d option according to your own setup!
<code>
$ time ./part-align-bench -d /dev/sda -b 0
part-align-bench: /dev/sda base sector=0 stride=4096 size=8589934592 no_sync

real    1m20.195s
user    0m0.172s
sys     0m6.973s

$ time ./part-align-bench -d /dev/sda -b 34
part-align-bench: /dev/sda base sector=34 stride=4096 size=8589934592 no_sync

real    7m51.229s
user    0m0.660s
sys     0m34.763s
</code>

==== Results ====

Elapsed time per run by start sector. Sectors 0, 8 and 40 are divisible by 8 (aligned); sectors 34 and 42 are not (misaligned). The timings in the table appear to correspond to the 1 GiB default size of the listing (1024 MiB in about 12.4 s is roughly 83MB/s), whereas the sample runs above wrote 8 GiB.

^ Device  ^ Sector 0 ^ Sector 8 ^ Sector 34 ^ Sector 40 ^ Sector 42 ^
| WD      | 12.545s  | 12.436s  | 65.792s   | 12.211s   | 66.341s   |
| HD204UI | 10.141s  | 10.153s  | 59.064s   | 10.126s   | 59.010s   |

A copy onto a misaligned partition:

<code>
morpheus / # time cp -a /usr /mnt/usb/

real    19m13.331s
user    0m5.086s
sys     1m9.511s

morpheus / # du -sch /usr
13G     /usr
13G     total
</code>

==== Conclusion ====

Looking at the results, the impact of misaligned partitions is clearly visible:

**WD:**
  * aligned: 83MB/s
  * misaligned: 15.5MB/s

**Samsung:**
  * aligned: 102MB/s
  * misaligned: 17MB/s

Partitions should begin on sectors that are divisible by 8 (8 sectors of 512 bytes == one 4 kB internal sector of the HDD).

Alignment:
  * No partition table: write the filesystem directly to /dev/sda, starting at sector 0.
  * GPT partition table, one partition: start it at sector 40.
  * GPT partition table, multiple partitions: start the first at sector 40; all following partitions must start at sectors divisible by 8.
  * msdos partition table: the MBR is 512 bytes long, so theoretically sector 8 would be the start sector for the first partition.
==== Create Partition ====

Create a GPT partition table:

<code>
$ parted --script /dev/sda mklabel gpt
</code>

Align the primary partition to sector 40 for best performance:

<code>
$ time parted --align=min --script /dev/sda mkpart primary 40s 100%

real    0m0.021s
user    0m0.002s
sys     0m0.001s

$ parted --script /dev/sda unit s print
Model: WDC WD20 EARS-00MVWB0 (scsi)
Disk /dev/sda: 3907029168s
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start  End          Size         File system  Name     Flags
 1      40s    3907029134s  3907029095s               primary
</code>

[[http://forums.gentoo.org/viewtopic-t-848978.html?sid=ec0b458cb2c449f8a050c7912fa023cf]]

===== File system Alignment =====

<code>
time mkfs.ext4 -v -b 4096 -E stride=128,stripe-width=128 -O ^has_journal /dev/sda1
mke2fs 1.41.12 (17-May-2010)
fs_types for mke2fs.conf resolution: 'ext4', 'default'
Calling BLKDISCARD from 0 to 2000398893056 failed.
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=128 blocks, Stripe width=128 blocks
122101760 inodes, 488378636 blocks
24418931 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
14905 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 39 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

real    9m27.444s
user    0m1.696s
sys     0m40.024s
</code>

<code>
bonnie++ -d /mnt/usb/ -u root
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1203  98 95010   9 43288   7  3663  96 125455  12 161.2   3
Latency              7299us     226ms     227ms   27785us   22094us     237ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               668us     580us    1033us     612us      69us     611us
</code>

Tuned mount options:

<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1186  98 92841   9 43259   7  3638  96 120936  12 159.7   3
Latency              7389us     226ms     147ms   23610us   25269us     274ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               658us     605us     627us     617us      12us     636us
1.96,1.96,morpheus,1,1295728732,8G,,1186,98,92841,9,43259,7,3638,96,120936,12,159.7,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,++ s,226ms,147ms,23610us,25269us,274ms,658us,605us,627us,617us,12us,636us
</code>

Default mkfs.ext4 and mount with no options:

<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G   736  98 93696  11 42589   7  3660  97 120755  12 162.5   3
Latency             11837us     226ms     227ms   23548us   23374us     286ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               612us    2088us     657us     602us     334us     620us
1.96,1.96,morpheus,1,1295730229,8G,,736,98,93696,11,42589,7,3660,97,120755,12,162.5,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,11837us,226ms,227ms,23548us,23374us,286ms,612us,2088us,657us,602us,334us,620us
</code>

<code>
mkfs.ext4 -v -b 4096 -m 0 -E stride=16,stripe-width=32 /dev/sda1
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G   747  98 94455  10 42634   7  3808  97 119933  12 154.1   4
Latency             11611us     226ms     226ms   22920us   26877us     253ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               621us     503us     543us     637us      13us     615us
1.96,1.96,morpheus,1,1295759965,8G,,747,98,94455,10,42634,7,3808,97,119933,12,154.1,4,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,11611us,226ms,226ms,22920us,26877us,253ms,621us,503us,543us,637us,13us,615us
</code>

Largefile support (very fast mkfs!):

<code>
morpheus / # time mkfs.ext4 -v -m 0 -T largefile4 -O ^has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize /dev/sda1
mke2fs 1.41.12 (17-May-2010)
fs_types for mke2fs.conf resolution: 'ext4', 'largefile4'
Calling BLKDISCARD from 0 to 2000398893056 failed.
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
476960 inodes, 488378636 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
14905 block groups
32768 blocks per group, 32768 fragments per group
32 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 33 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

real    0m19.368s
user    0m0.089s
sys     0m0.456s
</code>

<code>
mount -o noatime,data=writeback,barrier=0 /dev/sda1 /mnt/usb/
</code>

wd:

<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1140  98 94574   9 43097   7  3642  98 119887  12 158.3   3
Latency              7905us     226ms     227ms   13808us   16982us     217ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               675us     453us     617us     601us      25us     697us
1.96,1.96,morpheus,1,1295758867,8G,,1140,98,94574,9,43097,7,3642,98,119887,12,158.3,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7905us,226ms,227ms,13808us,16982us,217ms,675us,453us,617us,601us,25us,697us
</code>

samsung:

<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1198  98 106494  11 52066   8  3744  96 132966  13 140.7   2
Latency              7266us     128ms     126ms   31119us   31659us     313ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               601us     409us     541us      97us      64us      82us
1.96,1.96,morpheus,1,1295814353,8G,,1198,98,106494,11,52066,8,3744,96,132966,13,140.7,2,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7266us,128ms,126ms,31119us,31659us,313ms,601us,409us,541us,97us,64us,82us
</code>

==== USB Mass Storage Tuning ====

Linux 2.6 gives you the ability to see and to change the max_sectors value for each USB storage device independently. Assuming you have a sysfs filesystem mounted on /sys and assuming /dev/sda is a USB drive, you can see the max_sectors value for /dev/sda simply by running:

<code>
$ cat /sys/block/sda/device/max_sectors
</code>

and you can set max_sectors to 2048 by running (as root):

<code>
$ echo 2048 > /sys/block/sda/device/max_sectors
</code>

Values should be positive multiples of 8 (16 on the Alpha and other 64-bit platforms). There is no upper limit, but you probably shouldn't make max_sectors much bigger than 2048 (corresponding to 1 MB, which is quite a lot).
wd:

<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1174  98 104857  10 45213   6  3613  97 122696  10 160.4   2
Latency              7457us     226ms     226ms   25125us   14871us     264ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               631us     412us     656us     507us      95us      82us
1.96,1.96,morpheus,1,1295757422,8G,,1174,98,104857,10,45213,6,3613,97,122696,10,160.4,2,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7457us,226ms,226ms,25125us,14871us,264ms,631us,412us,656us,507us,95us,82us
</code>

samsung:

<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1211  98 114716  11 55022   7  3741  97 141067  12 141.2   3
Latency              7277us     127ms     126ms   25340us   29201us     258ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               533us     415us     602us     644us      58us     580us
1.96,1.96,morpheus,1,1295812838,8G,,1211,98,114716,11,55022,7,3741,97,141067,12,141.2,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7277us,127ms,126ms,25340us,29201us,258ms,533us,415us,602us,644us,58us,580us
</code>

===== USB 3.0 Performance =====

samsung write:

<code>
$ time dd if=/dev/zero bs=4096 count=10000000 of=/mnt/usb/40Gb.file
10000000+0 records in
10000000+0 records out
40960000000 bytes (41 GB) copied, 369.997 s, 111 MB/s

real    6m10.016s
user    0m1.317s
</code>

read:

<code>
$ time dd if=/mnt/usb/40Gb.file bs=64k of=/dev/null
625000+0 records in
625000+0 records out
40960000000 bytes (41 GB) copied, 312.399 s, 131 MB/s

real    5m12.446s
user    0m0.252s
sys     0m34.811s
</code>

{{tag>linux 4k partition benchmark hdd usb test}}

~~DISCUSSION~~