Although HDD storage densities have increased dramatically over the years, one of the most elemental aspects of hard disk drive design, the logical block format size known as a sector, has remained constant. Starting in late 2009, accelerating in 2010 and reaching the mainstream in 2011, all major manufacturers have been migrating away from the legacy sector size of 512 bytes to a larger, more efficient sector size of 4096 bytes, generally referred to as 4k or AF (Advanced Format).
While researching the benefits and consequences of the 512→4k transition, we found many reports of “partition misalignment issues” that could lead to a severe performance impact. This prompted a closer investigation to verify the alleged problem and the proposed correct partition alignment. The result is clear: misaligned partitions on 4k hard disks introduce a severe performance penalty, in this test case by a factor of more than 5 (aligned: 83 MB/s vs. misaligned: 15.5 MB/s).
The following test setup was used to verify the partition misalignment impact introduced by 4k drives and to evaluate the performance and usability of USB 3.0 as a bridge for common off-the-shelf low-power (green) 4k SATA hard disks serving as Apollo's primary/secondary data vault. It also helps to debunk and demystify common FUD about USB 3.0 and “painfully slow” external storage on cheap 5400rpm eco-friendly hard disk drives.
So why is the 4k sector size such a bad thing that we suddenly have to take care of partition alignment on rotational disks too? In fact, it isn't. The culprit is the 512-byte emulation the industry was forced to implement so that some operating systems, such as Microsoft Windows, can handle the disks at all.
Both disk drives in this test have a physical sector size of 4k but present 512-byte sectors to the OS, so performance degrades when the drive's (hidden) internal 4k sector boundaries do not coincide with the 4k logical blocks, clusters and virtual memory pages common in many operating systems and file systems. A single misaligned 4k write then straddles two physical sectors, forcing the drive to perform two read-modify-write operations. Recent kernels and user-land tools support disk alignment and would work like a charm if the disk presented itself in native 4k mode.
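To make the read-modify-write argument concrete, here is a minimal shell sketch that counts how many 4k physical sectors a write touches. It assumes 8 logical 512-byte sectors per physical sector, with physical boundaries at LBAs divisible by 8; the function name `physical_sectors_touched` is made up for this illustration.

```shell
# Sketch: count how many 4 KiB physical sectors a write touches.
# Assumption: 8 logical 512-byte sectors per physical sector, with
# physical sector boundaries at LBAs divisible by 8.
LOGICAL_PER_PHYSICAL=8

# physical_sectors_touched START_LBA COUNT  (hypothetical helper)
# Prints the number of physical sectors covered by COUNT logical
# sectors starting at START_LBA.
physical_sectors_touched() {
    first=$(( $1 / LOGICAL_PER_PHYSICAL ))
    last=$(( ($1 + $2 - 1) / LOGICAL_PER_PHYSICAL ))
    echo $(( last - first + 1 ))
}

# A 4 KiB write is 8 logical sectors long.
echo "4k write at LBA 40: $(physical_sectors_touched 40 8) physical sector(s)"
echo "4k write at LBA 34: $(physical_sectors_touched 34 8) physical sector(s)"
```

A write that covers only part of a physical sector is the case where the drive has to read the sector, modify it and write it back.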
The key to misalignment lies in the partition table, which occupies either 512 bytes (LBA 0) in the case of a legacy msdos-type MBR or LBA 0-33 for the primary GUID Partition Table (GPT).
Misalignment will occur by default if the first partition is placed immediately after the partition table, as the next block is LBA 1 for msdos-type MBRs and LBA 34 for GPT.
To align the 4k logical blocks with the physical 4k sectors on the platter, the sectors following the partition table have to be left empty until a sector number divisible by 8 is reached (sector 8 for msdos and sector 40 for GPT).
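The rounding rule can be written as a one-line calculation. The following is only a sketch; `align_up` is a name chosen for this example, and the default alignment of 8 assumes 512-byte logical sectors on a 4k drive.

```shell
# align_up SECTOR [ALIGNMENT]  (hypothetical helper)
# Prints the first sector number >= SECTOR that is divisible by
# ALIGNMENT (default 8, i.e. 8 x 512 bytes = 4 KiB).
align_up() {
    alignment=${2:-8}
    echo $(( ($1 + alignment - 1) / alignment * alignment ))
}

echo "msdos: first free LBA 1  -> start partition at $(align_up 1)"
echo "GPT:   first free LBA 34 -> start partition at $(align_up 34)"
```

For the two partition table types discussed above, this yields sector 8 and sector 40 respectively.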
Until the industry finally lets go of 512-byte compatibility in favor of native 4k, people need to be aware of this issue. Otherwise, partitioning disks the way they are used to, without proper alignment, may cost them a heavy performance penalty.
If you want to test your own hardware to verify these results, go ahead with the next section:
The following code was posted by no.op in the Gentoo Forums thread “Samsung F4 HD204UI 2TB & best parted alignment”. Only the size was changed from 512M to 8G, to get a better perspective on the timescale and to avoid hidden caching interference.
    #define _FILE_OFFSET_BITS 64
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>   /* atoll() */

    /* One 4 KiB block of zeros that is written over and over. */
    unsigned char buffer[4096];

    int main(int argc, char **argv)
    {
        int fd;
        int opt;
        off_t off;
        off_t base = 0;                     /* start offset in bytes */
        off_t stride = sizeof buffer;       /* distance between writes */
        size_t size = 1024 * 1024 * 1024;   /* total bytes to write */
        const char *device = NULL;
        int do_sync = 0;

        while ((opt = getopt(argc, argv, "d:b:i:s:S")) != -1) {
            switch (opt) {
            case 'd': device = optarg; break;
            case 'b': base = atoll(optarg) * 512; break;  /* given in 512-byte sectors */
            case 'i': stride = atoll(optarg); break;
            case 's': size = atoll(optarg); break;
            case 'S': do_sync = 1; break;
            default:
                fprintf(stderr, "Usage: part-align-bench "
                        "[-b <base sector>] [-i <stride>] "
                        "[-s <size>] [-S] -d <block device>\n");
                return 1;
            }
        }

        if (device == NULL) {
            fprintf(stderr, "missing device name argument\n");
            return 1;
        }

        fd = open(device, O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        off = base;
        printf("part-align-bench: %s base sector=%lld "
               "stride=%lld size=%lld %s\n",
               device, (long long)(base / 512), (long long)stride,
               (long long)size, do_sync ? "do_sync" : "no_sync");

        /* Write 4 KiB blocks at base, base + stride, ... until size bytes are written. */
        while (size > 0) {
            if (lseek(fd, off, SEEK_SET) == (off_t)-1) {
                perror("lseek");
                close(fd);
                return 1;
            }
            if (write(fd, buffer, sizeof buffer) != (ssize_t)sizeof buffer) {
                perror("write");
                close(fd);
                return 1;
            }
            if (do_sync && fdatasync(fd) < 0) {
                perror("fdatasync");
                close(fd);
                return 1;
            }
            off += stride;
            /* Avoid unsigned wrap-around if size is not a multiple of 4096. */
            size = size > sizeof buffer ? size - sizeof buffer : 0;
        }

        close(fd);
        return 0;
    }
gcc -march=native -O2 -pipe -fomit-frame-pointer -o part-align-bench part-align-bench.c
The following tests will erase the complete disk. Only continue when you feel confident that you know what you are doing. Take special care to check that your hdd really is /dev/sda or change the -d option according to your own setup!
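Because the benchmark writes straight to the block device, a quick sanity check before running it may be worthwhile. The sketch below only looks for the device name among the mount sources in `/proc/mounts`; it is not a complete safety check (for example, it will not notice a device used through LVM or by UUID), and `device_in_use` is a helper name invented for this example.

```shell
# device_in_use DEVICE [MOUNTS_FILE]  (hypothetical helper)
# Succeeds (exit status 0) if DEVICE or one of its partitions appears
# as a mount source in MOUNTS_FILE (default: /proc/mounts).
device_in_use() {
    awk -v dev="$1" 'index($1, dev) == 1 { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

if device_in_use /dev/sda; then
    echo "/dev/sda (or a partition on it) is mounted; not touching it." >&2
else
    echo "/dev/sda does not appear in /proc/mounts."
fi
```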
    $ time ./part-align-bench -d /dev/sda -b 0
    part-align-bench: /dev/sda base sector=0 stride=4096 size=8589934592 no_sync
    real 1m20.195s
    user 0m0.172s
    sys 0m6.973s

    $ time ./part-align-bench -d /dev/sda -b 34
    part-align-bench: /dev/sda base sector=34 stride=4096 size=8589934592 no_sync
    real 7m51.229s
    user 0m0.660s
    sys 0m34.763s
Device | Sector 0 | Sector 8 | Sector 34 | Sector 40 | Sector 42 |
---|---|---|---|---|---|
WD | 12.545s | 12.436s | 65.792s | 12.211s | 66.341s |
HD204UI | 10.141s | 10.153s | 59.064s | 10.126s | 59.010s |
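The timings in the table can be turned into throughput figures with a little arithmetic. The sketch below assumes the table runs wrote the program's built-in default of 1 GiB (1024 MiB); the source does not state the size used for the table, so treat that as my assumption. The helper names `throughput_mib_s` and `slowdown` are made up for this example.

```shell
# throughput_mib_s MIB SECONDS  (hypothetical helper)
# Prints MiB per second with one decimal place.
throughput_mib_s() {
    awk -v mib="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", mib / secs }'
}

# slowdown SLOW_SECONDS FAST_SECONDS  (hypothetical helper)
# Prints how many times longer the slow run took.
slowdown() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# WD timings taken from the table; the 1024 MiB size is an assumption.
echo "WD, sector 40 (aligned):    $(throughput_mib_s 1024 12.211) MiB/s"
echo "WD, sector 42 (misaligned): $(throughput_mib_s 1024 66.341) MiB/s"
echo "WD slowdown:                $(slowdown 66.341 12.211)x"
```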
misaligned cp:
    morpheus / # time cp -a /usr /mnt/usb/
    real 19m13.331s
    user 0m5.086s
    sys 1m9.511s
    morpheus / # du -sch /usr
    13G /usr
    13G total
Looking at the results, the impact of misaligned partitions is clearly visible:
WD:
Samsung:
Partitions should begin on sector numbers that are divisible by 8 (8 × 512-byte sectors = one 4 kB internal sector of the HDD).
Alignment:
- No partition table: write the filesystem directly to /dev/sda, starting at sector 0.
- GPT partition table, one partition: start it at sector 40.
- GPT partition table, multiple partitions: start the first at sector 40; the following partitions must start at sectors that are divisible by 8.

The msdos partition table/MBR is 512 bytes long, so theoretically sector 8 would be the start sector for the first partition.
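On Linux, the start sector of an existing partition can be read from sysfs (`/sys/block/<disk>/<partition>/start`, in 512-byte units), which makes the divisible-by-8 rule easy to check. The following is a sketch only; `start_is_aligned` is a name chosen for this example, and `/dev/sda` is assumed to be the disk under test.

```shell
# start_is_aligned START_FILE  (hypothetical helper)
# START_FILE holds a partition's start sector in 512-byte units, as
# /sys/block/<disk>/<partition>/start does on Linux.
# Succeeds if that sector number is divisible by 8.
start_is_aligned() {
    start=$(cat "$1") || return 2
    [ $(( start % 8 )) -eq 0 ]
}

for f in /sys/block/sda/sda*/start; do
    [ -r "$f" ] || continue   # no such disk or partition on this machine
    if start_is_aligned "$f"; then
        echo "$f: sector $(cat "$f") is 4k-aligned"
    else
        echo "$f: sector $(cat "$f") is NOT 4k-aligned"
    fi
done
```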
create gpt partition table:
$ parted --script /dev/sda mklabel gpt
align primary partition to sector 40 for best performance:
    $ time parted --align=min --script /dev/sda mkpart primary 40s 100%
    real 0m0.021s
    user 0m0.002s
    sys 0m0.001s

    $ parted --script /dev/sda unit s print
    Model: WDC WD20 EARS-00MVWB0 (scsi)
    Disk /dev/sda: 3907029168s
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt

    Number  Start  End          Size         File system  Name     Flags
     1      40s    3907029134s  3907029095s               primary
http://forums.gentoo.org/viewtopic-t-848978.html?sid=ec0b458cb2c449f8a050c7912fa023cf
    time mkfs.ext4 -v -b 4096 -E stride=128,stripe-width=128 -O ^has_journal /dev/sda1
    mke2fs 1.41.12 (17-May-2010)
    fs_types for mke2fs.conf resolution: 'ext4', 'default'
    Calling BLKDISCARD from 0 to 2000398893056 failed.
    Filesystem label=
    OS type: Linux
    Block size=4096 (log=2)
    Fragment size=4096 (log=2)
    Stride=128 blocks, Stripe width=128 blocks
    122101760 inodes, 488378636 blocks
    24418931 blocks (5.00%) reserved for the super user
    First data block=0
    Maximum filesystem blocks=4294967296
    14905 block groups
    32768 blocks per group, 32768 fragments per group
    8192 inodes per group
    Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848

    Writing inode tables: done
    Writing superblocks and filesystem accounting information: done

    This filesystem will be automatically checked every 39 mounts or
    180 days, whichever comes first. Use tune2fs -c or -i to override.

    real 9m27.444s
    user 0m1.696s
    sys 0m40.024s
    bonnie++ -d /mnt/usb/ -u root
    Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    morpheus         8G  1203  98 95010   9 43288   7  3663  96 125455  12 161.2   3
    Latency                7299us     226ms     227ms   27785us   22094us     237ms
    Version  1.96       ------Sequential Create------ --------Random Create--------
    morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency                 668us     580us    1033us     612us      69us     611us
tuned mount options:
    Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    morpheus         8G  1186  98 92841   9 43259   7  3638  96 120936  12 159.7   3
    Latency                7389us     226ms     147ms   23610us   25269us     274ms
    Version  1.96       ------Sequential Create------ --------Random Create--------
    morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency                 658us     605us     627us     617us      12us     636us
    1.96,1.96,morpheus,1,1295728732,8G,,1186,98,92841,9,43259,7,3638,96,120936,12,159.7,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,++ s,226ms,147ms,23610us,25269us,274ms,658us,605us,627us,617us,12us,636us

default mkfs.ext4 and mount with no options

    Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    morpheus         8G   736  98 93696  11 42589   7  3660  97 120755  12 162.5   3
    Latency               11837us     226ms     227ms   23548us   23374us     286ms
    Version  1.96       ------Sequential Create------ --------Random Create--------
    morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency                 612us    2088us     657us     602us     334us     620us
    1.96,1.96,morpheus,1,1295730229,8G,,736,98,93696,11,42589,7,3660,97,120755,12,162.5,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,11837us,226ms,227ms,23548us,23374us,286ms,612us,2088us,657us,602us,334us,620us
mkfs.ext4 -v -b 4096 -m 0 -E stride=16,stripe-width=32 /dev/sda1
    Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    morpheus         8G   747  98 94455  10 42634   7  3808  97 119933  12 154.1   4
    Latency               11611us     226ms     226ms   22920us   26877us     253ms
    Version  1.96       ------Sequential Create------ --------Random Create--------
    morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency                 621us     503us     543us     637us      13us     615us
    1.96,1.96,morpheus,1,1295759965,8G,,747,98,94455,10,42634,7,3808,97,119933,12,154.1,4,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,11611us,226ms,226ms,22920us,26877us,253ms,621us,503us,543us,637us,13us,615us
Largefile support
!!! very fast mkfs !!!
    morpheus / # time mkfs.ext4 -v -m 0 -T largefile4 -O ^has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize /dev/sda1
    mke2fs 1.41.12 (17-May-2010)
    fs_types for mke2fs.conf resolution: 'ext4', 'largefile4'
    Calling BLKDISCARD from 0 to 2000398893056 failed.
    Filesystem label=
    OS type: Linux
    Block size=4096 (log=2)
    Fragment size=4096 (log=2)
    Stride=0 blocks, Stripe width=0 blocks
    476960 inodes, 488378636 blocks
    0 blocks (0.00%) reserved for the super user
    First data block=0
    Maximum filesystem blocks=4294967296
    14905 block groups
    32768 blocks per group, 32768 fragments per group
    32 inodes per group
    Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848

    Writing inode tables: done
    Writing superblocks and filesystem accounting information: done

    This filesystem will be automatically checked every 33 mounts or
    180 days, whichever comes first. Use tune2fs -c or -i to override.

    real 0m19.368s
    user 0m0.089s
    sys 0m0.456s
mount -o noatime,data=writeback,barrier=0 /dev/sda1 /mnt/usb/
wd:
    Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    morpheus         8G  1140  98 94574   9 43097   7  3642  98 119887  12 158.3   3
    Latency                7905us     226ms     227ms   13808us   16982us     217ms
    Version  1.96       ------Sequential Create------ --------Random Create--------
    morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency                 675us     453us     617us     601us      25us     697us
    1.96,1.96,morpheus,1,1295758867,8G,,1140,98,94574,9,43097,7,3642,98,119887,12,158.3,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7905us,226ms,227ms,13808us,16982us,217ms,675us,453us,617us,601us,25us,697us
samsung
    Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    morpheus         8G  1198  98 106494  11 52066   8  3744  96 132966  13 140.7   2
    Latency                7266us     128ms     126ms   31119us   31659us     313ms
    Version  1.96       ------Sequential Create------ --------Random Create--------
    morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency                 601us     409us     541us      97us      64us      82us
    1.96,1.96,morpheus,1,1295814353,8G,,1198,98,106494,11,52066,8,3744,96,132966,13,140.7,2,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7266us,128ms,126ms,31119us,31659us,313ms,601us,409us,541us,97us,64us,82us
Linux 2.6 gives you the ability to see and to change the max_sectors value for each USB storage device, independently. Assuming you have a sysfs filesystem mounted on /sys and assuming /dev/sda is a USB drive, you can see the max_sectors value for /dev/sda simply by running:
$ cat /sys/block/sda/device/max_sectors
and you can set max_sectors to 2048 by running (as root):
$ echo 2048 > /sys/block/sda/device/max_sectors
Values should be positive multiples of 8 (16 on the Alpha and other 64-bit platforms). There is no upper limit, but you probably shouldn't make max_sectors much bigger than 2048 (corresponding to 1 MB, which is quite a lot).
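As a worked example of the numbers above: max_sectors counts 512-byte sectors, so 2048 sectors correspond to 1024 KiB (1 MB) per transfer. The sketch below converts a few candidate values and checks the multiple-of-8 rule; both helper names are invented for this illustration, and the values in the loop are arbitrary examples.

```shell
# max_sectors_to_kib SECTORS  (hypothetical helper)
# Prints the transfer size in KiB for a max_sectors value,
# counting 512 bytes per sector.
max_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# valid_max_sectors SECTORS  (hypothetical helper)
# Succeeds for positive multiples of 8.
valid_max_sectors() {
    [ "$1" -gt 0 ] && [ $(( $1 % 8 )) -eq 0 ]
}

for n in 128 1024 2048; do
    if valid_max_sectors "$n"; then
        echo "max_sectors=$n -> $(max_sectors_to_kib "$n") KiB per transfer"
    fi
done
```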
wd:
    Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    morpheus         8G  1174  98 104857  10 45213   6  3613  97 122696  10 160.4   2
    Latency                7457us     226ms     226ms   25125us   14871us     264ms
    Version  1.96       ------Sequential Create------ --------Random Create--------
    morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency                 631us     412us     656us     507us      95us      82us
    1.96,1.96,morpheus,1,1295757422,8G,,1174,98,104857,10,45213,6,3613,97,122696,10,160.4,2,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7457us,226ms,226ms,25125us,14871us,264ms,631us,412us,656us,507us,95us,82us
samsung:
    Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    morpheus         8G  1211  98 114716  11 55022   7  3741  97 141067  12 141.2   3
    Latency                7277us     127ms     126ms   25340us   29201us     258ms
    Version  1.96       ------Sequential Create------ --------Random Create--------
    morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency                 533us     415us     602us     644us      58us     580us
    1.96,1.96,morpheus,1,1295812838,8G,,1211,98,114716,11,55022,7,3741,97,141067,12,141.2,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7277us,127ms,126ms,25340us,29201us,258ms,533us,415us,602us,644us,58us,580us
samsung
write:
    $ time dd if=/dev/zero bs=4096 count=10000000 of=/mnt/usb/40Gb.file
    10000000+0 records in
    10000000+0 records out
    40960000000 bytes (41 GB) copied, 369.997 s, 111 MB/s
    real 6m10.016s
    user 0m1.317s
read:
    $ time dd if=/mnt/usb/40Gb.file bs=64k of=/dev/null
    625000+0 records in
    625000+0 records out
    40960000000 bytes (41 GB) copied, 312.399 s, 131 MB/s
    real 5m12.446s
    user 0m0.252s
    sys 0m34.811s