====== 4k HDD Partition Alignment Primer ======
Although HDD storage densities have increased dramatically over the years, one of the most elemental aspects of hard disk drive design, the logical block format size known as a sector, has remained constant. Beginning in late 2009, accelerating in 2010 and hitting mainstream in 2011, all major manufacturers are migrating away from the legacy sector size of 512 bytes to a larger, more efficient sector size of 4096 bytes, generally referred to as 4k or AF (Advanced Format).
While researching the benefits and consequences of the 512->4k transition, many reports of "partition misalignment issues" with a potentially severe performance impact turned up. This led to a closer investigation to verify the alleged problem and the proposed correct partition alignment. The result is clear: misaligned partitions on 4k hard disks carry a severe performance penalty, in this test case a factor of more than 5 (aligned: 83MB/s vs. misaligned: 15.5MB/s).
===== Test Setup =====
The following test setup was used to verify the partition misalignment impact introduced by 4k drives and to evaluate the performance and usability of USB 3.0 as a bridge for common off-the-shelf low-power (green) 4k SATA hard disks that serve as Apollo's primary/secondary data vault. It also helps to debunk and demystify common FUD about USB 3.0 and "painfully slow" external storage on cheap 5400rpm eco-friendly hard disk drives.
  * MS-Tech LP-06U USB 3.0 PCIe (1x) Controller (NEC Chip)[(As of Jan 2011 there is only one USB 3.0 controller available, the NEC D720200.)]
  * Sharkoon SATA QuickPort Duo USB 3.0 V2[(The Sharkoon USB 3.0 to SATA II bridge (JMicron chip) has two slots for two drives. So far it is impossible to use two disks at the same time: as soon as two disks are inserted, Linux sees either a random one or none at all. For all tests only one drive was inserted at a time; the other slot was left empty.)]
  * Western Digital Caviar Green 2TB Desktop (WD20EARS-00MVWB0)[(There are currently two versions of this model: this 3x667GB platter version and an older one with 4x500GB platters. The 4-platter disk is supposed to be much slower. To identify it, look at the bottom: its casing is not indented, which leaves more room for the 4th platter.)]
    * Platters: 3x 667GB
    * Cache: 64MB
    * Speed: Dynamic 5400 - 7200 RPM (IntelliPower)
    * Max Power Rating: 5V@0.7A / 12V@0.55A
    * Weight: 0.730 kg
    * Production Date: Sep 2010
    * Made in: Malaysia
  * Samsung SpinPoint EcoGreen F4 2TB (HD204UI)[(The default firmware of Samsung's HD204UI suffers from a very nasty bug: if the drive receives an "IDENTIFY DEVICE" command while writing data, it sometimes flushes the cache even though the data has not been written to the disk yet, and leaves the operating system completely unaware of that. [[http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks|SamsungF4EGBadBlocks]])]
    * Platters: 3x 667GB
    * Cache: 32MB
    * Speed: 5400 RPM
    * Max Power Rating: 5V@0.85A / 12V@0.5A
    * Weight: 0.650 kg
    * Production Date: Dec 2010
    * Made in: China
  * Gentoo-Kernel 2.6.37 (non-genkernel)[(USB 3.0 performs pretty well with 2.6.37, but it is a pain with anything older due to power management issues that can freeze the entire system when the USB device is woken up.)]
~~REFNOTES~~
===== Partition Alignment =====
So why is a 4k sector size such a bad thing that partition alignment suddenly matters for rotational disks too? In fact, it is not. The culprit is the 512-byte emulation the industry was forced to implement so that some operating systems, such as Microsoft Windows, can handle the disks at all.
Both disk drives in this test have a physical sector size of 4k but present 512-byte sectors to the OS. Performance degrades when the drive's (hidden) internal 4k sector boundaries do not coincide with the 4k logical blocks, clusters and virtual memory pages common in many operating systems and file systems: a single misaligned 4k write then straddles two physical sectors and forces the drive to perform two read-modify-write operations. Recent kernels and user-land tools support disk alignment and would work like a charm if the disk presented itself in native 4k mode.
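To make the read-modify-write penalty concrete, the following sketch counts how many physical 4k sectors a write touches, given its starting LBA and its length in 512-byte sectors. The function name ''physical_sectors_touched'' is made up for illustration, and the sketch assumes that physical sector boundaries fall on LBAs divisible by 8.

```shell
# physical_sectors_touched START_LBA LENGTH
# START_LBA and LENGTH are in 512-byte sectors; LENGTH must be at least 1.
# Assumption: physical 4k sectors start at LBAs divisible by 8.
physical_sectors_touched() {
    first=$(( $1 / 8 ))
    last=$(( ($1 + $2 - 1) / 8 ))
    echo $(( last - first + 1 ))
}

physical_sectors_touched 40 8   # aligned 4k write, prints 1
physical_sectors_touched 34 8   # misaligned 4k write, prints 2
```

A physical sector that is only partially covered cannot simply be overwritten: the drive has to read it, merge in the new data and write it back, which is why the misaligned case costs two read-modify-write cycles.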
The key to misalignment lies in the partition table, which occupies either 512 bytes (LBA 0) for a legacy msdos-type MBR or LBA 0-33 for the primary GUID Partition Table (GPT).
**Misalignment occurs by default if the first partition is placed immediately after the partition table, as the next block is LBA 1 for msdos-type MBRs and LBA 34 for GPT**.
To align the 4k logical blocks with the physical 4k sectors on the platter, the sectors following the partition table have to be left empty until a sector number divisible by 8 is reached (sector 8 for msdos and sector 40 for GPT).
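This rule can be sketched as a small shell function that rounds a sector number up to the next multiple of 8. The name ''next_aligned_sector'' and its optional second argument are illustrative assumptions, not part of any partitioning tool.

```shell
# next_aligned_sector FIRST_FREE_SECTOR [SECTORS_PER_PHYSICAL_BLOCK]
# Rounds FIRST_FREE_SECTOR up to the next multiple of the block size
# (default 8, i.e. eight 512-byte sectors per 4k physical sector).
next_aligned_sector() {
    per_block=${2:-8}
    echo $(( ($1 + per_block - 1) / per_block * per_block ))
}

next_aligned_sector 1    # first free sector after an msdos MBR, prints 8
next_aligned_sector 34   # first free sector after a primary GPT, prints 40
```

A sector that is already aligned is returned unchanged, so the function can also be applied to the start of every following partition.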
Until the industry finally lets go of 512-byte compatibility in favor of native 4k, people need to be aware of this issue. Otherwise, partitioning disks the way they were used to, without proper alignment, may cost them a lot of performance.
If you want to test your own hardware to verify these results, go ahead with the next section:
==== Benchmark Code ====
The following code was posted by no.op on the Gentoo forums in [[http://forums.gentoo.org/viewtopic-t-848978.html|Samsung F4 HD204UI 2TB & best parted alignment]]. Only the size was changed from the original 512M (the listing defaults to 1G, the sample run further down used 8G), in order to have a better perspective on the timescale and to avoid hidden caching interference.
<code c>
/* Make off_t 64 bits wide so offsets beyond 2 GiB work on 32-bit systems. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* One zero-filled 4 KiB block that is written over and over. */
unsigned char buffer[4096];

int main(int argc, char **argv)
{
    int fd;
    int opt;
    off_t off;
    off_t base = 0;                   /* start offset in bytes */
    off_t stride = sizeof buffer;     /* distance between two writes in bytes */
    size_t size = 1024 * 1024 * 1024; /* total number of bytes to write */
    const char *device = NULL;
    int do_sync = 0;

    while ((opt = getopt(argc, argv, "d:b:i:s:S")) != -1) {
        switch (opt) {
        case 'd':
            device = optarg;
            break;
        case 'b':
            /* -b is given in 512-byte sectors */
            base = atoll(optarg) * 512;
            break;
        case 'i':
            stride = atoll(optarg);
            break;
        case 's':
            size = atoll(optarg);
            break;
        case 'S':
            do_sync = 1;
            break;
        default:
            fprintf(stderr, "Usage: part-align-bench "
                    "[-b <base sector>] [-i <stride>] "
                    "[-s <size>] [-S] -d <device>\n");
            return 1;
        }
    }

    if (device == NULL) {
        fprintf(stderr, "missing device name argument\n");
        return 1;
    }

    fd = open(device, O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off = base;
    printf("part-align-bench: %s base sector=%lld "
           "stride=%lld size=%lld %s\n",
           device, (long long)(base / 512), (long long)stride,
           (long long)size, do_sync ? "do_sync" : "no_sync");

    /* Stop once less than one full block is left, so that the unsigned
       size counter cannot wrap around. */
    while (size >= sizeof buffer) {
        if (lseek(fd, off, SEEK_SET) == (off_t)-1) {
            perror("lseek");
            close(fd);
            return 1;
        }
        if (write(fd, buffer, sizeof buffer) != (ssize_t)sizeof buffer) {
            perror("write");
            close(fd);
            return 1;
        }
        if (do_sync) {
            if (fdatasync(fd) < 0) {
                perror("fdatasync");
                close(fd);
                return 1;
            }
        }
        off += stride;
        size -= sizeof buffer;
    }

    close(fd);
    return 0;
}
</code>
==== Compile ====
<code>
gcc -march=native -O2 -pipe -fomit-frame-pointer -o part-align-bench part-align-bench.c
</code>
==== Run benchmark ====
The following tests **will erase the complete disk**.
Only continue if you are confident that you know what you are doing. Take special care to check that your HDD really is /dev/sda, or change the -d option according to your own setup!
<code>
$ time ./part-align-bench -d /dev/sda -b 0
part-align-bench: /dev/sda base sector=0 stride=4096 size=8589934592 no_sync
real    1m20.195s
user    0m0.172s
sys     0m6.973s

$ time ./part-align-bench -d /dev/sda -b 34
part-align-bench: /dev/sda base sector=34 stride=4096 size=8589934592 no_sync
real    7m51.229s
user    0m0.660s
sys     0m34.763s
</code>
==== Results ====
^Device^Sector 0^Sector 8^Sector 34^Sector 40^Sector 42^
|WD |12.545s|12.436s|65.792s|12.211s|66.341s|
|HD204UI|10.141s|10.153s|59.064s|10.126s|59.010s|
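A sweep over several start sectors, like the one in the table, can be prepared with a small helper. The sketch below only prints the commands instead of running them, because every run overwrites the device. The helper name ''print_bench_commands'' and the placeholder device /dev/sdX are assumptions; the options are those of part-align-bench from the listing.

```shell
# print_bench_commands DEVICE SECTOR...
# Dry run: prints one part-align-bench invocation per start sector.
print_bench_commands() {
    device=$1
    shift
    for sector in "$@"; do
        echo "time ./part-align-bench -d $device -b $sector"
    done
}

print_bench_commands /dev/sdX 0 8 34 40 42
```

Review the printed lines and run them by hand once you are sure the device name is the right one.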
Misaligned cp:
<code>
morpheus / # time cp -a /usr /mnt/usb/
real    19m13.331s
user    0m5.086s
sys     1m9.511s

morpheus / # du -sch /usr
13G     /usr
13G     total
</code>
==== Conclusion ====
Looking at the results, the impact of misaligned partitions is clearly visible:

**WD:**
  * aligned: 83MB/s
  * misaligned: 15.5MB/s

**Samsung:**
  * aligned: 102MB/s
  * misaligned: 17MB/s
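These throughput figures follow from dividing the amount of data written by the elapsed time. The sketch below redoes that arithmetic with awk. It assumes that the runs in the results table wrote the 1 GiB default from the listing and that the figures are meant as MiB/s; neither is stated explicitly. The helper name ''mib_per_sec'' is made up for illustration.

```shell
# mib_per_sec BYTES SECONDS
# Prints the throughput in MiB/s with one decimal place.
mib_per_sec() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", bytes / secs / 1048576 }'
}

mib_per_sec 1073741824 12.545   # WD, start sector 0:  prints 81.6
mib_per_sec 1073741824 65.792   # WD, start sector 34: prints 15.6
```

Under these assumptions the values land close to the 83MB/s and 15.5MB/s quoted for the WD drive.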
Partitions should begin on sectors whose number is divisible by 8 (8 sectors of 512 bytes == one 4 kB internal sector of the HDD).

Alignment:
  * No partition table: write the filesystem directly to /dev/sda, starting at sector 0.
  * GPT partition table, one partition: start it at sector 40.
  * GPT partition table, multiple partitions: start the first at sector 40; the following partitions must start at sectors divisible by 8.
  * msdos partition table: the MBR is 512 bytes long, so theoretically sector 8 would be the start sector for the first partition.
==== Create Partition ====
Create a GPT partition table:
<code>
$ parted --script /dev/sda mklabel gpt
</code>
Align the primary partition to sector 40 for best performance:
<code>
$ time parted --align=min --script /dev/sda mkpart primary 40s 100%
real    0m0.021s
user    0m0.002s
sys     0m0.001s

$ parted --script /dev/sda unit s print
Model: WDC WD20 EARS-00MVWB0 (scsi)
Disk /dev/sda: 3907029168s
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number  Start  End          Size         File system  Name     Flags
 1      40s    3907029134s  3907029095s               primary
</code>
[[http://forums.gentoo.org/viewtopic-t-848978.html?sid=ec0b458cb2c449f8a050c7912fa023cf]]
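To check whether an existing partition is aligned, its start sector can be read back and tested for divisibility by 8. The sketch below assumes that the kernel exposes the start sector, in 512-byte units, in a sysfs file such as /sys/block/sda/sda1/start; the function name ''start_is_aligned'' is hypothetical.

```shell
# start_is_aligned FILE
# FILE holds a partition's start sector in 512-byte units,
# e.g. /sys/block/sda/sda1/start (assumed path, adjust to your device).
# Succeeds if the start sector is divisible by 8.
start_is_aligned() {
    start=$(cat "$1") || return 2
    [ $(( start % 8 )) -eq 0 ]
}

if [ -r /sys/block/sda/sda1/start ]; then
    if start_is_aligned /sys/block/sda/sda1/start; then
        echo "sda1 is 4k-aligned"
    else
        echo "sda1 is NOT 4k-aligned"
    fi
fi
```

For the partition created above, the file would contain 40, so the check succeeds.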
===== File system Alignment =====
<code>
time mkfs.ext4 -v -b 4096 -E stride=128,stripe-width=128 -O ^has_journal /dev/sda1
mke2fs 1.41.12 (17-May-2010)
fs_types for mke2fs.conf resolution: 'ext4', 'default'
Calling BLKDISCARD from 0 to 2000398893056 failed.
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=128 blocks, Stripe width=128 blocks
122101760 inodes, 488378636 blocks
24418931 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
14905 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 39 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

real    9m27.444s
user    0m1.696s
sys     0m40.024s
</code>
<code>
bonnie++ -d /mnt/usb/ -u root
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1203  98 95010   9 43288   7  3663  96 125455  12 161.2   3
Latency              7299us     226ms     227ms   27785us   22094us     237ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               668us     580us    1033us     612us      69us     611us
</code>
Tuned mount options:
<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1186  98 92841   9 43259   7  3638  96 120936  12 159.7   3
Latency              7389us     226ms     147ms   23610us   25269us     274ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               658us     605us     627us     617us      12us     636us
1.96,1.96,morpheus,1,1295728732,8G,,1186,98,92841,9,43259,7,3638,96,120936,12,159.7,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,++
s,226ms,147ms,23610us,25269us,274ms,658us,605us,627us,617us,12us,636us
</code>
Default mkfs.ext4 and mount with no options:
<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G   736  98 93696  11 42589   7  3660  97 120755  12 162.5   3
Latency             11837us     226ms     227ms   23548us   23374us     286ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               612us    2088us     657us     602us     334us     620us
1.96,1.96,morpheus,1,1295730229,8G,,736,98,93696,11,42589,7,3660,97,120755,12,162.5,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,11837us,226ms,227ms,23548us,23374us,286ms,612us,2088us,657us,602us,334us,620us
</code>
<code>
mkfs.ext4 -v -b 4096 -m 0 -E stride=16,stripe-width=32 /dev/sda1
</code>
<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G   747  98 94455  10 42634   7  3808  97 119933  12 154.1   4
Latency             11611us     226ms     226ms   22920us   26877us     253ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               621us     503us     543us     637us      13us     615us
1.96,1.96,morpheus,1,1295759965,8G,,747,98,94455,10,42634,7,3808,97,119933,12,154.1,4,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,11611us,226ms,226ms,22920us,26877us,253ms,621us,503us,543us,637us,13us,615us
</code>
Largefile support (very fast mkfs):
<code>
morpheus / # time mkfs.ext4 -v -m 0 -T largefile4 -O ^has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize /dev/sda1
mke2fs 1.41.12 (17-May-2010)
fs_types for mke2fs.conf resolution: 'ext4', 'largefile4'
Calling BLKDISCARD from 0 to 2000398893056 failed.
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
476960 inodes, 488378636 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
14905 block groups
32768 blocks per group, 32768 fragments per group
32 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 33 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

real    0m19.368s
user    0m0.089s
sys     0m0.456s
</code>
<code>
mount -o noatime,data=writeback,barrier=0 /dev/sda1 /mnt/usb/
</code>
WD:
<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1140  98 94574   9 43097   7  3642  98 119887  12 158.3   3
Latency              7905us     226ms     227ms   13808us   16982us     217ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               675us     453us     617us     601us      25us     697us
1.96,1.96,morpheus,1,1295758867,8G,,1140,98,94574,9,43097,7,3642,98,119887,12,158.3,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7905us,226ms,227ms,13808us,16982us,217ms,675us,453us,617us,601us,25us,697us
</code>
Samsung:
<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1198  98 106494  11 52066   8  3744  96 132966  13 140.7   2
Latency              7266us     128ms     126ms   31119us   31659us     313ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               601us     409us     541us      97us      64us      82us
1.96,1.96,morpheus,1,1295814353,8G,,1198,98,106494,11,52066,8,3744,96,132966,13,140.7,2,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7266us,128ms,126ms,31119us,31659us,313ms,601us,409us,541us,97us,64us,82us
</code>
==== USB Mass Storage Tuning ====
Linux 2.6 lets you view and change the max_sectors value for each USB storage device independently. Assuming sysfs is mounted on /sys and /dev/sda is a USB drive, you can see the max_sectors value for /dev/sda by running:
<code>
$ cat /sys/block/sda/device/max_sectors
</code>
and you can set max_sectors to 2048 by running (as root):
<code>
# echo 2048 > /sys/block/sda/device/max_sectors
</code>
Values should be positive multiples of 8 (16 on the Alpha and other 64-bit platforms). There is no upper limit, but you probably shouldn't make max_sectors much bigger than 2048 (corresponding to 1 MB, which is quite a lot).
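The relation between max_sectors and the transfer size is plain arithmetic, since the value counts 512-byte sectors. A small sketch with a made-up helper name, ''max_sectors_to_kib'', that also rejects values that are not positive multiples of 8:

```shell
# max_sectors_to_kib VALUE
# Prints the transfer size in KiB for a max_sectors value given in
# 512-byte sectors; fails if VALUE is not a positive multiple of 8.
max_sectors_to_kib() {
    if [ "$1" -le 0 ] || [ $(( $1 % 8 )) -ne 0 ]; then
        echo "max_sectors must be a positive multiple of 8" >&2
        return 1
    fi
    echo $(( $1 * 512 / 1024 ))
}

max_sectors_to_kib 2048   # prints 1024, i.e. 1 MB
```

The helper only does the conversion; writing the value to sysfs still has to be done as root as shown above.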
WD:
<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1174  98 104857  10 45213   6  3613  97 122696  10 160.4   2
Latency              7457us     226ms     226ms   25125us   14871us     264ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               631us     412us     656us     507us      95us      82us
1.96,1.96,morpheus,1,1295757422,8G,,1174,98,104857,10,45213,6,3613,97,122696,10,160.4,2,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7457us,226ms,226ms,25125us,14871us,264ms,631us,412us,656us,507us,95us,82us
</code>
Samsung:
<code>
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1211  98 114716  11 55022   7  3741  97 141067  12 141.2   3
Latency              7277us     127ms     126ms   25340us   29201us     258ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               533us     415us     602us     644us      58us     580us
1.96,1.96,morpheus,1,1295812838,8G,,1211,98,114716,11,55022,7,3741,97,141067,12,141.2,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7277us,127ms,126ms,25340us,29201us,258ms,533us,415us,602us,644us,58us,580us
</code>
===== USB 3.0 Performance =====
Samsung, write:
<code>
$ time dd if=/dev/zero bs=4096 count=10000000 of=/mnt/usb/40Gb.file
10000000+0 records in
10000000+0 records out
40960000000 bytes (41 GB) copied, 369.997 s, 111 MB/s
real    6m10.016s
user    0m1.317s
</code>
Samsung, read:
<code>
$ time dd if=/mnt/usb/40Gb.file bs=64k of=/dev/null
625000+0 records in
625000+0 records out
40960000000 bytes (41 GB) copied, 312.399 s, 131 MB/s
real    5m12.446s
user    0m0.252s
sys     0m34.811s
</code>
{{tag>linux 4k partition benchmark hdd usb test}}
~~DISCUSSION~~