4k HDD Partition Alignment Primer

Although HDD storage densities have increased dramatically over the years, one of the most elemental aspects of hard disk drive design, the logical block format size known as a sector, has remained constant. Beginning in late 2009, accelerating in 2010 and hitting mainstream in 2011, all major manufacturers are migrating away from the legacy sector size of 512 bytes to a larger, more efficient sector size of 4096 bytes, generally referred to as 4k or AF (Advanced Format).

While researching the benefits and consequences of the 512→4k transition, many reports of “partition misalignment issues” turned up that could lead to a severe performance impact. This prompted a closer investigation to verify the alleged problem and the proposed correct partition alignment. The result is clear: misaligned partitions on 4k hard disks introduce a severe performance impact, in this test case by a factor of more than 5 (aligned: 83MB/s vs. misaligned: 15.5MB/s).

Test Setup

The following test setup was used to verify the partition misalignment impact introduced by 4k and to evaluate the performance and usability of USB 3.0 when bridging common off-the-shelf low-power (green) 4k SATA hard disks to serve as Apollo's primary/secondary data vault. It also serves well to debunk and demystify common FUD about USB 3.0 and “painfully slow” external storage on cheap 5400rpm eco-friendly hard disk drives.


1) As of Jan 2011 there is only one USB 3.0 controller available, the NEC D720200.
2) The Sharkoon USB 3.0 to SATA II bridge (JMicron chip) has slots for two drives, but so far it has been impossible to use two disks at the same time: as soon as two disks are inserted, Linux sees either a random one or none at all. For all tests only one drive was inserted at a time; the other slot was left empty.
3) There are currently two versions of this model: this 3x667GB-platter version and another (older) one with 4x500GB platters. The 4-platter disk is supposed to be much slower. To identify it, look at the bottom of the drive: its casing is not indented, which leaves more room for the 4th platter.
4) The default firmware of Samsung's HD204UI suffers from a very nasty bug: if the drive receives an “IDENTIFY DEVICE” command while writing data, it sometimes flushes the cache even though the data has not been written to the disk yet, and leaves the operating system completely unaware of that. See SamsungF4EGBadBlocks.
5) USB 3.0 performs pretty well with Linux 2.6.37, but it is a pain with anything older due to power management issues that can freeze the entire system when the USB device is woken up.

Partition Alignment

So why is the 4k sector size such a bad thing that we suddenly have to take care of partition alignment on rotational disks too? In fact, it is not. The culprit is the 512-byte emulation the industry was forced to implement so that some operating systems, such as Microsoft Windows, can handle the disks at all.

Both disk drives in this test have a physical sector size of 4k but present 512-byte sectors to the OS, so performance degrades when the drive's (hidden) internal 4k sector boundaries do not coincide with the 4k logical blocks, clusters and virtual memory pages common in many operating systems and file systems. A single misaligned 4k write then straddles two physical sectors, and the drive is forced to perform a read-modify-write operation on each of them. Recent kernels and user-land tools support disk alignment and would work like a charm if the disk presented itself in native 4k mode.
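To make the straddling effect concrete, here is a minimal sketch that counts how many physical 4k sectors a single 4k write touches, depending on the 512-byte LBA it starts at. It assumes 512-byte emulation with 8 LBAs per physical sector; the function name physical_sectors_touched is made up for this illustration and is not part of any tool.

```shell
#!/bin/sh
# Sketch only: assumes 512-byte emulation, i.e. 8 LBAs per physical 4k sector.
# physical_sectors_touched is a hypothetical helper name.

# Print how many physical 4k sectors a 4 KiB write starting at LBA $1 touches.
physical_sectors_touched() {
    start_lba=$1
    first=$(( start_lba / 8 ))          # physical sector holding the first LBA
    last=$(( (start_lba + 7) / 8 ))     # physical sector holding the last LBA
    echo $(( last - first + 1 ))
}

for lba in 0 8 34 40 42; do
    echo "4k write at LBA $lba touches $(physical_sectors_touched "$lba") physical sector(s)"
done
```

A result of 1 means the write replaces one physical sector completely; a result of 2 means two sectors are only partially overwritten, which is what triggers the read-modify-write cycles.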

The key to misalignment lies in the partition table, which occupies either 512 bytes (LBA 0) for a legacy msdos-type MBR or LBA 0-33 for the primary GUID Partition Table (GPT).

Misalignment will occur by default if the first partition is placed immediately after the partition table, as the next block is LBA 1 for msdos type MBRs and LBA 34 for GPT.

In order to align the 4k logical block with the physical 4k on the platter the sectors following the partition table have to be left empty until a sector is reached that is divisible by 8 (sector 8 for msdos and sector 40 for GPT).
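The rounding rule can be sketched in a few lines of shell arithmetic. This assumes 8 LBAs per physical sector as described above; next_aligned_sector is a hypothetical helper name used only for this example.

```shell
#!/bin/sh
# Sketch only: round the first free LBA up to the next multiple of 8.
# next_aligned_sector is a hypothetical helper name.

next_aligned_sector() {
    first_free=$1
    echo $(( (first_free + 7) / 8 * 8 ))
}

echo "msdos MBR: first free LBA 1  -> start partition at sector $(next_aligned_sector 1)"
echo "GPT:       first free LBA 34 -> start partition at sector $(next_aligned_sector 34)"
```

For the two partition table types discussed here this yields sector 8 and sector 40 respectively.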

Until the situation resolves itself, when the industry finally manages to let go of 512-byte compatibility in favor of native 4k, people need to be aware of this issue. Otherwise they might suffer a heavy performance impact by partitioning disks the way they are used to, without proper alignment.

If you want to test your own hardware to verify these results, go ahead with the next section:

Benchmark Code

The following code was posted by no.op in the gentoo-forums thread “Samsung F4 HD204UI 2TB & best parted alignment”; only the size was changed from 512M to 8G, in order to have a better perspective on the timescale and to avoid hidden caching interference.

#define _FILE_OFFSET_BITS 64 

#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>   /* atoll() */

unsigned char buffer[4096];

int main(int argc, char **argv)
{
   int fd;
   int opt;
   off_t off;
   off_t base = 0;
   off_t stride = sizeof buffer;
   size_t size = 1024 * 1024 * 1024;
   const char *device = NULL;
   int do_sync = 0;

   while ((opt = getopt(argc, argv, "d:b:i:s:S")) != -1) {
      switch (opt) {
      case 'd':
         device = optarg;
         break;
      case 'b':
         base = atoll(optarg) * 512;
         break;
      case 'i':
         stride = atoll(optarg);
         break;
      case 's':
         size = atoll(optarg);
         break;
      case 'S':
         do_sync = 1;
         break;
      default:
         fprintf(stderr, "Usage: part-align-bench "
            "[-b <base sector>] [-i <stride>] "
            "[-s <size>] [-S] -d <block device>\n");
         return 1;
      }
   }
   if (device == NULL) {
      fprintf(stderr, "missing device name argument\n");
      return 1;
   }
   fd = open(device, O_RDWR);
   if (fd < 0) {
      perror("open");
      return 1;
   }


   off = base;
   printf("part-align-bench: %s base sector=%lld "
      "stride=%lld size=%lld %s\n",
      device, base / 512ll, (long long)stride, (long long)size,
      do_sync ? "do_sync" : "no_sync");
   /* write one 4 KiB block per stride; stop before the unsigned size can wrap */
   while (size >= sizeof buffer) {
      if (lseek(fd, off, SEEK_SET) == (off_t)-1) {
         perror("lseek");
         close(fd);
         return 1;
      }
      if (write(fd, buffer, sizeof buffer) != sizeof buffer) {
         perror("write");
         close(fd);
         return 1;
      }
      if (do_sync)
         if (fdatasync(fd) < 0) {
            perror("fdatasync");
            close(fd);
            return 1;
         }
      off += stride;
      size -= sizeof buffer;
   }
   close(fd);
   return 0;
}

Compile

gcc -march=native -O2 -pipe -fomit-frame-pointer -o part-align-bench part-align-bench.c

Run benchmark

The following tests will erase the complete disk. Only continue when you feel confident that you know what you are doing. Take special care to check that your hdd really is /dev/sda or change the -d option according to your own setup!

$ time ./part-align-bench -d /dev/sda -b 0
part-align-bench: /dev/sda base sector=0 stride=4096 size=8589934592 no_sync
 
real	1m20.195s
user	0m0.172s
sys	0m6.973s
$ time ./part-align-bench -d /dev/sda -b 34
part-align-bench: /dev/sda base sector=34 stride=4096 size=8589934592 no_sync
 
real	7m51.229s
user	0m0.660s
sys	0m34.763s
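To turn such timings into throughput figures, a small sketch like the following can help. It assumes the bash `time` format shown above (minutes, seconds, three-digit milliseconds) and uses integer arithmetic only, so results are truncated to whole MB/s (10^6 bytes). The helper names to_ms and mb_per_s are made up for this example.

```shell
#!/bin/sh
# Sketch only: to_ms and mb_per_s are hypothetical helper names.

# Convert a bash `time` value such as 1m20.195s to milliseconds.
to_ms() {
    t=${1%s}            # drop trailing "s"   -> 1m20.195
    min=${t%%m*}        # minutes             -> 1
    rest=${t#*m}        # seconds.millis      -> 20.195
    sec=${rest%%.*}     # whole seconds       -> 20
    frac=${rest#*.}     # three-digit millis  -> 195
    # Prefix "1" so that values like 015 are not parsed as octal.
    echo $(( (min * 60 + sec) * 1000 + 1$frac - 1000 ))
}

# mb_per_s BYTES MILLISECONDS -> whole MB/s, truncated
mb_per_s() {
    echo $(( $1 * 1000 / $2 / 1000000 ))
}

size=8589934592
echo "base sector 0:  $(mb_per_s "$size" "$(to_ms 1m20.195s)") MB/s"
echo "base sector 34: $(mb_per_s "$size" "$(to_ms 7m51.229s)") MB/s"
```

For the two runs above this works out to roughly 107 MB/s aligned versus 18 MB/s misaligned.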

Results

Device    Sector 0   Sector 8   Sector 34   Sector 40   Sector 42
WD        12.545s    12.436s    65.792s     12.211s     66.341s
HD204UI   10.141s    10.153s    59.064s     10.126s     59.010s
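The slowdown factor per drive can be derived from these timings directly. The sketch below compares the sector 0 and sector 34 columns, with the times entered by hand in milliseconds; slowdown is a hypothetical helper name and the result is truncated to two decimals.

```shell
#!/bin/sh
# Sketch only: slowdown is a hypothetical helper name.

# slowdown ALIGNED_MS MISALIGNED_MS -> factor with two decimals, truncated
slowdown() {
    r=$(( $2 * 100 / $1 ))
    printf '%d.%02d\n' $(( r / 100 )) $(( r % 100 ))
}

echo "WD:      $(slowdown 12545 65792)x slower when misaligned"
echo "HD204UI: $(slowdown 10141 59064)x slower when misaligned"
```

With the values from the table this gives 5.24 for the WD and 5.82 for the Samsung drive.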

misaligned cp:

morpheus / # time cp -a /usr /mnt/usb/

real	19m13.331s
user	0m5.086s
sys	1m9.511s
morpheus / # du -sch /usr
13G	/usr
13G	total

Conclusion

Looking at the results, the impact of misaligned partitions is clearly visible:

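WD: about 12.2 to 12.5 s at start sectors 0, 8 and 40, versus about 66 s at sectors 34 and 42.

Samsung: about 10.1 s at start sectors 0, 8 and 40, versus about 59 s at sectors 34 and 42.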

Partitions should begin on sectors that can be divided by 8 (8 sectors == 4 kB internal sector of HDD)

Alignment:

- No partition table: write the filesystem directly to /dev/sda, starting at sector 0.
- GPT partition table, one partition: start it at sector 40.
- GPT partition table, multiple partitions: start the first at sector 40; the following partitions must start at sectors that can be divided by 8.

The msdos partition table/MBR is 512 bytes long, so theoretically sector 8 would be the start sector for the first partition.

Create Partition

create gpt partition table:

$ parted --script /dev/sda mklabel gpt 

align primary partition to sector 40 for best performance:

$ time parted --align=min --script /dev/sda mkpart primary 40s 100%
 
real	0m0.021s
user	0m0.002s
sys	0m0.001s
 
$ parted --script /dev/sda unit s print
Model: WDC WD20 EARS-00MVWB0 (scsi)
Disk /dev/sda: 3907029168s
Sector size (logical/physical): 512B/512B
Partition Table: gpt
 
Number  Start  End          Size         File system  Name     Flags
 1      40s    3907029134s  3907029095s               primary
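An existing partition can also be checked without parted, since the kernel exposes each partition's start sector (in 512-byte units) in sysfs. The following is a minimal sketch; the device names sda and sda1 are assumptions taken from this test setup, and is_4k_aligned and check_partition are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; adjust sda/sda1 to your setup.

# Succeeds if the given 512-byte start sector lies on a 4k boundary.
is_4k_aligned() {
    [ $(( $1 % 8 )) -eq 0 ]
}

# check_partition DISK PARTITION, e.g. check_partition sda sda1
check_partition() {
    start_file=/sys/block/$1/$2/start
    if [ ! -r "$start_file" ]; then
        echo "$2: cannot read $start_file (device not present?)"
        return 0
    fi
    start=$(cat "$start_file")
    if is_4k_aligned "$start"; then
        echo "$2: start sector $start is 4k-aligned"
    else
        echo "$2: start sector $start is NOT 4k-aligned"
    fi
}

check_partition sda sda1
```

For the partition created above, the reported start sector should be 40, which passes the check.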

http://forums.gentoo.org/viewtopic-t-848978.html?sid=ec0b458cb2c449f8a050c7912fa023cf

File system Alignment

time mkfs.ext4 -v -b 4096 -E stride=128,stripe-width=128 -O ^has_journal /dev/sda1 
 
 
mke2fs 1.41.12 (17-May-2010)
fs_types for mke2fs.conf resolution: 'ext4', 'default'
Calling BLKDISCARD from 0 to 2000398893056 failed.
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=128 blocks, Stripe width=128 blocks
122101760 inodes, 488378636 blocks
24418931 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
14905 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848
 
Writing inode tables: done                            
Writing superblocks and filesystem accounting information: done
 
This filesystem will be automatically checked every 39 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
 
real	9m27.444s
user	0m1.696s
sys	0m40.024s
bonnie++ -d /mnt/usb/ -u root 
 
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1203  98 95010   9 43288   7  3663  96 125455  12 161.2   3
Latency              7299us     226ms     227ms   27785us   22094us     237ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               668us     580us    1033us     612us      69us     611us

tuned mount options:


Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1186  98 92841   9 43259   7  3638  96 120936  12 159.7   3
Latency              7389us     226ms     147ms   23610us   25269us     274ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               658us     605us     627us     617us      12us     636us
1.96,1.96,morpheus,1,1295728732,8G,,1186,98,92841,9,43259,7,3638,96,120936,12,159.7,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7389us,226ms,147ms,23610us,25269us,274ms,658us,605us,627us,617us,12us,636us

default mkfs.ext4 and mount with no options

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G   736  98 93696  11 42589   7  3660  97 120755  12 162.5   3
Latency             11837us     226ms     227ms   23548us   23374us     286ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               612us    2088us     657us     602us     334us     620us
1.96,1.96,morpheus,1,1295730229,8G,,736,98,93696,11,42589,7,3660,97,120755,12,162.5,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,11837us,226ms,227ms,23548us,23374us,286ms,612us,2088us,657us,602us,334us,620us

mkfs.ext4 -v -b 4096 -m 0 -E stride=16,stripe-width=32 /dev/sda1

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G   747  98 94455  10 42634   7  3808  97 119933  12 154.1   4
Latency             11611us     226ms     226ms   22920us   26877us     253ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               621us     503us     543us     637us      13us     615us
1.96,1.96,morpheus,1,1295759965,8G,,747,98,94455,10,42634,7,3808,97,119933,12,154.1,4,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,11611us,226ms,226ms,22920us,26877us,253ms,621us,503us,543us,637us,13us,615us

Largefile support

Note: mkfs is very fast with these options (about 19 s instead of about 9.5 min for the mkfs.ext4 run above).

morpheus / # time mkfs.ext4 -v -m 0 -T largefile4 -O ^has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize /dev/sda1 
mke2fs 1.41.12 (17-May-2010)
fs_types for mke2fs.conf resolution: 'ext4', 'largefile4'
Calling BLKDISCARD from 0 to 2000398893056 failed.
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
476960 inodes, 488378636 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
14905 block groups
32768 blocks per group, 32768 fragments per group
32 inodes per group
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
        102400000, 214990848

Writing inode tables: done                            
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 33 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

real    0m19.368s
user    0m0.089s
sys     0m0.456s

mount -o noatime,data=writeback,barrier=0 /dev/sda1 /mnt/usb/

wd:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1140  98 94574   9 43097   7  3642  98 119887  12 158.3   3
Latency              7905us     226ms     227ms   13808us   16982us     217ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               675us     453us     617us     601us      25us     697us
1.96,1.96,morpheus,1,1295758867,8G,,1140,98,94574,9,43097,7,3642,98,119887,12,158.3,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7905us,226ms,227ms,13808us,16982us,217ms,675us,453us,617us,601us,25us,697us

samsung:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1198  98 106494  11 52066   8  3744  96 132966  13 140.7   2
Latency              7266us     128ms     126ms   31119us   31659us     313ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               601us     409us     541us      97us      64us      82us
1.96,1.96,morpheus,1,1295814353,8G,,1198,98,106494,11,52066,8,3744,96,132966,13,140.7,2,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7266us,128ms,126ms,31119us,31659us,313ms,601us,409us,541us,97us,64us,82us

USB Mass Storage Tuning

Linux 2.6 gives you the ability to see and to change the max_sectors value for each USB storage device, independently. Assuming you have a sysfs filesystem mounted on /sys and assuming /dev/sda is a USB drive, you can see the max_sectors value for /dev/sda simply by running:

$ cat /sys/block/sda/device/max_sectors

and you can set max_sectors to 2048 by running (as root):

# echo 2048 > /sys/block/sda/device/max_sectors

Values should be positive multiples of 8 (16 on the Alpha and other 64-bit platforms). There is no upper limit, but you probably shouldn't make max_sectors much bigger than 2048 (corresponding to 1 MB, which is quite a lot).
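Since max_sectors counts 512-byte sectors, the resulting transfer size is easy to work out before writing a value. The sketch below converts a few sample values to KiB and applies the multiple-of-8 rule; the sample values and the helper names max_sectors_to_kib and valid_max_sectors are assumptions for illustration only, and nothing is written to sysfs.

```shell
#!/bin/sh
# Sketch only: helper names and sample values are hypothetical.

# max_sectors counts 512-byte sectors; print the transfer size in KiB.
max_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# Accept only positive multiples of 8, as recommended above.
valid_max_sectors() {
    [ "$1" -gt 0 ] && [ $(( $1 % 8 )) -eq 0 ]
}

for value in 128 1024 2048 2050; do
    if valid_max_sectors "$value"; then
        echo "max_sectors=$value -> $(max_sectors_to_kib "$value") KiB per transfer"
    else
        echo "max_sectors=$value rejected: not a positive multiple of 8"
    fi
done
```

The value 2048 used in this test corresponds to 1024 KiB, i.e. the 1 MB mentioned above.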

wd:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1174  98 104857  10 45213   6  3613  97 122696  10 160.4   2
Latency              7457us     226ms     226ms   25125us   14871us     264ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               631us     412us     656us     507us      95us      82us
1.96,1.96,morpheus,1,1295757422,8G,,1174,98,104857,10,45213,6,3613,97,122696,10,160.4,2,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7457us,226ms,226ms,25125us,14871us,264ms,631us,412us,656us,507us,95us,82us

samsung:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
morpheus         8G  1211  98 114716  11 55022   7  3741  97 141067  12 141.2   3
Latency              7277us     127ms     126ms   25340us   29201us     258ms
Version  1.96       ------Sequential Create------ --------Random Create--------
morpheus            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               533us     415us     602us     644us      58us     580us
1.96,1.96,morpheus,1,1295812838,8G,,1211,98,114716,11,55022,7,3741,97,141067,12,141.2,3,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7277us,127ms,126ms,25340us,29201us,258ms,533us,415us,602us,644us,58us,580us

USB 3.0 Performance

samsung:

write:

$ time dd if=/dev/zero bs=4096 count=10000000 of=/mnt/usb/40Gb.file
10000000+0 records in
10000000+0 records out
40960000000 bytes (41 GB) copied, 369.997 s, 111 MB/s
 
real	6m10.016s
user	0m1.317s

read:

$ time dd if=/mnt/usb/40Gb.file bs=64k of=/dev/null
625000+0 records in
625000+0 records out
40960000000 bytes (41 GB) copied, 312.399 s, 131 MB/s
 
real	5m12.446s
user	0m0.252s
sys	0m34.811s