Fifty members of the Linux storage and file system communities met
February 12 and 13 in San Jose,
California, to give status updates, present new ideas, and discuss issues during
the 2007 Linux Storage and File Systems Workshop. The workshop was chaired
by Ric Wheeler and sponsored by EMC, NetApp, Panasas, Seagate and Oracle.
Day 1: Joint Session
Ric Wheeler opened with an explanation of the basic contract that storage
systems make with the user: the complete set of data will be
stored, bytes are correct and in order, and raw capacity is utilized as
completely as possible. The contract is so simple that it seems there should be no
open issues, right?
Today, this contract is met most of the time, but Ric posed a number of
questions. How do we validate that no files have been lost? How do we
verify that the
bytes are correctly stored? How can we utilize disks efficiently for small
files? How do errors get communicated between the layers?
Over the course of the next two days, some of these questions were discussed, new
ones were raised, and a few ideas were proposed. Continue reading for the details.
Ext4 Status Update
Mingming Cao gave a status update on ext4, the recent fork of the ext3 file
system. The primary goal of the fork was the move to 48-bit block numbers;
this change allows the file system to support up to 1024 petabytes (2^48 4KB blocks) of storage.
This feature was originally designed to be merged into ext3, but was seen as too disruptive. The patch is also
built on top of the new extents code. Support for more than 32,000 subdirectories
per directory will also be merged into ext4.
On top of these changes, a number of existing ext3 options will be enabled by default,
including: directory indexing, which improves file access in large directories;
"resize inodes", which reserve space in the block group descriptor table for online
growing; and 256-byte inodes. Ext3 users can use these features today with
a command like:
mkfs.ext3 -I 256 -O resize_inode,dir_index /dev/device
A number of other features are also being considered for inclusion into ext4
and have been posted on the list as RFCs.
These include a patch to add nanosecond
timestamps and one to implement persistent
file preallocation, which will be similar to posix_fallocate() but won't waste
time writing zeros to the disk.
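For reference, the user-space interface the feature mirrors is posix_fallocate();
a minimal sketch of its use follows (the file name and size are made up for
illustration):

    /* Minimal sketch: reserve 100MB for a file up front.  On file systems
     * without native preallocation support, glibc's posix_fallocate() falls
     * back to writing zeros, which is exactly the overhead the proposed
     * ext4 feature avoids. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int fd = open("bigfile", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* Ask for 100MB of space, guaranteed to be available for later
         * writes without risk of ENOSPC. */
        int err = posix_fallocate(fd, 0, 100 * 1024 * 1024);
        if (err != 0)
            fprintf(stderr, "posix_fallocate: error %d\n", err);

        return err ? EXIT_FAILURE : EXIT_SUCCESS;
    }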
Ext4 currently stores a limited number of extended attributes in-inode and has
space for one additional block of extended attribute data, but this may not be
enough to satisfy xattr-hungry applications. For example, Samba needs
additional space to support Vista's heavy use of ACLs, and eCryptFS can store
arbitrarily large keys in extended attributes. This led to the
conclusion that data on how extended attributes are actually being used needs to be
collected to help developers decide how best to implement them. Until larger extended
attributes are supported, application developers need to pay attention to the limits
on current file systems: one block on ext3 and 64KB on XFS, for example.
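For developers checking where those limits bite, the user-space API itself is
simple; a minimal sketch (the file and attribute names are made up for
illustration):

    /* Minimal sketch of the extended attribute API; the "user.comment"
     * attribute and target file are made up.  A value that exceeds the
     * file system's limit fails with ENOSPC (or E2BIG for very large
     * values). */
    #include <sys/xattr.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *value = "a small attribute value";

        if (setxattr("somefile", "user.comment", value, strlen(value), 0) != 0)
            perror("setxattr");

        char buf[256];
        ssize_t len = getxattr("somefile", "user.comment", buf, sizeof(buf));
        if (len < 0)
            perror("getxattr");
        else
            printf("user.comment = %.*s\n", (int)len, buf);

        return 0;
    }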
Online shrinking and growing were briefly discussed, and it was suggested that
online defragmentation, which is a planned feature, will be the first step
toward online shrinking. A bigger issue, however, is storage management; Ted
Ts'o suggested that the Linux file system community could learn from ZFS about how
to create easy-to-manage systems. Christoph Hellwig sees disk management
as a user-space problem that can be solved with kernel hooks, and
regards ZFS as a layering violation. Either way, it is clear that disk management
on Linux should be improved.
The fsck Problem
Zach Brown and Valerie Henson were slated to speak on the topic of file system
repair. While Val booted her laptop, she introduced us to the latest fashion:
laptop rhinestones, a great discussion piece if you are waiting on an fsck. If
Val's estimates for fsck time in 2013 come true, having a way to pass the time
will become very important.
Val presented an estimate of 2013 fsck times. She first measured an fsck of her
37GB /home partition (with 21GB in use), which took 7.5 minutes and read 1.3GB of
file system data. Next, she used projections of disk technology from Seagate to
estimate the time to fsck a circa-2013 home partition, which will be 16 times larger.
Although 2013 disks will have a five-fold bandwidth increase, seek times will
only improve about 1.2 times (to 10ms) leading to an increase in fsck time from about 8
minutes to 80 minutes! The primary reason for long fscks is seek latency, since
fsck spends most of its time seeking over the disk discovering and fetching
dynamic file system data like directory entries, indirect blocks and extents.
Reducing seeks and avoiding the associated latency penalty is the key to reducing fsck
times. Val suggested that one solution would be to keep an on-disk bitmap
tracking which blocks contain file system metadata; in the best case, fsck could
then read all of that metadata in a single sequential sweep over the disk. On the
projected 2013 disk, such a sweep would take only around 134 seconds, a large
improvement over 80 minutes. A full explanation of the findings and possible
solutions can be found in the paper Repair-Driven File System
Design [PDF]. Also, Val announced that she is working full time on a file system
called chunkfs
[PDF]
that will make speed and ease of repair a primary design goal.
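As a rough cross-check of Val's projection, the sketch below reproduces the
sequential-sweep estimate; the assumed 2007 baseline streaming rate of 32MB/s is
an illustration, not a figure from the talk:

    /* Back-of-the-envelope sketch of the "single sweep" estimate above.
     * The 2007 streaming rate of ~32MB/s is an assumption used for
     * illustration; the 16x capacity and 5x bandwidth growth figures are
     * the ones quoted in the talk. */
    #include <stdio.h>

    int main(void)
    {
        double metadata_2007_mb = 1.3 * 1024;  /* ~1.3GB of metadata read today */
        double capacity_growth  = 16.0;        /* 2013 partition is 16x larger  */
        double stream_2007_mb_s = 32.0;        /* assumed 2007 streaming rate   */
        double bandwidth_growth = 5.0;         /* bandwidth improves 5x by 2013 */

        double sweep_seconds = (metadata_2007_mb * capacity_growth) /
                               (stream_2007_mb_s * bandwidth_growth);

        /* Prints roughly 133 seconds, in line with the ~134s figure above. */
        printf("2013 sequential metadata sweep: ~%.0f seconds\n", sweep_seconds);
        return 0;
    }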
Zach Brown presented some blktrace output from e2fsck. The trace shows
that, while the disk can stream data at 26MB/s, fsck achieves only 12MB/s.
This situation could be improved to some degree, without on-disk layout changes,
if the developers had a vectorized I/O call. Zach explained that fsck often
knows the locations of the blocks it needs, but with the current API it can
only read them one at a time.
A vectorized read would take a number of buffers and a list of blocks to read
as arguments. Then the application could submit all of the reads at once.
Such a system call could save a significant amount of time since the I/O
scheduler can reorder requests to minimize seeks and merge requests that are
nearby. Also, reads to blocks that are located on different disks could be
parallelized. Although a vectorized read could speed up fsck, eventually
file system layout changes will be needed to make it faster still.
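No such vectorized read call exists; something close to it can be approximated
with the Linux asynchronous I/O interface (libaio), as in this rough sketch.
The file name and block offsets are made up, and the program must be linked
with -laio:

    /* Rough sketch of batching reads at scattered offsets with Linux AIO,
     * an approximation of the vectorized read call discussed above. */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NR_READS 3
    #define BLOCK_SIZE 4096

    int main(void)
    {
        /* Made-up block locations that a tool like fsck might already know. */
        long long offsets[NR_READS] = { 0, 1 << 20, 8 << 20 };
        struct iocb cbs[NR_READS], *cbp[NR_READS];
        struct io_event events[NR_READS];
        io_context_t ctx = 0;
        int fd = open("testfile", O_RDONLY | O_DIRECT);

        if (fd < 0 || io_setup(NR_READS, &ctx) < 0) {
            fprintf(stderr, "setup failed\n");
            return EXIT_FAILURE;
        }

        for (int i = 0; i < NR_READS; i++) {
            void *buf;
            if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0)
                return EXIT_FAILURE;
            /* Queue one read per known block location. */
            io_prep_pread(&cbs[i], fd, buf, BLOCK_SIZE, offsets[i]);
            cbp[i] = &cbs[i];
        }

        /* Submit every read in one call; the I/O scheduler can now sort
         * and merge the requests to minimize seeks. */
        if (io_submit(ctx, NR_READS, cbp) != NR_READS) {
            fprintf(stderr, "io_submit failed\n");
            return EXIT_FAILURE;
        }

        printf("%d reads completed\n",
               io_getevents(ctx, NR_READS, NR_READS, events, NULL));
        io_destroy(ctx);
        return EXIT_SUCCESS;
    }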
libata: bringing the ATA community together
Jeff Garzik gave an update on the progress of libata, the in-kernel library to
support ATA hosts and devices. He first presented the ATAPI/SATA features that
libata now supports, including PATA with C/H/S addressing, NCQ, FUA, SCSI/ATA
translation (SAT), and CompactFlash. The growing support for parallel ATA (PATA)
drives in libata will eventually allow the old IDE driver to be deprecated;
Fedora developers are helping to accelerate testing and adoption of the libata
PATA code by disabling the IDE driver in Fedora 7 test 1.
Native Command Queuing (NCQ) is a new command protocol introduced in the SATA
II extensions and is now supported under libata. With NCQ the host can have
multiple outstanding requests on the drive at once. The drive can reorder and
reschedule these requests to improve disk performance. A useful feature of NCQ
drives is the force unit access (FUA) bit: a write command with this bit set will
not return success until the data has reached the media. This makes it possible
for the kernel to have both synchronous and
non-synchronous commands in flight at the same time. There was a recent discussion
about both NCQ FUA and SATA FUA in libata on the linux-ide mailing list.
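From user space, the usual way to ask for these semantics is a synchronous write;
on FUA-capable hardware the kernel can, in favorable cases, satisfy the request
with a FUA-tagged write rather than a full write-cache flush. A minimal sketch
(the file name is made up for illustration):

    /* Minimal sketch: an O_DSYNC write does not return until the data is on
     * stable storage.  On a drive that supports FUA the kernel can, in
     * favorable cases, implement this with a FUA-tagged write instead of a
     * full cache flush. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("journal.dat", O_CREAT | O_WRONLY | O_DSYNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        const char *record = "commit record\n";
        if (write(fd, record, strlen(record)) < 0)
            perror("write");

        close(fd);
        return 0;
    }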
Jeff briefly discussed libata's support for SCSI ATA translation (SAT) which
lets an ATA device appear to be a SCSI device to the system. The motivation
for this translation is the reuse of the SCSI layer's error handling and compatibility
with distribution installers, which already know how to handle SCSI devices.
There are also a number of items slated as future work for libata. Many
drivers need better suspend/resume support, and the driver API is due for a sane
initialization model using an allocate/register/unallocate/free sequence and "Greg
blessed" kobjects. Currently, libata is written under the SCSI layer, and
debate continues on how to restructure it to minimize or eliminate its SCSI
dependence. Error handling has been substantially improved by Tejun Heo and
his changes are now in mainline. If you have had issues with SATA or libata
error handling, try an updated kernel to see if those issues have been
resolved. Tejun and others continue to add features and tune the libata stack.
Communication Breakdown: I/O and File Systems
During the morning, a number of conversations sprang up about communication
between the I/O layers and file systems. A hot topic was getting information from the
block layer about non-retryable errors that affect an entire range of bytes, and
passing that data up to user space. There are situations where retries are
attempted over a large range of bytes even when the I/O layer knows that the
entire range of blocks is missing or bad.
A "pipe" abstraction was discussed to communicate data on byte ranges that are
currently in error, under performance strain (because of a RAID5 disk failure),
or temporarily unplugged. If a file system were aware of ranges that are
currently handling a recoverable error, have unrecoverable errors or are
temporarily slow, it may be able to handle the situation more gracefully.
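No such interface exists today; purely as an illustration of the kind of
information such a pipe might carry, a hypothetical event record could look
something like this (all names invented):

    /* Purely hypothetical sketch of an event record that a block-range
     * notification "pipe" might deliver to interested file systems; nothing
     * like this exists in the kernel and the names are invented. */
    #include <stdint.h>

    enum blk_range_state {
        BLK_RANGE_RECOVERABLE_ERROR,   /* retries in progress */
        BLK_RANGE_UNRECOVERABLE_ERROR, /* data in this range is gone */
        BLK_RANGE_DEGRADED,            /* e.g. RAID5 running without a disk */
        BLK_RANGE_UNPLUGGED,           /* device temporarily missing */
    };

    struct blk_range_event {
        uint64_t start_sector;         /* first affected sector */
        uint64_t nr_sectors;           /* length of the affected range */
        enum blk_range_state state;    /* what is wrong with the range */
        uint32_t est_duration_ms;      /* hint: how long it may last */
    };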
File systems currently do not receive unplug events and handling unplug
situations can be tricky. For example, if a fibre channel disk is pulled for a
moment and plugged back in it may be down for only 30 seconds but how should
the file system handle the situation? Ext3 currently remounts the entire file
system read-only. XFS has a configurable timeout for fibre channel disks
that must expire before it returns an EIO error. And what should be done
with USB drives that are unplugged? Should the file system save state and hope
the device gets plugged back in? How long should it wait, and should it still
work if the drive is plugged into a different hub? All of these questions were raised,
but there are no clear answers.
The Filesystems Track
The workshop split into different tracks; your author decided to follow the
one dedicated to filesystems.
Security Attributes
Michael Halcrow, eCryptFS developer, presented an idea to use SELinux to
make file encryption and decryption dependent on which application is accessing the
file. For example, a policy could be defined so that the data would be decrypted when
OpenOffice is using the file, but encrypted when the user copies the file to a USB key.
After presenting the mechanism and markup language for this idea, Michael
opened the floor
to the audience. The general feeling was that SELinux is often disabled
by users, and that per-mount-point encryption may be a more useful and easier-to-understand
user interface.
Why Linux Sucks for Stacking
Josef Sipek, Unionfs
maintainer, went over some of the issues involved with stacking file systems
under Linux. A stacking file system, like Unionfs, provides an alternative
view of a lower file system. For example, Unionfs takes a number of mounted
directories, which could be on NFS, ext3, or other file systems, as arguments at
mount time and merges their namespaces.
The big unsolved issue with stacking file systems is handling modifications to
the lower file systems in the stack. Several people suggested that leaving the
lower file system available to the user is just broken and that by default the
lower layers should only be mounted internally.
The new fs/stack.c file was discussed as well. This file currently contains
simple inode-copying routines that are used by Unionfs and eCryptfs; in the
future, more stackable file system helpers should be pushed into this file.
Future work for Unionfs includes getting it working under lockdep and
additional experimentation with an on-disk format. The on-disk format for
Unionfs is currently under development; it will store white-out files
(representing files which have been deleted by a user but which still exist on
the lower-level filesystems) and
persistent Unionfs inode data.
B-trees for a Shadowed FS
Many file systems use B-trees to represent files and directories. These
structures keep data sorted, are balanced, and allow for insertion and deletion
in logarithmic time. However, there are difficulties in using them with
shadowing. Ohad Rodeh presented his approach to using B-trees and shadowing in
an object storage device, but the methods are general and useful for any
application.
Shadowing may also be called copy-on-write (COW); the basic idea is that
when a block is modified, it is read into memory, changed, and written to a
new location on disk. The tree is then updated recursively, propagating new
block locations from the modified node up toward the root, until the root node
itself is atomically updated. In this way the data is never in an inconsistent
state: if the system crashes before the root
node is updated, the write is lost but the previous contents remain intact.
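As a toy illustration of the idea (an in-memory binary tree standing in for an
on-disk B-tree, with malloc() standing in for allocating a new disk block):

    /* Toy sketch of a shadowed (copy-on-write) update: instead of modifying
     * nodes in place, every node on the path from the changed node to the
     * root is copied, and only the final root pointer swap makes the new
     * version visible. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node {
        int key;
        int value;
        struct node *left, *right;
    };

    static struct node *copy_node(const struct node *n)
    {
        struct node *c = malloc(sizeof(*c));
        *c = *n;          /* shadow copy: same contents, new location */
        return c;
    }

    /* Return the root of a new tree version with key set to value; the old
     * version is left completely intact. */
    static struct node *cow_insert(const struct node *n, int key, int value)
    {
        struct node *c;

        if (n == NULL) {
            c = calloc(1, sizeof(*c));
            c->key = key;
            c->value = value;
            return c;
        }

        c = copy_node(n);
        if (key < n->key)
            c->left = cow_insert(n->left, key, value);
        else if (key > n->key)
            c->right = cow_insert(n->right, key, value);
        else
            c->value = value;
        return c;
    }

    int main(void)
    {
        struct node *v1 = cow_insert(NULL, 10, 100);
        v1 = cow_insert(v1, 5, 50);

        /* Publish a new version by swapping the root pointer; on disk this
         * would be the single atomic root update described above. */
        struct node *v2 = cow_insert(v1, 5, 55);

        printf("old version: key 5 = %d\n", v1->left->value);   /* 50 */
        printf("new version: key 5 = %d\n", v2->left->value);   /* 55 */
        return 0;
    }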
Replicating the details of his presentation would be a wasted effort as his
paper, B-trees,
Shadowing and Clones [PDF], is well written and easy to read. Enjoy!
eXplode the code
Storage systems have a simple and important contract to keep: given user data
they must save that data to disk without loss or corruption even in the face of
system crashes. Can Sar gave an overview of eXplode [PDF], a systematic
approach to finding bugs in storage systems.
eXplode systematically explores every possible choice at each decision point in
the code, making low-probability events, or corner cases, just as likely to be
exercised as the common path. It does this exploration on a real
running system with minimal modifications.
This system has the advantage of being conceptually simple and very effective.
Bugs were found in every major Linux file system, including an fsync bug that
can cause data corruption on ext2. The bug can be reproduced as follows:
create a new file, B, which recycles an indirect block from a
recently truncated file, A; then call fsync() on file B and crash the system
before file A's truncate gets to disk. There is now inconsistent data on disk
and when e2fsck tries to fix the inconsistency it corrupts file B's data. A
discussion of the bug has been started on the linux-fsdevel
mailing list.
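A rough sketch of the system call sequence involved (the file names are made up;
actually hitting the bug requires crashing at just the right moment, which is
what eXplode automates):

    /* Sketch of the syscall sequence behind the ext2 fsync bug described
     * above.  File names are made up; the crash must land between the
     * fsync() and the point where file A's truncate reaches the disk. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[64 * 1024];  /* > 48KB, so file B needs an indirect block */

        /* File A is assumed to already exist and to be large enough to use
         * indirect blocks; truncating it frees those blocks in memory only. */
        if (truncate("file_A", 0) != 0)
            perror("truncate");

        /* File B's new blocks may be allocated from the space just freed,
         * including the old indirect block. */
        int fd = open("file_B", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, buf, sizeof(buf)) < 0)
            perror("write");

        /* fsync() forces file B's blocks to disk; if the system crashes
         * before file A's truncate is written back, a later e2fsck "repair"
         * of the resulting inconsistency corrupts file B's data. */
        if (fsync(fd) != 0)
            perror("fsync");

        close(fd);
        return 0;
    }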
NFS
The second day of the file systems track started with a discussion of an NFS
race. The race appears when a client opens up a file between two writes
that occur during the same second. The client that just opened the file
will be
unaware of the second write and will keep an out-of-date version of the file in
cache. To fix the problem, a "change" attribute was suggested. This number would
be consistent across reboots, would be unit-less, and would increment on every write.
In general, everyone agreed that a change attribute is the right solution;
however, Val Henson pointed out that implementing it on legacy file systems
will be expensive and will require on-disk format changes.
Discussion then turned to NFSv4 access control lists (ACLs). Trond Myklebust
said they are becoming a standard and that Linux should support them. Andreas
Gruenbacher is working on patches to add NFSv4 ACL support to Linux, but currently
only ext3 is supported; more information can be found on the Native NFSv4 ACLs on Linux page.
A possibly difficult issue will be mapping current POSIX ACLs to NFSv4 ACLs,
but a draft document, Mapping
Between NFSv4 and Posix Draft ACLs, lays out a mapping scheme.
GFS Updates
Steven Whitehouse gave an overview of the recent changes in the Global File
System 2 (GFS2), a cluster file system where a number of peers share
access to the storage device.
The important changes include a new journal layout that can
support mmap(), splice(), and other system calls on
journaled files, page-cache-level locking, readpages() and partial writepages()
support, and the ext3-standard
ioctls used by lsattr and chattr.
readdir() was discussed at some length, particularly the ways in which it is
broken. A directory insert on GFS2 may cause a reordering of the extensible hash
structure GFS2 uses for directories, yet in order to support readdir() every hash
chain must be kept sorted. The audience generally agreed that readdir() is difficult
to implement, and Ted Ts'o suggested that someone should work through the standards
committee to get telldir(), seekdir() and readdir() fixed or eliminated.
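Part of the difficulty lies in the interface itself; a minimal sketch of the
cookie contract that telldir() and seekdir() impose on the file system:

    /* The POSIX directory interface whose semantics make life hard for
     * hash-ordered directories: telldir() hands out a cookie that seekdir()
     * must honor later, even if entries have since been inserted and the
     * hash structure reordered. */
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *dir = opendir(".");
        if (dir == NULL) {
            perror("opendir");
            return 1;
        }

        struct dirent *ent = readdir(dir);
        if (ent != NULL)
            printf("first entry: %s\n", ent->d_name);

        long cookie = telldir(dir);  /* position the FS must reproduce later */

        while ((ent = readdir(dir)) != NULL)
            printf("entry: %s\n", ent->d_name);

        seekdir(dir, cookie);        /* must resume exactly where we left off */
        ent = readdir(dir);
        if (ent != NULL)
            printf("resumed at: %s\n", ent->d_name);

        closedir(dir);
        return 0;
    }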
OCFS2
A brief OCFS2 status report was given by Mark Fasheh. Like GFS2, OCFS2 is a
cluster file system, designed to share a file system across nodes in a cluster.
The current development focus is on adding features, as the basic file system
functionality is working well.
After the status update the audience asked a few questions. The most requested
OCFS2 feature is forced unmount, and several people suggested that this should
be a future virtual file system (VFS) feature. Mark also said that users
really enjoy the easy setup of OCFS2 and the ability to use it as a local file
system. A performance hot button for OCFS2 is its large inodes, each of which
occupies an entire block.
In the future, Mark would like to mix extent and extended attribute data
in-inode to utilize all of the available space. However, as the audience
pointed out, this optimization can lead to some complex code. Mark would
also like to move to GFS's distributed lock manager.
DualFS: A New Journaling File System for Linux
DualFS is a file system by Juan Piernas that separates data and metadata into
two separate file systems. The on-disk format for the data partition is similar to ext2
without the metadata blocks. The metadata file system is a log-structured file system, a
design that allows for very fast writes, since they are always made at the head
of the log, which reduces expensive seeks. A few performance numbers were
presented; under a number of micro- and macro-benchmarks, DualFS performs
better than other Linux journaling file systems. In its current form, DualFS
uses separate partitions for data and metadata, forcing the user to answer
a difficult question: how much metadata do I expect to have?
More information, including performance comparisons, can be found on the DualFS LKML announcement and the project homepage. The currently
available code is a patch on top of 2.4.19 and can be found on SourceForge.
pNFS Object Storage Driver
Benny Halevy gave an overview of pNFS (parallel NFS), which is part of the IETF
NFSv4.1 draft and
tries to solve the single-server performance bottleneck of NFS storage systems.
pNFS is a mechanism for an NFS client to talk directly to the storage devices without
sending requests through the NFS server, fanning the storage system out across the
SAN devices. There are many proprietary systems that do a similar
thing, including EMC's High Road, IBM's TotalStorage SAN, SGI's CXFS and Sun's
QFS; having an open protocol would be a good thing.
However, Jeff Garzik was skeptical of including pNFS in the NFSv4.1 draft,
particularly because supporting pNFS fully would require the kernel to provide
implementations of all three access protocols: file storage, object storage, and
block storage. This will add significant complexity to the Linux NFSv4
implementation.
Benny explained that the pNFS implementation in Linux is modular, supporting
multiple layout-type-specific drivers. Each layout driver
dynamically registers itself using its layout type, and the NFS client calls into it
through a well-defined API. Support for specific layout types is optional; in
the absence of a layout driver for a given layout type, the NFS client
falls back to doing I/O through the server.
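Purely as an illustration of that structure (the names below are invented and
are not the actual Linux pNFS code), a pluggable layout driver interface might
look roughly like this:

    /* Purely illustrative sketch of a pluggable layout driver interface;
     * the type and function names are invented and do not correspond to the
     * real Linux pNFS implementation. */
    #include <stdint.h>
    #include <stddef.h>

    struct layout_segment;          /* opaque: describes where the data lives */

    struct pnfs_layout_driver {
        uint32_t layout_type;       /* file, object or block layout identifier */
        const char *name;

        /* Parse a layout returned by the server into a usable segment. */
        struct layout_segment *(*decode_layout)(const void *xdr, size_t len);

        /* Perform I/O directly against the storage devices; returning an
         * error makes the client fall back to I/O through the NFS server. */
        int (*read)(struct layout_segment *seg, void *buf,
                    size_t len, uint64_t offset);
        int (*write)(struct layout_segment *seg, const void *buf,
                     size_t len, uint64_t offset);
    };

    /* Invented registration hook: each optional driver would call this at
     * module load time, keyed by its layout type. */
    int pnfs_register_layout_driver(struct pnfs_layout_driver *drv);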
After this overview Benny turned to the topic of OSDs, or object based storage
devices. These devices provide a more abstract view of the disk than the
classic "array of blocks" abstraction seen in todays disks. Instead of blocks,
objects are the basic unit of an OSD, and each object contains both meta-data
and data. The disk manages the allocation of the bytes on disk and presents
the object data as a contiguous array to the system. Having this abstraction
in hardware would make file system implementation much simpler. To support
OSDs in Linux, Benny and others are working to get bi-directional SCSI command
support into the kernel, along with support for variable-length command descriptor
blocks (CDBs).
Hybrid Disks
Hybrid disks with an NVCache (flash memory) will be in consumers' hands soon.
Timothy Bisson gave an overview of this new technology. The NVCache will
have 128-256MB of non-volatile flash memory that the disk can manage as a cache
(unpinned) or the operating system can manage by pinning specified blocks to
the non-volatile memory. This technology can reduce power consumption or
increase disk performance.
To reduce power consumption the block layer can enable the NVCache Power Mode,
which tells the disk to redirect writes to the NVCache, reducing disk spin-up
operations. In this mode, the 10-minute writeback threshold of Linux laptop
mode can be removed. Another strategy is to pin all file system metadata in the
NVCache, but spin-ups will still occur on non-metadata reads. An open question
is how this pinning should be managed when two or more file systems are using
the same disk.
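No kernel interface for this has been defined; purely as an illustration, an
operating-system-managed pinning API might look something like this (all names
invented):

    /* Purely hypothetical sketch of a block-layer pinning interface for a
     * hybrid drive's NVCache; none of these names exist in the kernel. */
    #include <stdint.h>

    struct nvcache_extent {
        uint64_t start_sector;   /* first sector to pin */
        uint64_t nr_sectors;     /* length of the pinned range */
    };

    /* Pin a set of extents (e.g. file system metadata or the journal) into
     * the drive's non-volatile cache so that reads and writes to them do not
     * spin up the platters. */
    int nvcache_pin(int blkdev_fd, const struct nvcache_extent *extents,
                    unsigned int count);

    /* Release previously pinned extents back to drive-managed caching. */
    int nvcache_unpin(int blkdev_fd, const struct nvcache_extent *extents,
                      unsigned int count);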
Performance can be increased by using the NVCache as a cache for writes
requiring a long seek. In this mode, the block layer would pin the target
blocks, ensuring that the write goes to the cache instead of incurring the expensive seek.
Also, a file system can use the NVCache to store its journal and boot files for
additional performance and reduced system start-up time.
If Linux developers decide to manage the NVCache, there are many open questions.
Which layer should manage it: the file system or the block layer? What
type of API should be created to leverage the cache? Another big question is
how much punishment these caches can take; according to Timothy, it takes about
a year (under a desktop workload) to fry the cache if it is used as a
write cache.
Scaling Linux to Petabytes
Sage Weil presented Ceph, a network file system that is designed to scale to
petabytes of storage. Ceph is based on a network of object-based storage
devices, and complete copies of each object are distributed across multiple nodes
using an algorithm called CRUSH. This distribution makes it possible for nodes
to be added to and removed from the system dynamically. More information on the
design and implementation can be found on the Ceph homepage.
Conclusion
The workshop concluded with the general consensus that bringing together SATA,
SCSI and file system people was a good idea and that the status updates and
conversations were useful. However, the workshop was a bit too large for code
discussion, and more targeted workshops will need to be held to work out the
details of some of the issues discussed at LSF'07. Topics for future workshops
include the interaction between virtual memory and file systems, and extensions
that are needed in the VFS.