Fifty members of the Linux storage and file system communities met
February 12 and 13 in San Jose,
California, to give status updates, present new ideas, and discuss issues during
the 2007 Linux Storage and File Systems Workshop. The workshop was chaired
by Ric Wheeler and sponsored by EMC, NetApp, Panasas, Seagate and Oracle.
Day 1: Joint Session
Ric Wheeler opened with an explanation of the basic contract that storage
systems make with the user: the complete set of data will be
stored, bytes are correct and in order, and raw capacity is utilized as
completely as possible. The contract is so simple that it seems there should be no
open issues, right?
Today, this contract is met most of the time, but Ric posed a number of
questions. How do we validate that no files have been lost? How do we
verify that the
bytes are correctly stored? How can we utilize disks efficiently for small
files? How do errors get communicated between the layers?
Over the course of the next two days, some of these questions were discussed, new
ones were raised, and a few ideas were proposed. Continue reading for the details.
Ext4 Status Update
Mingming Cao gave a status update on ext4, the recent fork of the ext3 file
system. The primary goal of the fork was the move to 48-bit block numbers;
this change allows the file system to support up to 1024 petabytes (2^48 4KB blocks) of storage.
This feature was originally designed to be merged into ext3, but was seen as too disruptive. The patch is also
built on top of the new extents code. Support for more than 32,000 subdirectories
per directory will also be merged into ext4.
On top of these changes, a number of existing ext3 options will be enabled by default,
including: directory indexing, which improves file access in large directories;
"resize inodes", which reserve space in the block group descriptor table for online
growing; and 256-byte inodes. Ext3 users can use these features today with
a command like:
mkfs.ext3 -I 256 -O resize_inode,dir_index /dev/device
A number of other features are also being considered for inclusion into ext4
and have been posted on the list as RFCs.
These include a patch to add nanosecond
timestamps and one to implement persistent
file preallocation, which will be similar to posix_fallocate() but won't waste
time writing zeros to the disk.
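For reference, the user-space interface the feature mirrors is posix_fallocate();
a minimal sketch of its use follows (the file name and size are made up for
illustration):

    /* Minimal sketch: reserve 100MB for a file up front.  On file systems
     * without native preallocation support, glibc's posix_fallocate() falls
     * back to writing zeros, which is exactly the overhead the proposed
     * ext4 feature avoids. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int fd = open("bigfile", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* Ask for 100MB of space, guaranteed to be available for later
         * writes without risk of ENOSPC. */
        int err = posix_fallocate(fd, 0, 100 * 1024 * 1024);
        if (err != 0)
            fprintf(stderr, "posix_fallocate: error %d\n", err);

        return err ? EXIT_FAILURE : EXIT_SUCCESS;
    }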
Ext4 currently stores a limited number of extended attributes in-inode and has
space for one additional block of extended attribute data, but this may not be
enough to satisfy xattr-hungry applications. For example, Samba needs
additional space to support Vista's heavy use of ACLs, and eCryptFS can store
arbitrarily large keys in extended attributes. This led to the
conclusion that data on how extended attributes are actually being used needs to be
collected to help developers decide how best to implement them. Until larger extended
attributes are supported, application developers need to pay attention to the limits
on current file systems: one block on ext3 and 64KB on XFS, for example.
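For developers checking where those limits bite, the user-space API itself is
simple; a minimal sketch (the file and attribute names are made up for
illustration):

    /* Minimal sketch of the extended attribute API; the "user.comment"
     * attribute and target file are made up.  A value that exceeds the
     * file system's limit fails with ENOSPC (or E2BIG for very large
     * values). */
    #include <sys/xattr.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *value = "a small attribute value";

        if (setxattr("somefile", "user.comment", value, strlen(value), 0) != 0)
            perror("setxattr");

        char buf[256];
        ssize_t len = getxattr("somefile", "user.comment", buf, sizeof(buf));
        if (len < 0)
            perror("getxattr");
        else
            printf("user.comment = %.*s\n", (int)len, buf);

        return 0;
    }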
Online shrinking and growing were briefly discussed, and it was suggested that
online defragmentation, which is a planned feature, will be the first step
toward online shrinking. A bigger issue, however, is storage management; Ted
Ts'o suggested that the Linux file system community could learn from ZFS about how
to create easy-to-manage systems. Christoph Hellwig sees disk management
as a user-space problem that can be solved with kernel hooks, and
regards ZFS as a layering violation. Either way, it is clear that disk management
on Linux should be improved.
The fsck Problem
Zach Brown and Valerie Henson were slated to speak on the topic of file system
repair. While Val booted her laptop, she introduced us to the latest fashion:
laptop rhinestones, a great discussion piece if you are waiting on an fsck. If
Val's estimates for fsck time in 2013 come true, having a way to pass the time
will become very important.
Val presented an estimate of 2013 fsck times. She first measured an fsck of her
37GB /home partition (with 21GB in use), which took 7.5 minutes and read 1.3GB of
file system data. Next, she used projections of disk technology from Seagate to
estimate the time to fsck a circa-2013 home partition, which will be 16 times larger.
Although 2013 disks will have a five-fold bandwidth increase, seek times will
only improve about 1.2 times (to 10ms) leading to an increase in fsck time from about 8
minutes to 80 minutes! The primary reason for long fscks is seek latency, since
fsck spends most of its time seeking over the disk discovering and fetching
dynamic file system data like directory entries, indirect blocks and extents.
Reducing seeks and avoiding the associated latency penalty is the key to reducing fsck
times. Val suggested that one solution would be to keep an on-disk bitmap
tracking which blocks contain file system metadata; in the best case, fsck could
then read all of that metadata in a single sequential sweep over the disk. On the
projected 2013 disk, such a sweep would take only around 134 seconds, a large
improvement over 80 minutes. A full explanation of the findings and possible
solutions can be found in the paper Repair-Driven File System
Design [PDF]. Also, Val announced that she is working full time on a file system
called chunkfs
[PDF]
that will make speed and ease of repair a primary design goal.
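As a rough cross-check of Val's projection, the sketch below reproduces the
sequential-sweep estimate; the assumed 2007 baseline streaming rate of 32MB/s is
an illustration, not a figure from the talk:

    /* Back-of-the-envelope sketch of the "single sweep" estimate above.
     * The 2007 streaming rate of ~32MB/s is an assumption used for
     * illustration; the 16x capacity and 5x bandwidth growth figures are
     * the ones quoted in the talk. */
    #include <stdio.h>

    int main(void)
    {
        double metadata_2007_mb = 1.3 * 1024;  /* ~1.3GB of metadata read today */
        double capacity_growth  = 16.0;        /* 2013 partition is 16x larger  */
        double stream_2007_mb_s = 32.0;        /* assumed 2007 streaming rate   */
        double bandwidth_growth = 5.0;         /* bandwidth improves 5x by 2013 */

        double sweep_seconds = (metadata_2007_mb * capacity_growth) /
                               (stream_2007_mb_s * bandwidth_growth);

        /* Prints roughly 133 seconds, in line with the ~134s figure above. */
        printf("2013 sequential metadata sweep: ~%.0f seconds\n", sweep_seconds);
        return 0;
    }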
Zach Brown presented some blktrace output from e2fsck. The trace shows
that, while the disk can stream data at 26MB/s, fsck achieves only 12MB/s.
This situation could be improved to some degree, without on-disk layout changes,
if the developers had a vectorized I/O call. Zach explained that fsck often
knows the locations of the blocks it needs, but with the current API it can
only read them one at a time.
A vectorized read would take a number of buffers and a list of blocks to read
as arguments. Then the application could submit all of the reads at once.
Such a system call could save a significant amount of time since the I/O
scheduler can reorder requests to minimize seeks and merge requests that are
nearby. Also, reads to blocks that are located on different disks could be
parallelized. Although a vectorized read could speed up fsck, eventually
file system layout changes will be needed to make it faster still.
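No such vectorized read call exists; something close to it can be approximated
with the Linux asynchronous I/O interface (libaio), as in this rough sketch.
The file name and block offsets are made up, and the program must be linked
with -laio:

    /* Rough sketch of batching reads at scattered offsets with Linux AIO,
     * an approximation of the vectorized read call discussed above. */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NR_READS 3
    #define BLOCK_SIZE 4096

    int main(void)
    {
        /* Made-up block locations that a tool like fsck might already know. */
        long long offsets[NR_READS] = { 0, 1 << 20, 8 << 20 };
        struct iocb cbs[NR_READS], *cbp[NR_READS];
        struct io_event events[NR_READS];
        io_context_t ctx = 0;
        int fd = open("testfile", O_RDONLY | O_DIRECT);

        if (fd < 0 || io_setup(NR_READS, &ctx) < 0) {
            fprintf(stderr, "setup failed\n");
            return EXIT_FAILURE;
        }

        for (int i = 0; i < NR_READS; i++) {
            void *buf;
            if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0)
                return EXIT_FAILURE;
            /* Queue one read per known block location. */
            io_prep_pread(&cbs[i], fd, buf, BLOCK_SIZE, offsets[i]);
            cbp[i] = &cbs[i];
        }

        /* Submit every read in one call; the I/O scheduler can now sort
         * and merge the requests to minimize seeks. */
        if (io_submit(ctx, NR_READS, cbp) != NR_READS) {
            fprintf(stderr, "io_submit failed\n");
            return EXIT_FAILURE;
        }

        printf("%d reads completed\n",
               io_getevents(ctx, NR_READS, NR_READS, events, NULL));
        io_destroy(ctx);
        return EXIT_SUCCESS;
    }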
libata: bringing the ATA community together
Jeff Garzik gave an update on the progress of libata, the in-kernel library to
support ATA hosts and devices. He first presented the ATAPI/SATA features that
libata now supports, including PATA with C/H/S addressing, NCQ, FUA, SCSI/ATA
translation (SAT), and CompactFlash. The growing support for parallel ATA (PATA)
drives in libata will eventually allow the old IDE driver to be deprecated;
Fedora developers are helping to accelerate testing and adoption of the libata
PATA code by disabling the IDE driver in Fedora 7 test 1.
Native Command Queuing (NCQ) is a new command protocol introduced in the SATA
II extensions and is now supported under libata. With NCQ the host can have
multiple outstanding requests on the drive at once. The drive can reorder and
reschedule these requests to improve disk performance. A useful feature of NCQ
drives is the force unit access (FUA) bit: a write command with this bit set will
not return success until the data has reached the media. This makes it possible
for the kernel to have both synchronous and
non-synchronous commands in flight at the same time. There was a recent discussion
about both NCQ FUA and SATA FUA in libata on the linux-ide mailing list.
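From user space, the usual way to ask for these semantics is a synchronous write;
on FUA-capable hardware the kernel can, in favorable cases, satisfy the request
with a FUA-tagged write rather than a full write-cache flush. A minimal sketch
(the file name is made up for illustration):

    /* Minimal sketch: an O_DSYNC write does not return until the data is on
     * stable storage.  On a drive that supports FUA the kernel can, in
     * favorable cases, implement this with a FUA-tagged write instead of a
     * full cache flush. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("journal.dat", O_CREAT | O_WRONLY | O_DSYNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        const char *record = "commit record\n";
        if (write(fd, record, strlen(record)) < 0)
            perror("write");

        close(fd);
        return 0;
    }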
Jeff briefly discussed libata's support for SCSI ATA translation (SAT) which
lets an ATA device appear to be a SCSI device to the system. The motivation
for this translation is the reuse of the SCSI layer's error handling and compatibility
with distribution installers, which already know how to handle SCSI devices.
There are also a number of items slated as future work for libata. Many
drivers need better suspend/resume support, and the driver API is due for a sane
initialization model using an allocate/register/unallocate/free sequence and "Greg
blessed" kobjects. Currently, libata is written under the SCSI layer, and
debate continues on how to restructure it to minimize or eliminate its SCSI
dependence. Error handling has been substantially improved by Tejun Heo and
his changes are now in mainline. If you have had issues with SATA or libata
error handling, try an updated kernel to see if those issues have been
resolved. Tejun and others continue to add features and tune the libata stack.
Communication Breakdown: I/O and File Systems
During the morning, a number of conversations sprang up about communication
between the I/O layers and file systems. A hot topic was getting information from the
block layer about non-retryable errors that affect an entire range of bytes, and
passing that data up to user space. There are situations where retries are
attempted over a large range of bytes even when the I/O layer knows that the
entire range of blocks is missing or bad.
A "pipe" abstraction was discussed to communicate data on byte ranges that are
currently in error, under performance strain (because of a RAID5 disk failure),
or temporarily unplugged. If a file system were aware of ranges that are
currently handling a recoverable error, have unrecoverable errors or are
temporarily slow, it may be able to handle the situation more gracefully.
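No such interface exists today; purely as an illustration of the kind of
information such a pipe might carry, a hypothetical event record could look
something like this (all names invented):

    /* Purely hypothetical sketch of an event record that a block-range
     * notification "pipe" might deliver to interested file systems; nothing
     * like this exists in the kernel and the names are invented. */
    #include <stdint.h>

    enum blk_range_state {
        BLK_RANGE_RECOVERABLE_ERROR,   /* retries in progress */
        BLK_RANGE_UNRECOVERABLE_ERROR, /* data in this range is gone */
        BLK_RANGE_DEGRADED,            /* e.g. RAID5 running without a disk */
        BLK_RANGE_UNPLUGGED,           /* device temporarily missing */
    };

    struct blk_range_event {
        uint64_t start_sector;         /* first affected sector */
        uint64_t nr_sectors;           /* length of the affected range */
        enum blk_range_state state;    /* what is wrong with the range */
        uint32_t est_duration_ms;      /* hint: how long it may last */
    };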
File systems currently do not receive unplug events and handling unplug
situations can be tricky. For example, if a fibre channel disk is pulled for a
moment and plugged back in it may be down for only 30 seconds but how should
the file system handle the situation? Ext3 currently remounts the entire file
system read-only. XFS has a configurable timeout for fibre channel disks
that must expire before it returns an EIO error. And what should be done
with USB drives that are unplugged? Should the file system save state and hope
the device gets plugged back in? How long should it wait, and should it still
work if the drive is plugged into a different hub? All of these questions were raised,
but there are no clear answers.
The Filesystems Track
The workshop split into different tracks; your author decided to follow the
one dedicated to filesystems.
Security Attributes
Michael Halcrow, eCryptFS developer, presented an idea to use SELinux to
make file encryption and decryption dependent on which application is accessing the
file. For example, a policy could be defined so that the data would be decrypted when
OpenOffice is using the file, but encrypted when the user copies the file to a USB key.
After presenting the mechanism and markup language for this idea, Michael
opened the floor
to the audience. The general feeling was that SELinux is often disabled
by users, and that per-mount-point encryption may be a more useful and easier-to-understand
user interface.
Why Linux Sucks for Stacking
Josef Sipek, Unionfs
maintainer, went over some of the issues involved with stacking file systems
under Linux. A stacking file system, like Unionfs, provides an alternative
view of a lower file system. For example, Unionfs takes a number of mounted
directories, which could be on NFS, ext3, or other file systems, as arguments at
mount time and merges their namespaces.
The big unsolved issue with stacking file systems is handling modifications to
the lower file systems in the stack. Several people suggested that leaving the
lower file system available to the user is just broken and that by default the
lower layers should only be mounted internally.
The new fs/stack.c file was discussed as well. This file currently contains
simple inode-copying routines that are used by Unionfs and eCryptfs; in the
future, more stackable file system helpers should be pushed into this file.
Future work for Unionfs includes getting it working under lockdep and
additional experimentation with an on-disk format. The on-disk format for
Unionfs is currently under development; it will store white-out files
(representing files which have been deleted by a user but which still exist on
the lower-level filesystems) and
persistent Unionfs inode data.
B-trees for a Shadowed FS
Many file systems use B-trees to represent files and directories. These
structures keep data sorted, are balanced, and allow for insertion and deletion
in logarithmic time. However, there are difficulties in using them with
shadowing. Ohad Rodeh presented his approach to using B-trees and shadowing in
an object storage device, but the methods are general and useful for any
application.
Shadowing may also be called copy-on-write (COW); the basic idea is that
when a block is modified, it is read into memory, changed, and written to a
new location on disk. The tree is then updated recursively, propagating new
block locations from the modified node up toward the root, until the root node
itself is atomically updated. In this way the data is never in an inconsistent
state: if the system crashes before the root
node is updated, the write is lost but the previous contents remain intact.
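As a toy illustration of the idea (an in-memory binary tree standing in for an
on-disk B-tree, with malloc() standing in for allocating a new disk block):

    /* Toy sketch of a shadowed (copy-on-write) update: instead of modifying
     * nodes in place, every node on the path from the changed node to the
     * root is copied, and only the final root pointer swap makes the new
     * version visible. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node {
        int key;
        int value;
        struct node *left, *right;
    };

    static struct node *copy_node(const struct node *n)
    {
        struct node *c = malloc(sizeof(*c));
        *c = *n;          /* shadow copy: same contents, new location */
        return c;
    }

    /* Return the root of a new tree version with key set to value; the old
     * version is left completely intact. */
    static struct node *cow_insert(const struct node *n, int key, int value)
    {
        struct node *c;

        if (n == NULL) {
            c = calloc(1, sizeof(*c));
            c->key = key;
            c->value = value;
            return c;
        }

        c = copy_node(n);
        if (key < n->key)
            c->left = cow_insert(n->left, key, value);
        else if (key > n->key)
            c->right = cow_insert(n->right, key, value);
        else
            c->value = value;
        return c;
    }

    int main(void)
    {
        struct node *v1 = cow_insert(NULL, 10, 100);
        v1 = cow_insert(v1, 5, 50);

        /* Publish a new version by swapping the root pointer; on disk this
         * would be the single atomic root update described above. */
        struct node *v2 = cow_insert(v1, 5, 55);

        printf("old version: key 5 = %d\n", v1->left->value);   /* 50 */
        printf("new version: key 5 = %d\n", v2->left->value);   /* 55 */
        return 0;
    }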
Replicating the details of his presentation would be a wasted effort as his
paper, B-trees,
Shadowing and Clones [PDF], is well written and easy to read. Enjoy!
eXplode the code
Storage systems have a simple and important contract to keep: given user data
they must save that data to disk without loss or corruption even in the face of
system crashes. Can Sar gave an overview of eXplode [PDF], a systematic
approach to finding bugs in storage systems.
eXplode systematically explores every possible choice at each decision point in
the code, making low-probability events, or corner cases, just as likely to be
exercised as the common path. It does this exploration on a real
running system with minimal modifications.
This system has the advantage of being conceptually simple and very effective.
Bugs were found in every major Linux file system, including an fsync bug that
can cause data corruption on ext2. The bug can be reproduced as follows:
create a new file, B, which recycles an indirect block from a
recently truncated file, A; then call fsync() on file B and crash the system
before file A's truncate gets to disk. There is now inconsistent data on disk
and when e2fsck tries to fix the inconsistency it corrupts file B's data. A
discussion of the bug has been started on the linux-fsdevel
mailing list.
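A rough sketch of the system call sequence involved (the file names are made up;
actually hitting the bug requires crashing at just the right moment, which is
what eXplode automates):

    /* Sketch of the syscall sequence behind the ext2 fsync bug described
     * above.  File names are made up; the crash must land between the
     * fsync() and the point where file A's truncate reaches the disk. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[64 * 1024];  /* > 48KB, so file B needs an indirect block */

        /* File A is assumed to already exist and to be large enough to use
         * indirect blocks; truncating it frees those blocks in memory only. */
        if (truncate("file_A", 0) != 0)
            perror("truncate");

        /* File B's new blocks may be allocated from the space just freed,
         * including the old indirect block. */
        int fd = open("file_B", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, buf, sizeof(buf)) < 0)
            perror("write");

        /* fsync() forces file B's blocks to disk; if the system crashes
         * before file A's truncate is written back, a later e2fsck "repair"
         * of the resulting inconsistency corrupts file B's data. */
        if (fsync(fd) != 0)
            perror("fsync");

        close(fd);
        return 0;
    }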
NFS
The second day of the file systems track started with a discussion of an NFS
race. The race appears when a client opens up a file between two writes
that occur during the same second. The client that just opened the file
will be
unaware of the second write and will keep an out-of-date version of the file in
cache. To fix the problem, a "change" attribute was suggested. This number would
be consistent across reboots, would be unit-less, and would increment on every write.
In general, everyone agreed that a change attribute is the right solution;
however, Val Henson pointed out that implementing it on legacy file systems
will be expensive and will require on-disk format changes.
Discussion then turned to NFSv4 access control lists (ACLs). Trond Myklebust
said they are becoming a standard and that Linux should support them. Andreas
Gruenbacher is working on patches to add NFSv4 ACL support to Linux, but currently
only ext3 is supported; more information can be found on the Native NFSv4 ACLs on Linux page.
A possibly difficult issue will be mapping current POSIX ACLs to NFSv4 ACLs,
but a draft document, Mapping
Between NFSv4 and Posix Draft ACLs, lays out a mapping scheme.
GFS Updates
Steven Whitehouse gave an overview of the recent changes in the Global File
System 2 (GFS2), a cluster file system where a number of peers share
access to the storage device.
The important changes include a new journal layout that can
support mmap(), splice(), and other system calls on
journaled files, page-cache-level locking, readpages() and partial writepages()
support, and the ext3-standard
ioctls used by lsattr and chattr.
readdir() was discussed at some length, particularly the ways in which it is
broken. A directory insert on GFS2 may cause a reordering of the extensible hash
structure GFS2 uses for directories, yet in order to support readdir() every hash
chain must be kept sorted. The audience generally agreed that readdir() is difficult
to implement, and Ted Ts'o suggested that someone should work through the standards
committee to get telldir(), seekdir() and readdir() fixed or eliminated.
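Part of the difficulty lies in the interface itself; a minimal sketch of the
cookie contract that telldir() and seekdir() impose on the file system:

    /* The POSIX directory interface whose semantics make life hard for
     * hash-ordered directories: telldir() hands out a cookie that seekdir()
     * must honor later, even if entries have since been inserted and the
     * hash structure reordered. */
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *dir = opendir(".");
        if (dir == NULL) {
            perror("opendir");
            return 1;
        }

        struct dirent *ent = readdir(dir);
        if (ent != NULL)
            printf("first entry: %s\n", ent->d_name);

        long cookie = telldir(dir);  /* position the FS must reproduce later */

        while ((ent = readdir(dir)) != NULL)
            printf("entry: %s\n", ent->d_name);

        seekdir(dir, cookie);        /* must resume exactly where we left off */
        ent = readdir(dir);
        if (ent != NULL)
            printf("resumed at: %s\n", ent->d_name);

        closedir(dir);
        return 0;
    }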
OCFS2
A brief OCFS2 status report was given by Mark Fasheh. Like GFS2, OCFS2 is a
cluster file system, designed to share a file system across nodes in a cluster.
The current development focus is on adding features, as the basic file system
functionality is working well.
After the status update the audience asked a few questions. The most requested
OCFS2 feature is forced unmount, and several people suggested that this should
be a future virtual file system (VFS) feature. Mark also said that users
really enjoy the easy setup of OCFS2 and the ability to use it as a local file
system. A performance hot button for OCFS2 is its large inodes, each of which
occupies an entire block.
In the future, Mark would like to mix extent and extended attribute data
in-inode to utilize all of the available space. However, as the audience
pointed out, this optimization can lead to some complex code. Mark would
also like to move to GFS's distributed lock manager.
DualFS: A New Journaling File System for Linux
DualFS is a file system by Juan Piernas that separates data and metadata into
two separate file systems. The on-disk format for the data partition is similar to ext2
without the metadata blocks. The metadata file system is a log-structured file system, a
design that allows for very fast writes, since they are always made at the head
of the log, which reduces expensive seeks. A few performance numbers were
presented; under a number of micro- and macro-benchmarks, DualFS performs
better than other Linux journaling file systems. In its current form, DualFS
uses separate partitions for data and metadata, forcing the user to answer
a difficult question: how much metadata do I expect to have?
More information, including performance comparisons, can be found on the DualFS LKML announcement and the project homepage. The currently
available code is a patch on top of 2.4.19 and can be found on SourceForge.
pNFS Object Storage Driver
Benny Halevy gave an overview of pNFS (parallel NFS), which is part of the IETF
NFSv4.1 draft and
tries to solve the single-server performance bottleneck of NFS storage systems.
pNFS is a mechanism for an NFS client to talk directly to the storage devices without
sending requests through the NFS server, fanning the storage system out across the
SAN devices. There are many proprietary systems that do a similar
thing, including EMC's High Road, IBM's TotalStorage SAN, SGI's CXFS and Sun's
QFS; having an open protocol would be a good thing.
However, Jeff Garzik was skeptical of including pNFS in the NFSv4.1 draft,
particularly because supporting pNFS fully would require the kernel to provide
implementations of all three access protocols: file storage, object storage, and
block storage. This will add significant complexity to the Linux NFSv4
implementation.
Benny explained that the pNFS implementation in Linux is modular, supporting
multiple layout-type-specific drivers. Each layout driver
dynamically registers itself using its layout type, and the NFS client calls into it
through a well-defined API. Support for specific layout types is optional; in
the absence of a layout driver for a given layout type, the NFS client
falls back to doing I/O through the server.
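Purely as an illustration of that structure (the names below are invented and
are not the actual Linux pNFS code), a pluggable layout driver interface might
look roughly like this:

    /* Purely illustrative sketch of a pluggable layout driver interface;
     * the type and function names are invented and do not correspond to the
     * real Linux pNFS implementation. */
    #include <stdint.h>
    #include <stddef.h>

    struct layout_segment;          /* opaque: describes where the data lives */

    struct pnfs_layout_driver {
        uint32_t layout_type;       /* file, object or block layout identifier */
        const char *name;

        /* Parse a layout returned by the server into a usable segment. */
        struct layout_segment *(*decode_layout)(const void *xdr, size_t len);

        /* Perform I/O directly against the storage devices; returning an
         * error makes the client fall back to I/O through the NFS server. */
        int (*read)(struct layout_segment *seg, void *buf,
                    size_t len, uint64_t offset);
        int (*write)(struct layout_segment *seg, const void *buf,
                     size_t len, uint64_t offset);
    };

    /* Invented registration hook: each optional driver would call this at
     * module load time, keyed by its layout type. */
    int pnfs_register_layout_driver(struct pnfs_layout_driver *drv);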
After this overview Benny turned to the topic of OSDs, or object based storage
devices. These devices provide a more abstract view of the disk than the
classic "array of blocks" abstraction seen in todays disks. Instead of blocks,
objects are the basic unit of an OSD, and each object contains both meta-data
and data. The disk manages the allocation of the bytes on disk and presents
the object data as a contiguous array to the system. Having this abstraction
in hardware would make file system implementation much simpler. To support
OSDs in Linux, Benny and others are working to get bi-directional SCSI command
support into the kernel, along with support for variable-length command descriptor
blocks (CDBs).
Hybrid Disks
Hybrid disks with an NVCache (flash memory) will be in consumers' hands soon.
Timothy Bisson gave an overview of this new technology. The NVCache will
have 128-256MB of non-volatile flash memory that the disk can manage as a cache
(unpinned) or the operating system can manage by pinning specified blocks to
the non-volatile memory. This technology can reduce power consumption or
increase disk performance.
To reduce power consumption the block layer can enable the NVCache Power Mode,
which tells the disk to redirect writes to the NVCache, reducing disk spin-up
operations. In this mode, the 10-minute writeback threshold of Linux laptop
mode can be removed. Another strategy is to pin all file system metadata in the
NVCache, but spin-ups will still occur on non-metadata reads. An open question
is how this pinning should be managed when two or more file systems are using
the same disk.
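No kernel interface for this has been defined; purely as an illustration, an
operating-system-managed pinning API might look something like this (all names
invented):

    /* Purely hypothetical sketch of a block-layer pinning interface for a
     * hybrid drive's NVCache; none of these names exist in the kernel. */
    #include <stdint.h>

    struct nvcache_extent {
        uint64_t start_sector;   /* first sector to pin */
        uint64_t nr_sectors;     /* length of the pinned range */
    };

    /* Pin a set of extents (e.g. file system metadata or the journal) into
     * the drive's non-volatile cache so that reads and writes to them do not
     * spin up the platters. */
    int nvcache_pin(int blkdev_fd, const struct nvcache_extent *extents,
                    unsigned int count);

    /* Release previously pinned extents back to drive-managed caching. */
    int nvcache_unpin(int blkdev_fd, const struct nvcache_extent *extents,
                      unsigned int count);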
Performance can be increased by using the NVCache as a cache for writes
requiring a long seek. In this mode, the block layer would pin the target
blocks, ensuring that the write goes to the cache instead of incurring the expensive seek.
Also, a file system can use the NVCache to store its journal and boot files for
additional performance and reduced system start-up time.
If Linux developers decide to manage the NVCache, there are many open questions.
Which layer should manage it: the file system or the block layer? What
type of API should be created to leverage the cache? Another big question is
how much punishment these caches can take; according to Timothy, it takes about
a year (under a desktop workload) to fry the cache if it is used as a
write cache.
Scaling Linux to Petabytes
Sage Weil presented Ceph, a network file system that is designed to scale to
petabytes of storage. Ceph is based on a network of object-based storage
devices, and complete copies of each object are distributed across multiple nodes
using an algorithm called CRUSH. This distribution makes it possible for nodes
to be added to and removed from the system dynamically. More information on the
design and implementation can be found on the Ceph homepage.
Conclusion
The workshop concluded with the general consensus that bringing together SATA,
SCSI and file system people was a good idea and that the status updates and
conversations were useful. However, the workshop was a bit too large for code
discussion, and more targeted workshops will need to be held to work out the
details of some of the issues discussed at LSF'07. Topics for future workshops
include the interaction between virtual memory and file systems, and extensions
that are needed in the VFS.