Version 67 of VFS

Updated 2002-05-24 09:19:01

As of Tcl 8.4a3 (with heavy debugging in 8.4a4 and 8.4a5), Tcl's filesystem is completely 'virtual filesystem aware'. In principle, this means you can divert any filesystem activity you like away from the native operating system and to something else.

In practice this means that, given appropriate extensions, any ordinary Tcl code can use the standard file commands (cd, pwd, glob, file, load) and operate on 'virtual files' without realising it. Such virtual files could be remote files (on ftp sites or over an http connection) or inside archives (zip files, for instance). The idea is that a script doesn't need to know what a file actually is or where it actually lives (note: the 'file system' command provides introspection into what kind of filesystem a file is actually on, if that information is needed).
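As a small illustration of that introspection, the sketch below wraps 'file system' in a helper. The helper name fsTypeOf is made up for this example. For an ordinary on-disk path the first element of the result is "native"; a path inside a mounted virtual filesystem would report a different name.

```tcl
# Return the name of the filesystem that backs a path.
# 'file system' returns a list; its first element names the filesystem.
# (fsTypeOf is a hypothetical helper, not part of Tcl.)
proc fsTypeOf {path} {
    return [lindex [file system $path] 0]
}

puts "[pwd] lives on a '[fsTypeOf [pwd]]' filesystem"
```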

Tcl's core doesn't come with any useful such extensions (apart from the one which lets you use the native filesystem, as ever!). But, a number of such extensions do exist -- see below.

(Historical note: the VFS idea for Tcl was originally built in pure Tcl by Matt Newman for use in TclKit, and later re-introduced at the C level by Vince Darley, as described in TIP #17). The nice thing about it being at the C level is that all Tcl extensions (including Tk) are automatically vfs-aware (or, more precisely, completely unaware that they are operating on non-native files).

Known users of Tcl's FS API

Here are the known extensions which make use of this API:

native filesystem

Tcl's core hooks the native filesystem into these APIs to provide Tcl's standard file support. This is coded in tclIOUtil.c and the various platform-specific files (tclWinFile.c, tclWinFCmd.c, etc).

prowrap

There is a provisional patch for prowrap which attempts to make it use these APIs. [L1 ] Unfortunately, prowrap's 'build' system is so unfriendly that this patch has not been tested. With a little effort, this patch would allow prowrap to support 'glob' (as well as generally have a more robust virtual filesystem).

testreporting filesystem

This is part of Tcl's test code/suite. When registered it allows all filesystem activity to be reported upon (a brief message describing each filesystem access is printed to stderr by default).

tclvfs extension

This is at http://sourceforge.net/projects/tclvfs/ (see also the tclvfs page) and should compile easily on win/unix/macos (there's just one file, vfs.c, to compile). This extension, which can then be accessed with 'package require vfs', exposes the vfs API to Tcl scripts. This means you can then write a completely new filesystem using pure Tcl.

The tclvfs extension actually comes with a library of tcl code which implements a number of such filesystems. You can mount .zip archives, ftp sites, http sites, webdav remote disks, metakit archives (as used by tclkit), and even mount Tcl's proc namespaces as a filesystem!
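A minimal sketch of mounting a zip archive, assuming the tclvfs extension and its vfs::zip package are installed. The proc name listZipContents is made up for this example, and the archive path is whatever the caller supplies. Once mounted, the archive is browsed with the ordinary glob command.

```tcl
# List the top-level entries of a zip archive by mounting it.
# Assumes tclvfs is installed; listZipContents is a hypothetical helper.
proc listZipContents {archive} {
    package require vfs::zip
    # Mount the archive on top of its own path, so the .zip file
    # behaves like a directory for as long as it stays mounted.
    set fd [vfs::zip::Mount $archive $archive]
    set entries [glob -nocomplain -directory $archive -tails *]
    vfs::zip::Unmount $fd $archive
    return [lsort $entries]
}
```

Calling it as 'listZipContents myfiles.zip' (with an archive of your own) should return the sorted names found at the top level of the archive.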


If you would like to add/use new filesystem types, you can do that in one of two ways: use 'tclvfs' and write the new filesystem in Tcl; or write a new filesystem in C (or indeed you could write/modify something like tclvfs which exposes the vfs to tcl in a different way).
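To give a feel for the first route, here is a rough sketch of a handler for a tiny read-only, in-memory filesystem. It assumes the tclvfs convention that a handler is called with a subcommand, the mount root, the path relative to the root, the full path, and any extra arguments. The demofs namespace and the hello.txt entry are invented for the example, and only a few subcommands are covered.

```tcl
# A one-file, in-memory "filesystem" (all names here are hypothetical).
namespace eval demofs {
    variable files
    array set files {hello.txt "Hello from a virtual file\n"}
}

proc demofs::handler {cmd root relative actualpath args} {
    variable files
    switch -- $cmd {
        stat {
            # A real handler would return more fields (mode, atime, ...).
            if {$relative eq ""} {
                return [list type directory size 0 mtime 0]
            }
            if {[info exists files($relative)]} {
                return [list type file \
                    size [string length $files($relative)] mtime 0]
            }
            # With tclvfs loaded, 'vfs::filesystem posixerror' would be
            # the proper way to signal this; plain 'error' keeps the
            # sketch free of dependencies.
            error "no such file: $relative"
        }
        access {
            if {$relative ne "" && ![info exists files($relative)]} {
                error "no such file: $relative"
            }
            return
        }
        matchindirectory {
            # args holds the glob pattern and a type mask; the mask is
            # ignored in this sketch.
            set pattern [lindex $args 0]
            set result {}
            foreach name [lsort [array names files $pattern]] {
                lappend result [file join $actualpath $name]
            }
            return $result
        }
        default {
            # open, deletefile, createdirectory, etc. are left out.
            error "demofs: '$cmd' not supported"
        }
    }
}
```

With the extension loaded, something like 'vfs::filesystem mount /demofs demofs::handler' would attach the handler to a mount point; reading the file would additionally need an 'open' branch that returns a channel.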


Examples

Take a look at the One-line web browser in Tcl.

See the tclvfs page for examples there.


Future Vfs Thoughts (i.e. not implemented in Tcl 8.4)!

asynchronicity

Asynchronous filesystem access would be useful, primarily for copying files, but also potentially other purposes:

    file copy -command progress http://a.b.c/foo .     

    set fd [open http://a.b.c/foo w]
    puts $fd -command progress $megabytes

    load -command progress http://a.b.c/tk84.dll

Note that both 'file rename' and 'load' may require cross-filesystem copies.
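For comparison, the closest thing available today is at the channel level: 'fcopy' accepts a -command callback and copies in the background while the event loop runs. The sketch below is only that existing mechanism wrapped in two helper procs (asyncCopy and asyncCopyDone are names invented here); it is not the proposed 'file copy -command' syntax.

```tcl
# Copy src to dst in the background. When the copy ends, the 'done'
# command prefix is called with the byte count and an error string
# (empty on success). Both proc names are hypothetical.
proc asyncCopy {src dst done} {
    set in  [open $src r]
    set out [open $dst w]
    fconfigure $in  -translation binary
    fconfigure $out -translation binary
    fcopy $in $out -command [list asyncCopyDone $in $out $done]
}

proc asyncCopyDone {in out done bytes {err ""}} {
    close $in
    close $out
    uplevel #0 $done [list $bytes $err]
}
```

A caller would pass a callback and then enter the event loop (for example with vwait) so that the copy can make progress.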

WebDAV

[future VFS thoughts: BXXP? WebDAV?]

WebDAV would be great! It should be possible to write a WebDAV implementation entirely in Tcl using the tclvfs extension. See tclvfs for a simple webdav implementation (needs work).

file locking

Many asynchronous protocols or operations would benefit from some kind of locking mechanism. It might be good to extend the interface in the future with this. (Note that any vfs can provide its own 'file attributes' both readable and writable so in principle any other features can be supported through attributes using the existing vfs code).

exec

This Tcl command effectively boils down to forking off a variety of processes and hooking their input/output/errors up appropriately. Most of this code is quite generic, and ends up in 'TclpCreateProcess' for the actual forking and execution of another process (whose name is given by 'argv[0]' in TclpCreateProcess). Would it be possible to make a Tcl_FSCreateProcess which can pass the command on either to the native filesystem or to virtual filesystems? The simple answer is "yes", given that we can simply examine 'argv[0]', see if it is a path in a virtual filesystem, and then hand it off appropriately, but could a vfs actually implement anything sensible? The kind of thing I'm thinking of is this: we mount an ftp site and would then like to execute various ftp commands directly. Now, we could use 'ftp::Quote' (from the ftp package) to send commands directly, but why not 'exec' them? If my ftp site is mounted at /tcl/ftppub, why couldn't "exec /tcl/ftppub FOO arg1 arg2" attempt a verbatim "FOO arg1 arg2" command on the ftp connection? (Or would perhaps "exec /tcl/ftppub/FOO arg1 arg2" be the command?). Similarly a Tcl 'namespace' filesystem could use 'exec' to evaluate code in the relevant namespace (of course you could just use 'namespace eval' directly, but then you couldn't hook the code up to input/output pipes).

If anyone wants to have a look at a patch which does most of this on Windows: ftp://ftp.ucsd.edu/pub/alpha/tcl/chanExec.patch should do the trick. 4 tests fail, however, so any help fixing that would be great. It should be pretty easy to move the patch over to Unix as well.

See VFS, exec and command pipelines


JCW VFS in the core offers an interesting option: taking all file system calls out of the core, and bringing them back in as an extension. This is one of the areas where I see a quite unique opportunity: with FS calls out (and using a wrapped exe such as TclKit to provide just access to runtime files stored at the end of the executable), one has a system which is provably incapable of accessing (let alone modifying) the local disk. Such a setup, with sockets still included, would offer a formidable alternative to client-side Java deployment, with all the power of scripting and Tk at its disposal. There are a few extra steps to take (such as disabling exec, and load, and piped open)*, but that's about it. The 8.4a4 core has all the logic in place to start modularizing file system calls. TclKit already fully relies on memory-mapped access to the executable for all runtime files, so we're pretty close to such a setup, IMO...

(*) There's no need to disable 'load' since it will be disabled anyway if a native filesystem isn't present. Both 'exec' and piped open end up at TclpCreateProcess (see 'exec' discussion above). This and related procedures (in tclPipe.c and tclWin|UnixPipe.c) in turn call 'TclpOpenFile, TclpCloseFile, TclpMakeFile, etc'. The question is whether we can suitably abstract all of this away into the filesystem table. The easiest would perhaps be if the generic code tclPipe.c used channels everywhere instead of TclFile file descriptors. Then there would only need to be a single filesystem-specific call 'TclpCreateProcess' which could go in the lookup table. This would then allow us (very easily) to create a version of Tcl which is incapable of accessing the local disk, since all native filesystem support could actually be in a separate library.

Does anyone understand enough about the unix/win pipeline code (remember there's no 'exec' on macos) to know whether a shift from TclFile to Channels would be a problem? See above, for a patch which does most of this: ftp://ftp.ucsd.edu/pub/alpha/tcl/chanExec.patch

Such a rewritten exec core could allow diversion of pipes to or from ordinary Tcl code as well (as in Streams). This would make a 'stream' a real, robust, part of Tcl.


Need: an internal VFS

KBK (10 April 2002): One critical piece that I think the Tcl Core needs is an internal VFS: one that simply packages files into the executable. By now, there are a number of files that must be installed at some well-known place in the filesystem (designated by $::tcl_library) and available at run time, complicating deployment of Tcl-based applications. These include:

  • The standard scripts, init.tcl, auto.tcl, package.tcl, etc., without which the Tcl interpreter itself doesn't work.
  • The standard packages, http, msgcat, and tcltest.
  • The encoding files, without which the Tcl system finds itself unable to read files in any encoding except UTF-8.

Soon to come will no doubt be:

  • Message catalogs for localization of error messages, formats of numbers, times, and currencies, names of languages and countries, and so on; these are fundamental to internationalized systems.
  • Time zone information files. Tcl has been too long at the mercy of the implementation of strftime on the local system; there are a number of bugs in time zone management that simply are not fixable without local definition of time zone names, offsets, and DST rules. I've been contemplating some fairly bizarre hacks on this issue to avoid confronting the question of how to distribute the roughly five hundred files of a few hundred bytes each that any reasonably complete time-zone library would need.
  • Additional bundled packages as we move to larger "batteries included" distributions.

In addition, a universally-available internal VFS is a common component for scripted document support, and for single-file packaging tools such as freewrap and the prowrap component of TclPro. It's an answer to a commonly asked question: How can I compile Tcl type scripts into binary code?

There are several open issues with actually getting an internal VFS into place.

  • Fabricating the VFS. There is an open question about how to get the VFS content installed in the executable. Executable file formats vary, and compiler-based solutions are problematic (many compilers, for instance, have inconveniently small fixed limits on the size of compiled-in data). For this reason, the internal VFS probably must be optional, with the Tcl system capable of running out of a native filesystem.
  • Choice of 'mount point' name. The VFS must have a name that cannot collide with any path name in the native filesystem. While awkward, this issue can probably be managed, for instance by naming its mount point to be the same as the result of [info nameofexecutable]. It's still something we have to settle on before we pass a formal TIP on the subject.
  • Choice of format. ZIP/JAR format is probably the best choice, since there is a wealth of code (for instance, the ZIP filesystem of Tobe [L2 ]) available for the purpose, and the format is familiar. There are nevertheless several competitors, for instance Jean-Claude Wippler's successful VFS based on MetaKit. Once again, a decision will have to be made.

I don't see any of these problems as insurmountable, and I think this is a direction we should be going. I welcome the comments of others!


JCW - Cool, very cool, Kevin. Let me take you up on that offer:

First, as you may have noticed, I've repositioned your remarks - they were placed in a weird spot (in the middle of a sentence?). Perhaps your browser played tricks on you?

Your comments were most enlightening to me. It's good to make the VFS concept a central one - even when ultimately it means there are no issues, since it's all the same regardless of what handler is used.

But from another angle, the terminology "internal" VFS reflects what I think might be an inappropriate perspective: "Tcl is the center of the universe, let's extend it and internalize more". I can understand this approach - given the history and how it must have been a struggle to make Tcl become more of a foundation for application development over the years, as opposed to JO's early positioning as an add-on C library.

I consider persistence to be a more effective center of the universe, if there needs to be one at all. Or in different words: that which happens when a system stops doing things and you return to it later. Code and programs (and interpreters, at a different level) act as "operators" on data. Code evolves and gets replaced - data is a passive component (conversion happens, but is always painful). Data is the fixed point, code comes and goes.

So from where I stand, I'm far more inclined to say that there is a data model, such as a hierarchical file system (this itself may one day need to be revisited, btw), and that some of its objects happen to be runnable and able to make things come to life. Tcl is not the heart, but a (major) part of it, "/" is the center of things.

That perspective highlights an option I've been mulling over for some time: taking all native file-system access out of the Tcl core, and bringing it back as an extension using VFS. So "/" would start out unattached to any real-world data, and Tcl then starts mounting things like the local file system (or part of it), and the VFS it needs to access the standard runtime files you described above. The value of this should not be under-estimated: a scripting infrastructure which can be stripped down to not contain a single local filesystem call can be of great importance for some deployment scenarios. Think for example of a network-client which runs locally but cannot touch the local disk, not even if hacked. One could still access the local VFS btw, by using a read-only memory-mapped file (as MetaKit does).

There's a bootstrap problem when "/" is the center of everything, just as there is for a CPU trying to figure out how to boot a "file" off a hard disk before it understands the file system. Addressing that in a generic manner would help all VFS implementations greatly.

The choice of ZIP vs. MetaKit has a number of dimensions. Right now, ZIP cannot be written from VFS (though a trivial bit of Tcl, such as [L3 ] could fix that). More important to me, is the transaction model which MetaKit supports, meaning that changes are safe - even if the process is interrupted. That too can be had in other ways, transactions are no rocket science - I just never bothered because I already had a solution. Convenience and acceptance come to mind, which was no doubt a key reason to pick ZIP in Java as well. Encryption will become an issue. Performance is an issue as well, an area in which I consider MK superior to ZIP because its file index (zip would call it the "catalog") can be traversed on demand.

What this tells me is that the choice of "vfs handler" should be kept open. There is value in having more than one storage implementation, and people will gain from having a trade-off option to pick one over the other. Or to come up with yet better designs, for that matter. One could technically build a tclkit (which is MK centric) to use ZIP archives as its "internal VFS", but the generality of being able to treat the tclkit executable itself as simply another scripted document, is starting to simplify things considerably. To illustrate: the latest SDX utility [L4 ] makes the construction of single-file executables like ProWrap and FreeWrap trivial - so that all deployment scenarios from 1 to N files are treated uniformly. The one restriction that keeps popping up is that executables can only run in read-only mode - it makes sense, but it means 1-file deployment can never use a r/w VFS setup. This OS-imposed limitation means ZIP and MK become equivalent for 1-file use. It's the 2-or-more-files deployment scenarios which benefit from MK, which is what scripted documents are all about.

Note that file format issues are just about to become irrelevant in normal use - as one can "VFS-mount" any of them and then simply do a (recursive) "file copy" in Tcl. If zlib compress/decompress is added to the Tcl core, one can write VFS readers for both ZIP and MK SD's in pure Tcl. In fact, I have that code for both - it just needs some work to clean things up and release it all as a set of packages. Creating ZIP files from pure Tcl is easy, as mentioned before. Creating MK datafiles from pure Tcl is not - the file format of MK is non-trivial, and aimed at high performance and at a much larger set of uses than zip archives. But it's a matter of time, because that is exactly where I intend to go.

A last comment on why perhaps the term "internal VFS" may not be optimal long term: suppose one uses a net-based VFS handler, that would allow fetching all run-time files off the net in whatever format one likes. Not very different from running off NFS, but there may be more aspects to this, such as security or ownership issues.


Vince - Just a few comments on the very interesting discussion above

Certainly Tcl needs to support some sort of packaging by default, with zip/jar being indeed the obvious choice. I also see why it would not be the best idea to fix on a single choice. There are three reasons for this: (i) given the generic VFS support in Tcl, Tcl doesn't actually care how it gets hold of the files at all, (ii) JCW and perhaps others might prefer to build things using other filesystems (metakit), (iii) we might even want to initialise Tcl off a network.

So, it seems as if what Tcl needs, extremely early in the startup sequence, is a call to a new TclInitializeFilesystem, the result of which must be that there is at least one entry in Tcl's filesystem lookup table. Then the rest of Tcl's startup proceeds as normal. The default TclInitializeFilesystem will register the &nativeFilesystem table (from tclIOUtil.c), but as long as there is a mechanism (whether this is compile-time or not, I'm not sure) for a different action to be taken, then a Zip, Metakit, or Network filesystem could be registered instead, as appropriate.


KBK (11 April 2002) -- OK, I think we're all reading from the same page here. I wasn't trying to suggest that there necessarily be one and only one choice for an "internal" VFS - one that Core functionality can use for packaging. Rather, I was trying to say that there should be at least one internal VFS packaged with the Core, so that we can be more weakly dependent on the host FS.

I think we could still salvage the use of [info nameofexecutable] as a mount point, if we introduce another layer of naming beneath it; perhaps the initialization should mount Tcl's own internal VFS as [file join [info nameofexecutable] Tcl] or something.
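As an illustration only, the naming scheme suggested above works out as follows. The "Tcl" path component comes from the suggestion itself; the idea that init.tcl would sit directly beneath it is an assumption made for this sketch.

```tcl
# Hypothetical mount point for a core-internal VFS: a "Tcl" entry
# beneath the path of the running executable. Nothing is mounted here;
# the script only shows what the resulting names would look like.
set mountPoint [file join [info nameofexecutable] Tcl]
puts "internal VFS would be mounted at: $mountPoint"
puts "a startup script could then be:   [file join $mountPoint init.tcl]"
```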


JCW Vince's TclInitializeFilesystem sounds like a very good idea to me. It de-couples a key policy decision.

W.r.t. [info nameofexe] I am not sure. It assumes that what Tcl considers "/" maps to some real world "/".

We do need a spot to anchor information so the startup process can access "files" such as init.tcl (at least that's what they look like).

Ponder, ponder... how about taking the model of VFS being the implementation of Uniform Resource Locators. An idea launched on the chat earlier today: when one sees a path "haha://blah", then what that really means is that the VFS handler called "haha" deals with "blah". Plus on top the powerful concept of mounting, so things may switch schemes without knowledge of the URL user.

Now imagine (actually it exists) a file system which maps into Tcl's namespaces and vars/arrays. Let's call it the "var" scheme. I.e.

 set fd [open var://tcl_platform/platform]
 puts [gets $fd]
 close $fd

Maybe the name "inner" would be more accurate.

Now, drop the assumption that real files need to be mounted in certain spots for Tcl to find them. One thing we do know, is that the "var" scheme always exists (every Tcl Interp has vars, or at least the capability to have them).

Drop all library searching logic, and *define* Tcl startup as doing:

 source var://tcl/config/boot.tcl

No if's, no but's, always.

Then startup becomes a matter of putting the right stuff in ::tcl::config(boot.tcl) - it'd be totally flexible, wouldn't it? It could be two-stage, i.e. mount something else, and source the "real" init.tcl?
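A rough emulation of that idea in today's Tcl, with no "var" filesystem involved: the boot script is stored in ::tcl::config(boot.tcl) and evaluated at global level, which is approximately what sourcing it through such a scheme would do. The proc name bootFromVar and the ::bootLog variable are invented for the sketch, and stage two is only indicated by a comment.

```tcl
namespace eval ::tcl {}

# Stage one: the boot script is plain data held in an array element.
set ::tcl::config(boot.tcl) {
    set ::bootLog {}
    lappend ::bootLog "stage 1: boot.tcl running"
    # Stage two would mount a filesystem here and source the real init.tcl.
    lappend ::bootLog "stage 2: would mount and source init.tcl"
}

# Stand-in for 'source var://tcl/config/boot.tcl' (hypothetical helper).
proc bootFromVar {} {
    uplevel #0 $::tcl::config(boot.tcl)
}

bootFromVar
puts [join $::bootLog \n]
```

Changing what happens at startup is then purely a matter of storing a different script in that array element before bootFromVar runs.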


Vince -- just like to echo jcw's point that [info nameof] is not necessarily a good place to look, because if we haven't mounted a native filesystem the string it returns will not be very meaningful, and a lot of the core code might not like it at all (the problem is that the first parts of the path will not exist). For this reason I think the path would have to be one that is considered absolute, otherwise the same trouble will always arise. This means that either it is under '/' or it is a new toplevel 'drive' (which the new code allows quite happily, and is probably preferable in this case, since, again '/' won't necessarily exist). This could be something like what jcw suggests (var://...) or something like 'tcl://' under which everything lies....

Anyway, that is really just a 'cosmetic' issue, since whatever vfs is written to support the new Tcl filesystem, it will be pretty trivial to modify it to accept any particular mount point as its source.

