public inbox for goredo-devel@lists.cypherpunks.ru
From: "Jan Niklas Böhm" <mail@jnboehm•com>
To: goredo-devel@lists.cypherpunks.ru
Subject: Re: Suggestion to revert touching files when the hash matches (problem with hardlinks)
Date: Wed, 2 Nov 2022 23:42:22 +0100	[thread overview]
Message-ID: <b8add293-85d0-2dc3-978b-4900788fe071@jnboehm.com> (raw)
In-Reply-To: <Y2J3Wn87oBbkA2cw@stargrave.org>

> goredo (as apenwarr/redo, redo-c, baredo, redo-sh) expects that
> filesystem paths reference only a single "object", single entity it is
> aware of. To make hardlinks working, you have to change the whole
> architecture, where you explicitly expect single entity (under redo's
> control) to have possible multiple identifiers on filesystem.

Maybe I am missing something, but I am not really sure why this would 
require more than the change I suggested.  The hardlink is created by 
goredo, so both files are tracked by redo.  The two files being 
equivalent should therefore be perfectly valid in this setting; the 
only reason it falls apart is the direct mutation of the old target 
file via touch/os.Chtimes.

In general, neither file pointing to the same inode should be written 
to or mutated (nor should any file created by goredo), so it does not 
matter which link to the inode is used to modify it: both are wrong.  I 
think touching the old target file operates at the wrong level of 
abstraction.  On the flip side, using mv commits the transaction with 
the file that goredo controlled all along.

When hardlinking any file to $3, it also does not matter whether the 
source of the hardlink was redone prior to the linking, since no write 
is performed on that file.  The only difference is that the ctime and 
the link counter change, but those attributes are not vital to the 
integrity of the file data.

> I agree with Spacefrogg's answers on that thread, emphasizing that any
> expectations on any certain behaviour of hardlinked files are just wrong
> (with current widely used implementations). You should redesign your
> targets and workflow to fit redo's expectation. Maybe some
> proxy/intermediate targets will help, maybe just do not track all of
> generated files and honestly expect defined (by your .do-files)
> behaviour, where the fact of successful "foo" target completion also
> means "foo.bar" file existence (although untracked).

I am reluctant, because that would encode the dependency between foo 
and foo.bar in the code instead of using redo's own dependency 
resolution.  This is not a problem in the simple case, but it becomes 
increasingly complex the more targets and linked files interact. 
Dependency tracking is precisely what a build system excels at, so the 
proposal seems a bit unsatisfactory.

The same way that goredo expects files not to be touched by the user, 
the user should be allowed to expect that redo does not touch the files 
it created itself.

> Actually that optimisation *may* improve execution a lot when goredo is
> used on filesystems with active write-cache usage (like UFS with
> soft-updates or ZFS). With that optimisation: temporary file is created,
> then it is filled with the output, and then it is deleted -- everything
> related to it will be just dismissed from the write-cache and no real
> I/O is issued to the disk. Except for inode update, that is much more
> lightweight operation. Without that optimisation your disk is literally
> forced to create a new copy of the file, removing the old one, that is
> considerable amount of really issued I/O. You may notice that "if hsh ==
> hshPrev" check is done before fsync() is called -- so that optimisation
> works even with REDO_NO_SYNC=0. I did that optimisation exactly because
> of high I/O rate and no files contents really changed.

I was not aware of those optimizations.  But I would argue that the 
time spent building $3 will most likely dwarf the performance gain; 
outside of testing goredo itself, this will probably never bottleneck 
an application.  An interesting thing that I didn't know before, 
nonetheless.  (This I am not sure about, but if you "cp --reflink" to a 
file that is then touched, will the full copy materialize?  If yes, 
this would be a more expensive write, since the data of the target file 
is not cached.  If no, then "cp --reflink" would also not solve the 
issue in this particular case, and touching the file would break it.)

> Modification check is only intended to warn user about some unexpected
> events happened with the target under redo's observation and control.
> Target must be produced only by redo itself, under its tight control. If
> someone "external" modifies it, then in general that can be treated as
> undefined behaviour and wrong usage of the redo ecosystem itself. So
> even if file's content stays the same, but its inode is touched
> "outside" redo, then something wrong is already occurring.

That I can agree with.

I would like to put forward yet another possible solution: the number 
of links to the file could be checked prior to calling os.Chtimes, and 
the optimized procedure used only if the link count of the target file 
is 1, since then touching it has no ripple effects.  Conceptually, 
moving $3 to the target still feels cleaner, as it does not mutate file 
attributes, but I have to admit I am not aware of all the advantages 
that using touch might bring.

I do not think that using hardlinks should invalidate most assumptions 
of the build system; they usually play nice unless seriously abused 
(such as doing "ln foo $3; echo bla > $3").  But even that would be 
picked up by redo if the file "foo" was under its control, and a 
warning would be emitted.


Thread overview: 10+ messages
2022-10-31 21:37 Suggestion to revert touching files when the hash matches (problem with hardlinks) Jan Niklas Böhm
2022-11-01  6:42 ` goredo
2022-11-01  7:50   ` Jan Niklas Böhm
2022-11-01  8:21     ` goredo
2022-11-01  9:02       ` Jan Niklas Böhm
2022-11-01 11:49         ` Spacefrogg
2022-11-01 13:14           ` Jan Niklas Böhm
2022-11-02 13:57             ` Sergey Matveev
2022-11-02 22:42               ` Jan Niklas Böhm [this message]
2022-11-03  8:55                 ` Sergey Matveev