public inbox for goredo-devel@lists.cypherpunks.ru
* redo-ifchange hangs when called with many arguments
From: goredo @ 2021-10-16 15:58 UTC
  To: goredo-devel

Hi,

One of my projects resulted in an inadvertent stress test of how many arguments redo-ifchange can handle.

There was a situation where it was called with 500 arguments which resulted in a strange lockup. redo was started with `-j 5` and REDO_LOG=1. The redo-ifchange process was just sleeping without making progress, like it was waiting for pipe input or something similar. It was not waiting for disk, nor consuming any CPU time, nor launching processes. I killed it with SIGTERM after about 2 minutes. So, it seems it was actively waiting for some file descriptors and terminated normally when asked to.

Limiting the call to 50 arguments resolved the issue. I haven't iteratively narrowed it down, just cut it to 50 directly.

I also noticed before, that killing redo with C-c does not always terminate the process it spawned (through a .do file). So, maybe these two are related and there is some intricate issue with controlling sub-processes, like redo not properly waiting for its sub-processes.

–Michael

* Re: redo-ifchange hangs when called with many arguments
From: Sergey Matveev @ 2021-10-19 13:58 UTC
  To: goredo-devel

Greetings!

*** goredo [2021-10-16 15:58]:
>One of my projects resulted in an inadvertent stress test of how many arguments redo-ifchange can handle.

Well, basically it should depend solely on the OS. Except, of course, for
the maximum number of command-line arguments, which is the main limiting
factor of redo. I wonder why DJB decided to pass dependencies as arguments
(as tar does), instead of reading them from stdin (as cpio/pax do), which
would not artificially limit their number.

>The redo-ifchange process was just sleeping without making progress, like it was waiting for pipe input or something similar.

I do not remember clearly (possibly I am mistaken), but I think I met
that kind of behaviour last year when I ran several thousand targets at
once in parallel. I do not remember how I debugged and traced it, but
the problem was hitting the limit on open file descriptors, which I
fixed in the 1.2.0 release (resource usage was non-optimal, and everything
got stuck at the OS-level limits). I also noticed that my FreeBSD has a
limit of nearly two million open files, but some version of Ubuntu only
one thousand.
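
For anyone who wants to compare systems, here is a minimal Go sketch that prints the per-process open file limit. It is not taken from goredo; the helper name openFileLimit is made up for illustration. It relies only on syscall.Getrlimit with RLIMIT_NOFILE, which exists on Linux and the BSDs. Newer Go runtimes may raise the soft limit at startup, so the value reported inside a Go program may differ from what `ulimit -n` prints in the shell.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit (hypothetical helper) returns the soft and hard
// RLIMIT_NOFILE values of the current process.
func openFileLimit() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	// The field types differ between platforms, so convert explicitly.
	return uint64(rl.Cur), uint64(rl.Max), nil
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("open files: soft=%d hard=%d\n", soft, hard)
}
```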

>I also noticed before, that killing redo with C-c does not always terminate the process it spawned (through a .do file).

According to the code (and I checked it in practice), it should not
terminate any of its processes at all :-). I added "infanticide" to it
in 1.17.0 release.

-- 
Sergey Matveev (http://www.stargrave.org/)
OpenPGP: CF60 E89A 5923 1E76 E263  6422 AE1A 8109 E498 57EF

* Re: redo-ifchange hangs when called with many arguments
From: goredo @ 2021-10-19 14:56 UTC
  To: goredo-devel

> Well, basically it should depend solely on the OS. Except, of course, for
> the maximum number of command-line arguments, which is the main limiting
> factor of redo. I wonder why DJB decided to pass dependencies as arguments
> (as tar does), instead of reading them from stdin (as cpio/pax do), which
> would not artificially limit their number.

I would have expected that, too. I use xargs to pass an unknown number of arguments to redo-ifchange, as it manages the maximum argument length.

> I do not remember clearly (possibly I am mistaken), but I think I met
> that kind of behaviour last year when I ran several thousand targets at
> once in parallel. I do not remember how I debugged and traced it, but
> the problem was hitting the limit on open file descriptors, which I
> fixed in the 1.2.0 release (resource usage was non-optimal, and everything
> got stuck at the OS-level limits). I also noticed that my FreeBSD has a
> limit of nearly two million open files, but some version of Ubuntu only
> one thousand.

So, I may have hit the bug again? Should resource exhaustion not have led to the OS killing the redo-ifchange process? Or would opening another file just fail, and redo does not always check whether that succeeded? Because it really looks like it is waiting for something that never happens. In case it hits the open file limit, redo could just wait for already spawned jobs to finish, or serialise its dependency checking.

> According to the code (and I checked it in practice), it should not
> terminate any of its processes at all :-). I added "infanticide" to it
> in 1.17.0 release.

:) Thanks! I may have never noticed, because the other targets usually run quickly.

* Re: redo-ifchange hangs when called with many arguments
From: Sergey Matveev @ 2021-10-19 15:26 UTC
  To: goredo-devel

*** goredo [2021-10-19 14:56]:
>So, I may have hit the bug again?

Go and goredo should not ignore any errors from the OS. But many OSes are
known to be buggy (to have unexpected behaviour), like:
http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#OS_X_AND_DARWIN_BUGS (not related to Go)
or the previously well-known "bug" with Go's native syslog logger deadlocking:
https://github.com/golang/go/issues/5932
https://github.com/cloudfoundry/lager/issues/15
https://github.com/driskell/log-courier/issues/90
where everything was related to a bug in Linux epoll itself,
although many people treated it as Go's bug. Many people just used
workarounds for that deadlocking behaviour, but Go's authors do not want
to fix OS errors/bad behaviour. So I really believe this could
be a problem of the OS itself, because there have been so many of them already.

Twice a year I got a harsh panic deep inside Go's syscall layer
when opening a file (os.File) -- it is/was a problem of one
specific FreeBSD version.

Of course, there may be a bug in goredo's code/logic, but it should
be reproducible somehow: I have not seen any deadlocks with several
thousand targets, but all of that was done exclusively on FreeBSD.

-- 
Sergey Matveev (http://www.stargrave.org/)
OpenPGP: CF60 E89A 5923 1E76 E263  6422 AE1A 8109 E498 57EF

* Re: redo-ifchange hangs when called with many arguments
From: goredo @ 2021-10-19 16:05 UTC
  To: goredo-devel

Thanks! That is valuable information for me. I will limit argument passing to mitigate the bug.

Maybe, this warrants a note on the FAQ with descriptions of possible workarounds...

You could make redo work conservatively and not inspect more dependencies in parallel than jobs are allowed to start anyhow. (I honestly don't know at what point that many files get opened.) As most people have only a single-digit number of CPUs, this should take care of even the more limited OSes with a 1000 open files limit.

* Re: redo-ifchange hangs when called with many arguments
From: Sergey Matveev @ 2021-10-21  9:51 UTC
  To: goredo-devel

*** goredo [2021-10-19 16:05]:
>I will limit argument passing to mitigate the bug.

I think that is a good enough workaround. If I met that kind of
freeze again and could not find the reason (possibly for lack of
time to investigate it properly), I would just limit the number of
arguments (xargs/whatever).
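
As an illustration of that workaround, here is a small Go sketch of xargs-style batching. It is an assumption-laden example, not part of goredo: the helper name batches, the target names, and the batch size of 50 (the number reported earlier in this thread to work) are only illustrative. The sketch prints what it would run; a real caller would invoke redo-ifchange once per batch.

```go
package main

import "fmt"

// batches (hypothetical helper) splits args into consecutive slices
// of at most size elements, much as xargs -n does.
func batches(args []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	for len(args) > size {
		out = append(out, args[:size])
		args = args[size:]
	}
	if len(args) > 0 {
		out = append(out, args)
	}
	return out
}

func main() {
	// Made-up target names, only to have something to split.
	targets := make([]string, 0, 120)
	for i := 0; i < 120; i++ {
		targets = append(targets, fmt.Sprintf("obj/%03d.o", i))
	}
	for _, b := range batches(targets, 50) {
		// A real caller would run redo-ifchange here with b as arguments.
		fmt.Printf("redo-ifchange with %d targets (%s .. %s)\n",
			len(b), b[0], b[len(b)-1])
	}
}
```

Batches run one after another, so each invocation only has to track its own targets; the price is less parallelism across batch boundaries.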

>Maybe, this warrants a note on the FAQ with descriptions of possible workarounds...

Well, I assume that with literally every program you have to keep in
mind the OS environment it is running under. So that would be superfluous
information. And everybody who uses redo will already know about xargs
and similar tools. In my humble opinion, of course.

>You could make redo work conservatively and not inspect more dependencies in parallel than jobs are allowed to start anyhow.
>(I honestly don't know at what point that many files get opened.)

It could be done, yes; I will probably add it soon. When you pass 1000
targets, 1000 goroutines will run, each with at least its target's
.lock file open, waiting for a free job slot token to arrive. A target
that waits for its children (dependencies) to finish holds open at least
the file with the .dep metainformation and a stderr pipe. So yes, the
number of jobs does not limit the number of open files.

-- 
Sergey Matveev (http://www.stargrave.org/)
OpenPGP: CF60 E89A 5923 1E76 E263  6422 AE1A 8109 E498 57EF
