Greetings!

*** John Goerzen [2021-08-18 08:56]:
>So while looking into the question of "how could I have the quickest
>delivery and execution of packets between machines on a LAN"

I am sure that if we are dealing with <=1Gbps Ethernet, then the main
bottleneck is the network itself and the TCP-related algorithms. If we
deal with >=10Gbps links, and especially high-latency ones, then TCP is
the thing you will likely have to tune. That is why I played with
various protocols like UDT, Tsunami, QUIC and some others I do not
remember now. It is better to possibly lose some traffic because of
congestion and to send more data overall than necessary, but deliver
the whole packet as fast as possible. For example, flush it with a
"tsunami" of UDP packets and then resend the lost chunks (and the new
MTH hash algorithm allows immediate integrity checking too). I did not
dive deeply into all of that, but with an ordinary 1Gbps Ethernet
adapter and a short home network all of that stays behind ordinary TCP.
Possibly fine TCP tuning will always be enough for NNCP.
https://en.wikipedia.org/wiki/TCP_congestion_avoidance_algorithm

>2) nncp-caller seems to be doing frequent calls to nanosleep, futex,
>clock_gettime, and epoll while it has a connection to a remote.

Yeah, that is what the Go runtime uses for the goroutines running for
an established session. And many goroutines in NNCP are in an endless
loop with a sleep, constantly checking whether there is anything new in
the spool directory.

>The broad question is: what is the most efficient way to do fast data
>exchange? (Efficient in terms of both SSD life and battery life on a
>laptop)

For me, the first thing about efficiency is dealing with the network.
Transport protocol: currently just ordinary TCP, with the administrator
tuning it for the necessary purposes. And the application protocol atop
of it: NNCP's SP, which can aggregate multiple SP-packets in a single
TCP segment. In theory. In practice that is done during the handshake,
but afterwards each event about a newly appeared packet is sent
immediately, to notify the remote side as quickly as possible. And the
Noise_IK pattern is used because of its reduced number of round-trips,
compared to Noise_XK, which hides identity.

Then comes CPU and memory. I assume that battery life depends mainly on
CPU. The cryptographic algorithms used in NNCP are among the fastest
ones: ChaCha20-Poly1305 and BLAKE3. AES-GCM with hardware acceleration
could be faster (and less CPU hungry), but that would complicate the SP
protocol with algorithm negotiation, which I won't do. But neither the
ChaCha20-Poly1305 nor the BLAKE3 implementations use multiple CPUs now.
Multiple connections are parallelized, because they work in multiple
independent goroutines.

SSD life depends on disk activity. Because I mainly use hard drives
everywhere, I tend to minimize and serialize all disk operations.
Obviously :-). Of course the most optimal way would be to transparently
receive data, checksum it, decipher, authenticate and write only the
deciphered/processed payload to the disk. But because of the
reliability requirement we have to save the encrypted packet, do
various fsync calls, and only after that begin its processing, with
more fsyncs (a rough sketch of that write path is below). Performance
and reliability guarantees are opposites. Turning off fsync (zfs set
sync=disabled, mount -o nosync), atime and .hdr files will of course
speed NNCP up.

Constant rereading of the spool directory, stat-ing files in it,
locking -- generally that won't create any real I/O operations on the
disk, because of filesystem caching. And of course it won't wear out
SSDs, because these are read operations. But it does consume CPU,
indeed. Instead of constantly rereading the directory contents,
software can use frameworks like kqueue and inotify, which explicitly
and immediately notify about changes, without the need for an endless
expensive loop with a sleep. But all of that is OS-specific, which is
why I am not looking in that direction. I am not against that kind of
optimization, I just have not seen those loops eating enough CPU to
worry about. But they are not free of course -- any kind of syscall is
relatively expensive. There are many places where NNCP can be
optimized, especially in the SP-related code, to do fewer loops with
sleeps and syscalls. Especially with OS-specific things like
kqueue/epoll event notification.
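
For illustration, here is a rough sketch (not NNCP's actual code) of
the difference between such an endless poll-with-sleep loop and a
notification-based watcher; the spool path and the third-party
github.com/fsnotify/fsnotify wrapper are just assumptions made for the
example:

    // A rough sketch, NOT NNCP's actual code: the current
    // poll-with-sleep approach versus kqueue/inotify-style
    // notification (via the third-party fsnotify wrapper).
    package main

    import (
        "log"
        "os"
        "time"

        "github.com/fsnotify/fsnotify"
    )

    // pollSpool is roughly what the endless loops do now: wake up,
    // reread the directory, sleep. Cheap I/O thanks to the page
    // cache, but constant syscalls and CPU wakeups.
    func pollSpool(dir string, every time.Duration) {
        for {
            entries, err := os.ReadDir(dir)
            if err == nil && len(entries) > 0 {
                log.Printf("%s: %d entries", dir, len(entries))
            }
            time.Sleep(every)
        }
    }

    // watchSpool sleeps inside the kernel until something actually
    // changes: no periodic wakeups at all, but OS-specific machinery.
    func watchSpool(dir string) error {
        w, err := fsnotify.NewWatcher()
        if err != nil {
            return err
        }
        defer w.Close()
        if err := w.Add(dir); err != nil {
            return err
        }
        for ev := range w.Events {
            if ev.Op&fsnotify.Create != 0 {
                log.Println("new spool entry:", ev.Name)
            }
        }
        return nil
    }

    func main() {
        spool := "/var/spool/nncp/example" // made-up path
        go pollSpool(spool, 5*time.Second)
        if err := watchSpool(spool); err != nil {
            log.Fatal(err)
        }
    }
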
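And the reliability-first write path mentioned above boils down to
roughly the following: write the encrypted packet to a temporary file,
fsync it, atomically rename it into place, fsync the directory. Again
this is only an illustrative sketch with made-up names, not what NNCP
actually runs:

    package main

    import (
        "log"
        "os"
        "path/filepath"
    )

    // reliableWrite stores a just-received encrypted packet durably
    // before any processing of it may start.
    func reliableWrite(spoolDir, name string, data []byte) error {
        tmp, err := os.CreateTemp(spoolDir, "part")
        if err != nil {
            return err
        }
        defer os.Remove(tmp.Name()) // no-op once the rename succeeds
        if _, err := tmp.Write(data); err != nil {
            tmp.Close()
            return err
        }
        if err := tmp.Sync(); err != nil { // fsync #1: the data itself
            tmp.Close()
            return err
        }
        if err := tmp.Close(); err != nil {
            return err
        }
        // Atomically give the packet its final name in the spool.
        if err := os.Rename(tmp.Name(), filepath.Join(spoolDir, name)); err != nil {
            return err
        }
        dir, err := os.Open(spoolDir)
        if err != nil {
            return err
        }
        defer dir.Close()
        return dir.Sync() // fsync #2: the directory entry itself
    }

    func main() {
        if err := reliableWrite(os.TempDir(), "pkt.example", []byte("payload")); err != nil {
            log.Fatal(err)
        }
    }
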
>I have been using persistent connections (very high onlinedeadline and
>maxonlinetime) with nncp-caller, even when that's not strictly necessary,
>reasoning that there is no particular overhead for establishing a new
>connection periodically and all the logging associated with that. However,
>if nncp-caller is using CPU time/battery power to maintain that, then
>perhaps I'm a bit off there. (Though it does seem to be negligible)

NNCP sends PING packets from time to time and runs various goroutines
that check whether anything new has appeared in the spool directories.
We should do benchmarks of course, but session establishment is several
TCP/SP round-trips, with asymmetric cryptography involved (which is
*very* expensive from the CPU point of view: 0.5-1M CPU cycles), and
with the first handshake packets padded to their maximal size of ~64KB.
So a handshake should be very expensive (traffic, delays, CPU) compared
to a long-lived session.

>The bigger question is around tossing. Does autotoss do something more
>restrictive than nncp-toss (perhaps only toss from a particular machine)?

Yes, it runs the tosser only for the node we have got a connection
with.

>Is there a way, since autotoss is in-process with nncp-caller, to only
>trigger the toss algorithm when a new packet has been received, rather than
>by cycle interval?

Can be done. Should be done :-). The current autotosser runs literally
the same toss functions as nncp-toss.

>One other concern about a very short cycle interval is that a failing packet
>can cause a large number of log entries.

I remember about that issue and about the whole problem of
(non-existent) error processing. I simply have had no time to think
about it yet. And I won't start thinking about it in the nearest weeks
either... various other things in real life I have to finish :-)

>A final question about when-tx-exists being true. I am a bit unclear how
>that interacts with cron. Is it:
>2) Calls are made only when cron says to, but only if an outgoing packet
>exists. (when-tx-exists causes FEWER calls than cron alone)
>I'm guessing it's #2 but I'm not certain.

Yes, exactly as you wrote here. when-tx-exists just tells us, every
time we are about to make a call, to check whether there really exists
any outgoing packet (with the specified niceness).

--
Sergey Matveev (http://www.stargrave.org/)
OpenPGP: CF60 E89A 5923 1E76 E263 6422 AE1A 8109 E498 57EF