Opened 5 years ago

Closed 4 years ago

#15853 closed bug (fixed)

[pkgman] Failed to download package cmake: Operation would block

Reported by: diver Owned by: pulkomandy
Priority: high Milestone: R1/beta2
Component: Kits/Package Kit Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

hrev54023 x86_64.

Resizing Terminal window while pkgman is downloading packages results in Operation would block error.

Change History (32)

comment:1 by Coldfirex, 5 years ago

This seems strangely familiar. I thought we had a similar ticket in the last couple of months that was fixed, but couldn't find it.

comment:2 by pulkomandy, 5 years ago

Yes, this is supposed to be already fixed in hrev53812.

Last edited 5 years ago by pulkomandy (previous) (diff)

comment:3 by diver, 5 years ago

I now started getting "Operation would block" even without resizing Terminal.

pkgman list showed this url which was 404: https://eu.hpkg.haiku-os.org/haikuports/master/x86_64/current

Fixed it with: pkgman add https://eu.hpkg.haiku-os.org/haikuports/master/repository/x86_64/current/

Has some redirect stopped working or something? Could be that Operation would block would thrown if pkgman can't connect due to 404? Sounds like a wrong error code.

comment:4 by waddlesplash, 5 years ago

There was an error code change in the version of OpenSSL 1.1 we are on that broke some applications which may be related. They reverted it in the newer patches of OpenSSL, so we should update.

comment:5 by diver, 5 years ago

A few users on Telegram started seeing this error. Should we move it to beta2?

comment:6 by pulkomandy, 5 years ago

Milestone: UnscheduledR1/beta2
Priority: normalhigh

Yes, definitely.

comment:7 by pulkomandy, 5 years ago

I have added a recipe for OpenSSL 1.1.1g. Let's see if it helps. If it does, we need to re-run the hardlink packages script, I guess, so that it ships with the beta?

comment:8 by waddlesplash, 5 years ago

I'm going to do that anyway after Humdinger pushes new recipes for applications following the translations freeze.

comment:9 by nielx, 5 years ago

Beta 2 status meeting: this issue has a potential patch that can be applied to the beta2 branch.

comment:10 by pulkomandy, 5 years ago

Can we do this update soon? It would be nice to make sure OpenSSL fixes indeed solve this, or if further changes on our side would also be needed.

comment:11 by waddlesplash, 5 years ago

Why do we need to do the update to test? OpenSSL is not statically linked, just update your system package and test it.

comment:12 by pulkomandy, 5 years ago

Indeed. Let me rephrase: can anyone reproduce this with an updated install? (not a stock one that would still include the unpatched openssl)

comment:13 by diver, 5 years ago

Just happened to me with haiku-master-hrev54185-x86_gcc2h-anyboot.iso with openssl_1.1.1e-1_x86_gcc2.hpkg on Terminal window resize during pkgman install qbittorrent.

comment:14 by pulkomandy, 5 years ago

Yes, which is why I asked to test with an updated install. openssl_1.1.1g is what we want to test, but the default image is currently not shipped with it.

comment:15 by diver, 5 years ago

So far I can't reproduce it. However Just now in hrev54215 I got this:

~> pkgman full
100% repochecksum-1 [65 bytes]
Validating checksum for Haiku...done.
Refreshing repository "Haiku" failed  0%: Interrupted system call
100% repochecksum-1 [64 bytes]

I didn't resize the terminal and subsequent pkgman full succeeded.

comment:16 by diver, 5 years ago

Reproduced Operation would block with openssl-1.1.1g-1-x86_64.hpkg

comment:17 by ambroff, 4 years ago

So it looks like hrev53812 did address this, but then the (necessary) followup hrev53853 broke it again. When I reproduce this locally, sending SIGWINCH or any other signal interrupts the system call of SSL_read(), but it weirdly returns SSL_ERROR_WANT_READ. I expected it to return a more specific error in this case, something mapped to EAGAIN. But I guess that isn't what we get.

So after hrev53853 we map SSL_ERROR_WANT_READ to B_WOULD_BLOCK. But the caller, BHttpRequest::_MakeRequest(), is not a non-blocking I/O context and doesn't expect B_WOULD_BLOCK.

A bandaid fix would be to just add a special branch to the only call of fSocket->Read() like so

diff --git a/src/kits/network/libnetapi/HttpRequest.cpp b/src/kits/network/libnetapi/HttpRequest.cpp
index 2638e66888af..9b5b550e2451 100644
--- a/src/kits/network/libnetapi/HttpRequest.cpp
+++ b/src/kits/network/libnetapi/HttpRequest.cpp
@@ -588,8 +588,12 @@ BHttpRequest::_MakeRequest()
                        bytesRead = fSocket->Read(chunk, kHttpBufferSize);
 
                        if (bytesRead < 0) {
-                               readError = bytesRead;
-                               break;
+                               if (bytesRead == B_WOULD_BLOCK)
+                                       fSocket->WaitForReadable();
+                               else {
+                                       readError = bytesRead;
+                                       break;
+                               }
                        } else if (bytesRead == 0)
                                receiveEnd = true;

This works for me but it isn't a complete fix. For one, I think this could happen on the write path too, but there isn't any error checking happening for calls to fSocket->Write().

The other potential problem is that it seems SSL_read() could return SSL_ERROR_WANT_WRITE if the state machine of the TLS session is expecting you to write because no data will ever be available for read.

It also seems that SecureSocket should know whether the caller expects Read() or Write() to block. If it had a blocking I/O mode then we wouldn't need to touch all of the Read and Write call sites to make this work everywhere.

comment:18 by waddlesplash, 4 years ago

Is there no way to detect a purely interrupted syscall, and just retry that, and return EWOULDBLOCK syscalls to the caller? This seems like a common problem, what do other libraries that use OpenSSL do or expect?

comment:19 by waddlesplash, 4 years ago

It looks like "errno" may still contain, according to the OpenSSL manpages, the underlying syscall error. Maybe we can just test that here, if we get ERROR_WANT_...?

comment:20 by pulkomandy, 4 years ago

The ERROR_WANT_WRITE should not happen in our case because we configure OpenSSL in "auto retry" mode, if I understand things correctly (but it's quite possible that I don't).

waddlesplash, do you have a reference for this in the manpages? If this is indeed now documented, the correct thing to do would be to restart the syscall specifically in the case we get EINTR, and not in other error cases. But when I wrote this code I could not find any documentation that would guarantee this. I may have simply missed it.

comment:21 by waddlesplash, 4 years ago

Actually, it appears AUTO_RETRY makes BIO ... not retry: https://www.openssl.org/docs/man1.0.2/man3/BIO_get_ssl.html (see NOTES)

I found some answers on StackOverflow and elsewhere that OpenSSL does not taint "errno", but maybe that is no longer the case. Seems worth checking.

comment:22 by ambroff, 4 years ago

I don't fully understand what AUTO_RETRY mode is. You're right waddlesplash that page seems to imply that it does the opposite of what we want.

I tested commenting out the line that sets SSL_MODE_AUTO_RETRY on BSecureSocket::Private::sContext, and I also tried setting SSL_MODE_AUTO_RETRY on BSecureSocket::Private::fSSL. In every case I still get SSL_ERRRO_WANT_READ back when interrupting a read.

The good news is that errno is being set to EINTR.

comment:23 by ambroff, 4 years ago

I take it back, something else was setting errno. I made sure to clobber errno before calling SSL_read and now I can see that it isn't being set. EDIT: Revisiting this today, I see that this comment was incorrect because my test code was just wrong. errno does indeed get set to EINTR.

Last edited 4 years ago by ambroff (previous) (diff)

comment:24 by pulkomandy, 4 years ago

https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_mode.html is a bit more clear about auto_retry and also mention that it is the default in OpenSSL 1.1.1, so we can remove this from the context setup.

According to https://stackoverflow.com/questions/24188013/openssl-and-signals?rq=1 , we should be able to get the errno from get_last_socket_error(), and use BIO_should_retry() to know if we need to retry an operation.

comment:25 by ambroff, 4 years ago

I've been reading through examples in the openssl repo to get a better feel for the API. It looks like SSL_waiting_for_async() is a better signal for whether we should return B_WOULD_BLOCK.

https://www.openssl.org/docs/man1.1.0/man3/SSL_waiting_for_async.html

So we can check for EINTR in errno, but in my experiments relying on SSL_waiting_for_async seems to do the right thing.

comment:26 by ambroff, 4 years ago

Looking through the openssl code, it looks like the place where it defines whether or not BIO_should_retry is just this.

        switch (SSL_get_error(ssl, (int)ret)) {
        case SSL_ERROR_WANT_READ:
            BIO_set_flags(b, BIO_FLAGS_READ | BIO_FLAGS_SHOULD_RETRY);
            break;
        case SSL_ERROR_WANT_WRITE:
            BIO_set_flags(b, BIO_FLAGS_WRITE | BIO_FLAGS_SHOULD_RETRY);
            break;
        case SSL_ERROR_WANT_CONNECT:
            BIO_set_flags(b, BIO_FLAGS_IO_SPECIAL | BIO_FLAGS_SHOULD_RETRY);
            BIO_set_retry_reason(b, BIO_get_retry_reason(next));
            break;
        case SSL_ERROR_WANT_X509_LOOKUP:
            BIO_set_retry_special(b);
            BIO_set_retry_reason(b, BIO_RR_SSL_X509_LOOKUP);
            break;
        default:
            break;
        }
        break;

It doesn't appear that we are using SSL_MODE_ASYNC. If we did, then I think we'd want to use SSL_waiting_for_async() to determine whether to return B_WOULD_BLOCK instead of retrying.

I think just reverting hrev53853 gets us back into a working state, and if we decide to support SSL_MODE_ASYNC then we would just change this retry loop to use SSL_waiting_for_async and BIO_should_retry together.

comment:27 by waddlesplash, 4 years ago

Doesn't that break Vision, which does use non-blocking sockets here?

I can't remember why I made it use non-blocking sockets (or maybe it already did) but at least that needs to be preserved (or fixed).

If EINTR is set in errno, then we may just want to rely on that instead of SSL_waiting_for_async.

comment:28 by ambroff, 4 years ago

Oh I didn't know about the Vision problem. I'll look at that code to try to understand what's going on.

comment:29 by waddlesplash, 4 years ago

Vision requiring asynchronous communications probably occurred here: https://github.com/HaikuArchives/Vision/commit/2343f4eadb453bfd5bf7e978e456b586f1fb4edb

That commit may just be wrong, I have not revisited it since I wrote it; but at least it's worked until now.

comment:30 by ambroff, 4 years ago

OK cool, thanks for the heads up. I've been reading through the code. This is a nice test case.

At least this seems to work as a hotfix. https://github.com/ambroff/haiku/commit/bf182a76186aab0df4f7959f2f1bc7a6749aa077

It will take me some time to fully understand how openssl is meant to be used. I've never used it directly before this.

comment:31 by waddlesplash, 4 years ago

I have only minimal experience with OpenSSL, but at least that fix looks good to me.

comment:32 by waddlesplash, 4 years ago

Resolution: fixed
Status: assignedclosed

Fix merged in hrev54273.

Note: See TracTickets for help on using tickets.