Opened 9 years ago
Closed 7 years ago
#12319 closed bug (fixed)
Time/NTP sometimes crashes in gethostbyname()
Reported by: | ttcoder | Owned by: | nobody |
---|---|---|---|
Priority: | normal | Milestone: | R1 |
Component: | Network & Internet | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Platform: | All |
Description
This is hrev49523
We load Time as an add-on into CC's address space to use the Network Time Protocol syncing feature within it, about every 3 minutes (basically after each transition from one song to the next). Seems to work great initially.
However I Just saw this for the first time after doing an extended 12+ hr test run since the past night (and we had reports hinting at this for a few weeks in older hrevs).
All splitviews in Debugger were collapsed and 'locked' (I thought I had deleted the settings in config/settings/Debugger but will try again before filing a ticket for that) but retrieving the info from syslog was easy:
KERN: vm_soft_fault: va 0x719a0000 not covered by area in address space KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x719a0844, ip 0x143481c, write 0, user 1, thread 0x4562 KERN: vm_page_fault: thread "(add-on)ntp_update_time" (17762) in team "CommandCenter" (602) tried to read address 0x719a0844, ip 0x143481c ("libnetwork.so_seg0ro" +0x1781c) KERN: debug_server: Thread 17762 entered the debugger: Segment violation KERN: stack trace, current PC 0x143481c nsdispatch + 0xe4: KERN: (0x71945ff8) 0x142f811 gethostbyname_internal + 0x305 KERN: (0x71946048) 0x142f4a9 gethostbyname + 0x79 KERN: (0x71946078) 0x5cf11c ntp_update_time__FPCcPPCcPl + 0x24 KERN: (0x71946188) 0x1903f04 thread_func__25NetworktimeprefletAsAddonPv + 0x290 KERN: (0x719462b8) 0x20d3c7f thread_entry + 0x23
FWIW -- might be just a coincidence, but it ran fine when unattended through the night but crashed within a few minutes of me checking the system this morning -- I checked if CC was running fine, launched a browser, accessed sites.
Will shortly provide some sample code to make it easier to reproduce this (maybe combined with some heavier network and DNS access usage, in case there is a concurrency component involved in the bug?)
Attachments (6)
Change History (22)
by , 9 years ago
Attachment: | class_NetworktimeprefletAsAddon.cpp added |
---|
comment:1 by , 9 years ago
In light of yak's analysis in the other ticket, it occurs to me that maybe gethostbyname() works on some sort of 'static' (team-wide common) data, so if a thread executing gethostbyname() is killed in the middle of it, it could leave it in an inconsistent state, and all further calls to gethostbyname() would fail or crash ?
Does that make sense, I'm not sure If so, for future reference, is there a "re-entrant" version of gethostbyname() (somewhat like localtime()
has a localtime_r()
re-entrant counterpart). ?
I'm asking because I tried two "torture tests" right now:
- first worked with Haiku's ntp.cpp, just adding a main() that invokes it synchronously in an infinite loop, while slowing down my LAN with misc trafic. Could not make it crash.
- then I modified it to invoke a-synchronously (i.e. in a separate thread) and kill said thread almost immediately, and indeed gethostbyname() is hosed quickly:
KERN: debug_server: Thread 7921 entered the debugger: Segment violation KERN: vm_soft_fault: va 0x0 not covered by area in address space KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xa, ip 0x80136e83, write 0, user 0, thread 0x1ef3 KERN: vm_soft_fault: va 0x0 not covered by area in address space KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xa, ip 0x80136e83, write 0, user 0, thread 0x1ef3 KERN: vm_soft_fault: va 0x0 not covered by area in address space KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x4, ip 0x80136e83, write 0, user 0, thread 0x1ef3 KERN: vm_soft_fault: va 0x0 not covered by area in address space KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xa, ip 0x80136e83, write 0, user 0, thread 0x1ef3 KERN: vm_soft_fault: va 0x0 not covered by area in address space KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xa, ip 0x80136e83, write 0, user 0, thread 0x1ef3 KERN: vm_soft_fault: va 0x2e323000 not covered by area in address space KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x2e323931, ip 0x80136e83, write 0, user 0, thread 0x1ef3 KERN: vm_soft_fault: va 0x0 not covered by area in address space KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xa, ip 0x80136e83, write 0, user 0, thread 0x1ef3 KERN: vm_soft_fault: va 0x0 not covered by area in address space (..etc..)
- If I increase the delay from 50000 µs to 100000 µs I get the exact same stack crawl as at the top of this ticket.
Anyway I'll obviously re-organize the code which is deffective due to that unprotected kill_thread call, but I'll only feel 100% reassured if I also can use a re-entrant version of gethostbyname :-)
EDIT: I've hit duckduckgo a bit and the function is dubbed "obsolete" but I'm not sure what bearing that has on the re-entrancy discussion Here. More interesting, I also found this:
gethostbyname() .... uses static storage that is re-used in each call, making these functions unsafe for use in multithreaded applications.
but again I'm not sure if they're just saying to not call it from concurrent threads at the same time, or if that also implies that the function is not state-neutral, i.e. you can't call it twice in sequence if it was interrupted before completing the first time.
comment:2 by , 9 years ago
Killing the thread is not a good idea anyway, even if the function is reentrant - it could leak memory and leave sockets open, etc.
getaddrinfo
should probably be used instead of gethostbyname
.
comment:3 by , 9 years ago
Milestone: | R1/beta1 → R1 |
---|
by , 9 years ago
Attachment: | ntp (without deprecated calls).cpp added |
---|
Modified to use BNetworkAddressResolver: we now pass all the "torture tests"..!
comment:4 by , 9 years ago
Modified the code to do away with the obsolete gethostbyname()
and the code is now 100% solid in my testing, even if using ridiculously short delays.
Might not be all that interesting for the Haiku project, but I posted the above anyway because ticket surfers might find the code of interest to migrate their old apps to be re-entrant I suppose. Beware though -- It's certainly been a bit of a learning experience for me doing this twenty-liner, been 10+ years since I last had to look up man pages for sockets and sockadd and sockaddr_in and what not, probably missed something! The code works for me though.
comment:5 by , 8 years ago
If you could actually provide a patch, I'd be happy to apply it :-)
Two things, though:
- You should never use
sizeof(address.SockAddr())
, but insteadaddress.Length()
. The former will always return the length of struct sockaddr, which might not be adequate for anything but IPv4. - There is no need to use an BNetworkAddressResolver; you can just use the BNetworkAddress constructor that takes the same arguments directly.
comment:6 by , 8 years ago
Updated the ntp.cpp file to use the getaddrinfo function instead of the obsolete gethostbyname function.
Other changes include support for ipv6 and required parameter changes and setup for "sendto" and "recvfrom" in ntp.cpp.
Performed a build with the modified ntp.cpp file and everything seemed to worked fine. Also inspected the functionality using the graphical debugger.
Kindly review the patch.
by , 8 years ago
Attachment: | 0001-updated-ntp-to-use-getaddrinfo.patch added |
---|
Updated ntp to use getaddrinfo
comment:7 by , 8 years ago
patch: | 0 → 1 |
---|
comment:8 by , 8 years ago
Thanks for your contribution! I haven't actually tested your code, but it does look okay and functioning.
I have a few remarks, though:
- Please try to follow our coding style closely. A good start would be trying hard to not make your changes stand out, but it would be even more appreciated to give the guide itself a read :-)
- Please consider using the
BNetworAddressResolver
[1] class, as in the example from ttcoder 19 months ago. It has a much nicer API, and would allow you to avoid those stringify macros completely. It's also much less likely to leak any memory with it, although your patch seems to deal with that appropriately.
[1] http://cgit.haiku-os.org/haiku/tree/headers/os/net/NetworkAddressResolver.h
comment:9 by , 8 years ago
I went through the coding style guidelines, made appropriate style changes and tested it against the "checkstyle.py" script. The code now adheres to the guidelines.
I see the advantages of using BNetworkAddressResolver. I have attached another patch which uses BNetworkAddressResolver. However it would be lovely if you could test my original patch as well since it makes me feel more original with my patch and gives me a greater sense of contribution.
Thanks for the remarks.
by , 8 years ago
Attachment: | 0001-update-ntp-to-used-BNetworkAddressResolver.patch added |
---|
updated ntp to use BNetworkAddressResolver
comment:10 by , 8 years ago
The patch is indeed much smaller. However, you also removed one change of your original patch that I liked: iterating over all results, and connect to the first that fits. Your new code no longer does that; it will only test the first result it gets from the resolver.
There are a couple of style violations left that I will point out here for your convenience:
- NetworkAddress.h and NetworkAddressResolver.h are public headers, as such, they should be ordered differently: we have the header counterpart of the sources first (to ensure it's complete and does not rely on any other headers), then from most general to most specific: first c/c++ headers, then POSIX, then Haiku, then application headers. Between each section is a blank line.
- Line 123: No space after '(', operator (
||
) goes to the beginning of the next line, not the end of the line. - Line 153: Indentation seems to be wrong; the '0' from the arguments list within the if clause needs to be indented one tab more than the block.
Thanks for the quick update!
comment:12 by , 8 years ago
I was attending to my college exams for the past week. You can expect an update within 2-3 days.
by , 8 years ago
Attachment: | 0001-updated-ntp-to-use-BNetworkAddressResolver.patch added |
---|
used the high level BNetworkAddressResolver class + itearting over all the available addresses until a successful request has been made
comment:13 by , 8 years ago
In the jamfile, I think you want to link to libnetwork rather than libnet.
The error handling seems a bit off. If resolver.InitCheck is not ok, the error should be something like "could not resolve server address", not "could not contact server".
If resolving all server addresses fail, I think your code will end up in an infinite loop: resolver.GetNextAddress will not return B_OK, and you will call it over and over again. I think using a while (resolver.GetNextAddress(&cookie, address) == B_OK)
would be less error prone.
After the while, you are checking the error state of the resolver, but not that of actually writing to the server. I suggest storing the result of sendto in a bool:
if (sendto(...) != -1) { success = true; break; }
Then after the while loop:
if (!success) { // error handling goes here }
comment:14 by , 8 years ago
Tried linking with libnetwork (used network instead of bnetapi) but ran into compilation errors. Taken you suggestions and made another patch although I feel my code wouldn't have ended up in a situation for an infinite loop - I checked the condition when next address didn't return B_OK. But the new patch is anyways neater.
by , 8 years ago
Attachment: | 0001-fixed-ntp.cpp.patch added |
---|
comment:15 by , 8 years ago
I didn't see the endless loop either, but I like the current version better in any case. Pointing out some minor style issues:
- There is always a space between
while
and '(' (line 150). - Same as for
if
and '(' (line 158). - We do like a newline at the end of the file so please don't remove that.
In general, always try to copy the style you find in the file you're working on, when you're not sure about the coding style being used. It's always best to read through the coding guidelines, though :-)
Thanks for the update!
comment:16 by , 7 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Applied (with some coding style fixes) in hrev51390. Thanks!
One way to invoke Time as described