Opened 9 years ago

Closed 7 years ago

#12319 closed bug (fixed)

Time/NTP sometimes crashes in gethostbyname()

Reported by: ttcoder Owned by: nobody
Priority: normal Milestone: R1
Component: Network & Internet Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

This is hrev49523

We load Time as an add-on into CC's address space to use the Network Time Protocol syncing feature within it, about every 3 minutes (basically after each transition from one song to the next). Seems to work great initially.

However I Just saw this for the first time after doing an extended 12+ hr test run since the past night (and we had reports hinting at this for a few weeks in older hrevs).

All splitviews in Debugger were collapsed and 'locked' (I thought I had deleted the settings in config/settings/Debugger but will try again before filing a ticket for that) but retrieving the info from syslog was easy:

KERN: vm_soft_fault: va 0x719a0000 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x719a0844, ip 0x143481c, write 0, user 1, thread 0x4562
KERN: vm_page_fault: thread "(add-on)ntp_update_time" (17762) in team "CommandCenter" (602) tried to read address 0x719a0844, ip 0x143481c ("libnetwork.so_seg0ro" +0x1781c)
KERN: debug_server: Thread 17762 entered the debugger: Segment violation
KERN: stack trace, current PC 0x143481c  nsdispatch + 0xe4:
KERN:   (0x71945ff8)  0x142f811  gethostbyname_internal + 0x305
KERN:   (0x71946048)  0x142f4a9  gethostbyname + 0x79
KERN:   (0x71946078)  0x5cf11c  ntp_update_time__FPCcPPCcPl + 0x24
KERN:   (0x71946188)  0x1903f04  thread_func__25NetworktimeprefletAsAddonPv + 0x290
KERN:   (0x719462b8)  0x20d3c7f  thread_entry + 0x23

FWIW -- might be just a coincidence, but it ran fine when unattended through the night but crashed within a few minutes of me checking the system this morning -- I checked if CC was running fine, launched a browser, accessed sites.

Will shortly provide some sample code to make it easier to reproduce this (maybe combined with some heavier network and DNS access usage, in case there is a concurrency component involved in the bug?)

Attachments (6)

class_NetworktimeprefletAsAddon.cpp (3.9 KB ) - added by ttcoder 9 years ago.
One way to invoke Time as described
ntp (without deprecated calls).cpp (5.8 KB ) - added by ttcoder 9 years ago.
Modified to use BNetworkAddressResolver: we now pass all the "torture tests"..!
0001-updated-ntp-to-use-getaddrinfo.patch (4.1 KB ) - added by a-star 7 years ago.
Updated ntp to use getaddrinfo
0001-update-ntp-to-used-BNetworkAddressResolver.patch (2.3 KB ) - added by a-star 7 years ago.
updated ntp to use BNetworkAddressResolver
0001-updated-ntp-to-use-BNetworkAddressResolver.patch (3.5 KB ) - added by a-star 7 years ago.
used the high level BNetworkAddressResolver class + itearting over all the available addresses until a successful request has been made
0001-fixed-ntp.cpp.patch (3.6 KB ) - added by a-star 7 years ago.

Download all attachments as: .zip

Change History (22)

by ttcoder, 9 years ago

One way to invoke Time as described

comment:1 by ttcoder, 9 years ago

In light of yak's analysis in the other ticket, it occurs to me that maybe gethostbyname() works on some sort of 'static' (team-wide common) data, so if a thread executing gethostbyname() is killed in the middle of it, it could leave it in an inconsistent state, and all further calls to gethostbyname() would fail or crash ?

Does that make sense, I'm not sure If so, for future reference, is there a "re-entrant" version of gethostbyname() (somewhat like localtime() has a localtime_r() re-entrant counterpart). ?

I'm asking because I tried two "torture tests" right now:

  • first worked with Haiku's ntp.cpp, just adding a main() that invokes it synchronously in an infinite loop, while slowing down my LAN with misc trafic. Could not make it crash.
  • then I modified it to invoke a-synchronously (i.e. in a separate thread) and kill said thread almost immediately, and indeed gethostbyname() is hosed quickly:
KERN: debug_server: Thread 7921 entered the debugger: Segment violation
KERN: vm_soft_fault: va 0x0 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xa, ip 0x80136e83, write 0, user 0, thread 0x1ef3
KERN: vm_soft_fault: va 0x0 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xa, ip 0x80136e83, write 0, user 0, thread 0x1ef3
KERN: vm_soft_fault: va 0x0 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x4, ip 0x80136e83, write 0, user 0, thread 0x1ef3
KERN: vm_soft_fault: va 0x0 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xa, ip 0x80136e83, write 0, user 0, thread 0x1ef3
KERN: vm_soft_fault: va 0x0 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xa, ip 0x80136e83, write 0, user 0, thread 0x1ef3
KERN: vm_soft_fault: va 0x2e323000 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x2e323931, ip 0x80136e83, write 0, user 0, thread 0x1ef3
KERN: vm_soft_fault: va 0x0 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xa, ip 0x80136e83, write 0, user 0, thread 0x1ef3
KERN: vm_soft_fault: va 0x0 not covered by area in address space
  (..etc..)
  • If I increase the delay from 50000 µs to 100000 µs I get the exact same stack crawl as at the top of this ticket.

Anyway I'll obviously re-organize the code which is deffective due to that unprotected kill_thread call, but I'll only feel 100% reassured if I also can use a re-entrant version of gethostbyname :-)

EDIT: I've hit duckduckgo a bit and the function is dubbed "obsolete" but I'm not sure what bearing that has on the re-entrancy discussion Here. More interesting, I also found this:

gethostbyname() .... uses static storage that is re-used in each call, making these functions unsafe for use in multithreaded applications.

but again I'm not sure if they're just saying to not call it from concurrent threads at the same time, or if that also implies that the function is not state-neutral, i.e. you can't call it twice in sequence if it was interrupted before completing the first time.

Version 1, edited 9 years ago by ttcoder (previous) (next) (diff)

comment:2 by hamish, 9 years ago

Killing the thread is not a good idea anyway, even if the function is reentrant - it could leak memory and leave sockets open, etc.

getaddrinfo should probably be used instead of gethostbyname.

Last edited 9 years ago by hamish (previous) (diff)

comment:3 by diver, 9 years ago

Milestone: R1/beta1R1

by ttcoder, 9 years ago

Modified to use BNetworkAddressResolver: we now pass all the "torture tests"..!

comment:4 by ttcoder, 9 years ago

Modified the code to do away with the obsolete gethostbyname() and the code is now 100% solid in my testing, even if using ridiculously short delays.

Might not be all that interesting for the Haiku project, but I posted the above anyway because ticket surfers might find the code of interest to migrate their old apps to be re-entrant I suppose. Beware though -- It's certainly been a bit of a learning experience for me doing this twenty-liner, been 10+ years since I last had to look up man pages for sockets and sockadd and sockaddr_in and what not, probably missed something! The code works for me though.

comment:5 by axeld, 7 years ago

If you could actually provide a patch, I'd be happy to apply it :-)

Two things, though:

  • You should never use sizeof(address.SockAddr()), but instead address.Length(). The former will always return the length of struct sockaddr, which might not be adequate for anything but IPv4.
  • There is no need to use an BNetworkAddressResolver; you can just use the BNetworkAddress constructor that takes the same arguments directly.
Last edited 7 years ago by axeld (previous) (diff)

comment:6 by a-star, 7 years ago

Updated the ntp.cpp file to use the getaddrinfo function instead of the obsolete gethostbyname function.

Other changes include support for ipv6 and required parameter changes and setup for "sendto" and "recvfrom" in ntp.cpp.

Performed a build with the modified ntp.cpp file and everything seemed to worked fine. Also inspected the functionality using the graphical debugger.

Kindly review the patch.

by a-star, 7 years ago

Updated ntp to use getaddrinfo

comment:7 by a-star, 7 years ago

patch: 01

comment:8 by axeld, 7 years ago

Thanks for your contribution! I haven't actually tested your code, but it does look okay and functioning.

I have a few remarks, though:

  • Please try to follow our coding style closely. A good start would be trying hard to not make your changes stand out, but it would be even more appreciated to give the guide itself a read :-)
  • Please consider using the BNetworAddressResolver [1] class, as in the example from ttcoder 19 months ago. It has a much nicer API, and would allow you to avoid those stringify macros completely. It's also much less likely to leak any memory with it, although your patch seems to deal with that appropriately.

[1] http://cgit.haiku-os.org/haiku/tree/headers/os/net/NetworkAddressResolver.h

Last edited 7 years ago by axeld (previous) (diff)

comment:9 by a-star, 7 years ago

I went through the coding style guidelines, made appropriate style changes and tested it against the "checkstyle.py" script. The code now adheres to the guidelines.

I see the advantages of using BNetworkAddressResolver. I have attached another patch which uses BNetworkAddressResolver. However it would be lovely if you could test my original patch as well since it makes me feel more original with my patch and gives me a greater sense of contribution.

Thanks for the remarks.

by a-star, 7 years ago

updated ntp to use BNetworkAddressResolver

comment:10 by axeld, 7 years ago

The patch is indeed much smaller. However, you also removed one change of your original patch that I liked: iterating over all results, and connect to the first that fits. Your new code no longer does that; it will only test the first result it gets from the resolver.

There are a couple of style violations left that I will point out here for your convenience:

  • NetworkAddress.h and NetworkAddressResolver.h are public headers, as such, they should be ordered differently: we have the header counterpart of the sources first (to ensure it's complete and does not rely on any other headers), then from most general to most specific: first c/c++ headers, then POSIX, then Haiku, then application headers. Between each section is a blank line.
  • Line 123: No space after '(', operator (||) goes to the beginning of the next line, not the end of the line.
  • Line 153: Indentation seems to be wrong; the '0' from the arguments list within the if clause needs to be indented one tab more than the block.

Thanks for the quick update!

comment:11 by korli, 7 years ago

Any chance for an update of the patch?

comment:12 by a-star, 7 years ago

I was attending to my college exams for the past week. You can expect an update within 2-3 days.

by a-star, 7 years ago

used the high level BNetworkAddressResolver class + itearting over all the available addresses until a successful request has been made

comment:13 by pulkomandy, 7 years ago

In the jamfile, I think you want to link to libnetwork rather than libnet.

The error handling seems a bit off. If resolver.InitCheck is not ok, the error should be something like "could not resolve server address", not "could not contact server".

If resolving all server addresses fail, I think your code will end up in an infinite loop: resolver.GetNextAddress will not return B_OK, and you will call it over and over again. I think using a while (resolver.GetNextAddress(&cookie, address) == B_OK) would be less error prone.

After the while, you are checking the error state of the resolver, but not that of actually writing to the server. I suggest storing the result of sendto in a bool:

if (sendto(...) != -1) {
    success = true;
    break;
}

Then after the while loop:

if (!success) {
    // error handling goes here
}

comment:14 by a-star, 7 years ago

Tried linking with libnetwork (used network instead of bnetapi) but ran into compilation errors. Taken you suggestions and made another patch although I feel my code wouldn't have ended up in a situation for an infinite loop - I checked the condition when next address didn't return B_OK. But the new patch is anyways neater.

by a-star, 7 years ago

Attachment: 0001-fixed-ntp.cpp.patch added

comment:15 by axeld, 7 years ago

I didn't see the endless loop either, but I like the current version better in any case. Pointing out some minor style issues:

  • There is always a space between while and '(' (line 150).
  • Same as for if and '(' (line 158).
  • We do like a newline at the end of the file so please don't remove that.

In general, always try to copy the style you find in the file you're working on, when you're not sure about the coding style being used. It's always best to read through the coding guidelines, though :-)

Thanks for the update!

comment:16 by waddlesplash, 7 years ago

Resolution: fixed
Status: newclosed

Applied (with some coding style fixes) in hrev51390. Thanks!

Note: See TracTickets for help on using tickets.