Opened 3 years ago

Closed 3 years ago

#13002 closed enhancement (fixed)

Parse URLs Without Requiring Group-Capable Regex

Reported by: apl-haiku Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: Network & Internet Version: R1/Development
Keywords: URL BUrl regex Mac Cc:
Blocked By: Blocking:
Has a Patch: yes Platform: All

Description

See #12952 for background.

At present, the BUrl class requires a "group capable" regex engine to be linked in order that it is able to parse URLs. This is because it uses the groups in the regex to pull out the components. This causes problems on some systems (eg; Mac) where the built-in regex library does not offer these features.

This patch re-implements the URL parsing so that it no longer uses regex pattern matching to extract the parts. Hopefully this will, in turn, mean that the Mac build no longer requires the GNU regex library. At least it will be a step in that direction.

As part of this work, I also noted that the parsing of URLs with IPv6 hosts was not working so I have also re-implemented the "authority" parse algorithm such that it is able to cope with IPv6 URLs.

I also have also implemented a number of unit tests at the same time.

Attachments (1)

0001-Url-implement-same-URL-parsing-logic-in-C-C-code-wit.patch (22.4 KB) - added by apl-haiku 3 years ago.

Download all attachments as: .zip

Change History (6)

comment:1 Changed 3 years ago by apl-haiku

Has a Patch: set

comment:2 Changed 3 years ago by korli

Looks good.

comment:3 Changed 3 years ago by pulkomandy

I'm not sure we want to go this way, for several reasons.

1) Parsing URL is very tricky. The regexp as given by the RFC is the reference and we should try to stick with it. 2) I don't think it is a good idea to work around problems in other operating systems. This code is first meant to be used by our native system (Haiku), and if there are problems with other OSes, these should be fixed in the build-compatibility headers, not by rewriting code all over the place to fit the limitations. 3) It leads to a lot more code to test and debug. 4) There was code similar to this before I implemented the regexp parsing, and it was broken. 5) What is the impact on performance? Is it faster or slower than the regexp way?

comment:4 Changed 3 years ago by apl-haiku

1) Parsing URL is very tricky. The regexp as given by the RFC is the reference and we should try to stick with it.

I agree that parsing URLs is not easy and (without using a specific regex engine) I have used the regex as a guide in this change.

2) I don't think it is a good idea to work around problems in other operating systems.

This is true in that the larger project is not about portability. However reducing the complexity of the cross-compile 'touch point' with other operating systems is a benefit.

3) It leads to a lot more code to test and debug.

This is true, but I have also added a number of additional unit tests. These tests should hopefully provide some level of safety. If there are some more problems then additional test-cases can be added.

4) There was code similar to this before I implemented the regexp parsing, and it was broken.

This is true and yes any change is undeniably a risk. I notice that your change was, in part, triggered by data URLs and so I have now added a unit test for this (passed OK) and updated the patch on this ticket.

5) What is the impact on performance? Is it faster or slower than the regexp way?

This is a good question which I had not considered. To check this I wrote a small program which parses a URL 1 million times containing the major components first creating and then destroying a BUrl object. The URL is;

` http://loop:pea@www.something.co.nz:8888/some/path?key=value#frag `

For the old implementation this takes circa 31s user-time and on the new implementation circa ~8s user-time. So the new implementation in this ticket is approximately 3x faster. This is on a VM running on a ~2009 mac laptop.


In any case; over to you for a decision. The updated patch is also on the ticket if somebody wants to take it up later.

comment:5 Changed 3 years ago by pulkomandy

Resolution: fixed
Status: newclosed

Applied in hrev50637.

Note: See TracTickets for help on using tickets.