Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#12710 closed enhancement (fixed)

BPackageInfo; URL Validation Checks

Reported by: apl-haiku Owned by: bonefish
Priority: normal Milestone: Unscheduled
Component: Kits/Package Kit Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

When packages are created at the moment, there are some URLs involved in the package info. These URLs are not validated for being well-formed. As a result, malformed URLs have been processed by the HaikuDepotServer system as a result of parsing the HPKR data. The typical problems to date have been whitespace at the end of the URLs and malformed URL schemes. Better validation could be employed, but this at least will provide some basic checks.

Attachments (5)

Change History (44)

by apl-haiku, 8 years ago

comment:1 by axeld, 8 years ago

Looks mostly good, but you add some minor coding style issues:

  • No one ever seems to delete the UrlStringValidator (C++ doesn't have garbage collection).
  • PackageInfoParser.cpp, line 548: We don't use parenthesis around single line statements.
  • Line 928 seems to add some stray white space, line 932 might as well.
  • Line 1163ff.: all operators besides ',' always go to the next line, ie.:
    return url.StartsWith("http://")
    	|| url.StartsWith("https://")
    	|| url.StartsWith("ftp://")
    	|| url.StartsWith("file:/");
    

by apl-haiku, 8 years ago

comment:2 by apl-haiku, 8 years ago

patch: 01

comment:3 by apl-haiku, 8 years ago

Hi Axel; Thanks for reviewing that. Hopefully I have got those all fixed up in a second patch.

comment:4 by axeld, 8 years ago

Yes, it looks great now, thanks! However, I missed that you didn't use git format-patch to create the patch, sorry. Could you do it once more so that you get properly attributed for the patch, and I don't have to invent a commit message for it? :-)

If not in the near future, please tell, so that I can commit it anyway.

comment:5 by apl-haiku, 8 years ago

Hi Axel; that is added as a git format-patch now; should be the same.

comment:6 by axeld, 8 years ago

Thanks Andrew. I'm afraid there are more issues that I missed so far, though.

  • Still coding style issues: 80 column line limit is violated at several places, and stray white space is added to a few lines. I had fixed those issues, as I wanted to get over with it, but...
  • It doesn't compile using GCC 5 on Ubuntu (x86_64). Among the errors:
    • 1078/1082: jump to case label
    • 959: enumation value ... not handled
  • The usual format of our git patch line would read like this: BPackageInfo::Parser: Validate URL strings.

Also, please don't zip the patch next time for easy online viewing :-)

comment:7 by axeld, 8 years ago

Oh, and I missed that the operators/spacing are still wrong in the if-clause in line 1174ff.

comment:8 by apl-haiku, 8 years ago

Hi Axel; I'm first looking into the GCC 5 issue. I've setup a build environment with Xubuntu (x86_64) which has GCC 5.3.1. If I create a directory "generated" in the haiku source directory and invoke "../configure ..." – can you tell me the arguments to get the build process configured to build with the host compiler?

comment:9 by axeld, 8 years ago

If you build a Haiku image, some parts of the Haiku source will be compiled on the host platform in order to do that automatically; it's not like I want to annoy you :-)

So all you need to trigger the issue is to build a complete image, like jam -q @release-raw.

comment:10 by apl-haiku, 8 years ago

Hi Axel; Thanks for the clarification.

It seems then that the build with GCC 5.3.1 is working fine at my end. I can build an image without any compilation failures and can then test the change is working OK on the build host. I wonder what the difference would be?

I am now taking a look at the other issues. The issue at PackageInfoParser.cpp:1174 with the if statement;

  • I have moved the logical or (||) operators to the start of the lines
  • I cannot see an white-space issues there; can you please point out the issue

I can see a 80 column limit violation at PackageInfoParser.h:165. I checked the 'coding guidelines' page, but I cannot see any specific guidance. Maybe something like the following would work?

...
struct BPackageInfo::Parser::UrlStringValidator :
    public BPackageInfo::Parser::StringValidator {
...

I can also see one other 80 column violation at PackageInfoParser.cpp:1215 - no problem to fix. I am not sure where whitespace has been added. Are you able to provide line number references for any white-space issues?

Also fixed a couple of situations where bad pointer references were used; char *f instead of char* f.

comment:11 by axeld, 8 years ago

I've tested with GCC 5.2.1 using Ubuntu 15.10.

The white space issue at line 1174 is the extra spacing around the inner parentheses.

Also, when you do a git diff it shows extra white space (at the end of lines) with a red marker in its default configuration. In a reg-exp-search-capable text editor, you would search for \s$ to find it.

In your 80 column limit example, the : operator would also go to the next line, otherwise this is how I'd do that.

comment:12 by pulkomandy, 8 years ago

Your patch does not allow IDN domain names. For example, http://清华大学.cn is a valid URL.

I would suggest using the BUrl class to parse URLs, and improving said class if you find there are problems with the way it handle things.

comment:13 by apl-haiku, 8 years ago

This is correct, the logic will not handle those domains. The validation is also otherwise very lenient, but on the other hand, is enough to catch the class of problems I see coming through on the HaikuDepotServer system.

I did originally look at using BUrl to validate the domains because that would be much better. I didn't pursue that because the BUrl class is in "libnetapi" and I don't think that this library is available as a build-host artifact. This would then mean it were not possible to build the "package" tool for the build-host.

I'm a bit new to this code base so please do correct me if I am wrong.

comment:14 by pulkomandy, 8 years ago

Right, currently this is the case, but we could easily move BUrl to the support kit, or make parts of libnetapi available on the build host. BUrl does not include any network access, so it may make sense to have it outside libnetapi anyway?

comment:15 by apl-haiku, 8 years ago

OK great; being able to use BUrl instead would really be a better outcome long term.

The class could probably make sense living in either support or the network kits so perhaps to reduce disruption it would be better to make the "libnetapi" available on the build environment.

I see the build is controlled from ".../src/kits/network/libnetapi/Jamfile", but given my current level of Jam knowledge, that is going to be quite hard for me to work through and get right. Are you able to either modify it to generate a build-host variant with just "Url.cpp" or give me some pointers of other situations where something similar is being done?

comment:16 by pulkomandy, 8 years ago

The jam files for the host libraries are stored at http://cgit.haiku-os.org/haiku/tree/src/build/ . In particular, http://cgit.haiku-os.org/haiku/tree/src/build/libbe/support/Jamfile is for the support kit.

The simplest solution would be to add libnetapi to the SEARCH_SOURCE, then add BUrl to the sourcefile list. This will effectively put it in the support kit for the host libraries.

comment:17 by apl-haiku, 8 years ago

Thank you for the pointers to get me started. I'm currently trying to get a build of libbe_build.so with the BUrl in it. I have created the directory;

../src/build/libbe/network

In here I have added Jamfile...

SubDir HAIKU_TOP src build libbe network;

UseHeaders [ FDirName $(HAIKU_TOP) headers build ] : true ;
UseHeaders [ FDirName $(HAIKU_TOP) headers os net ] : true ;
UseHeaders [ FDirName $(HAIKU_TOP) headers private locale ] : true ;
UseHeaders [ FDirName $(HAIKU_TOP) headers private shared ] : true ;

UsePrivateBuildHeaders app interface shared network ;

USES_BE_API on <libbe_build>network_kit.o = true ;

SEARCH_SOURCE += [ FDirName $(HAIKU_TOP) src kits network libnetapi ] ;

# BUrl uses ICU to perform IDNA conversions (unicode domain names)
UseBuildFeatureHeaders icu ;
Includes [ FGristFiles Url.cpp ]
        : [ BuildFeatureAttribute icu : headers ] ;

BuildPlatformMergeObjectPIC <libbe_build>network_kit.o :
        Url.cpp
;

...and then following the pattern, have edited ../Jamfile to include this network/Jamfile. I am currently stuck because the build then has a circular reference and compilation fails...

...
...patience...
warning: libbe_build.so depends on itself
warning: libbe_build.so depends on itself
...found 3244 target(s)...
...updating 5 target(s)...
Link /Volumes/HaikuSource/haiku/generated/objects/darwin/lib/libpackage_build.so 
clang: error: no such file or directory: '/Volumes/HaikuSource/haiku/generated/objects/darwin/lib/libbe_build.so'
...

If I comment-out the Jamfile contents referring to ICU then the circular reference is gone, but then the build chain cannot find the ICU files;

/Volumes/HaikuSource/haiku/headers/private/locale/ICUWrapper.h:14:10: fatal error: 'unicode/bytestream.h' file not found

I found the section related to "UseBuildFeatureHeaders icu" in /build/jam/BuildFeatures, but the Jam-logic is not very clear to me. Can you give me any guidance on this?

comment:18 by pulkomandy, 8 years ago

UseBuildFeatureHeaders only changes the include paths, I'm not sure why it ends up in a circular dependency.

in reply to:  18 comment:19 by bonefish, 8 years ago

Replying to pulkomandy:

UseBuildFeatureHeaders only changes the include paths, I'm not sure why it ends up in a circular dependency.

The rule also sets up a dependency to the extracted ICU package for Haiku. For the extraction the package command is needed which in turn requires libbe_build. Hence the dependency cycle.

Either the build platform's ICU would need to be used -- I guess that would be a new build requirement -- or the respective code in Url.cpp needs to be guarded by an #ifdef and disabled for when building for the build platform.

comment:20 by pulkomandy, 8 years ago

The latter should be possible as ICU is used only to decode/encode IDNA domains to their "punycode" form. Could we live without these on the build platform or would that be a problem for your intended use?

comment:21 by apl-haiku, 8 years ago

Do you mean that BUrl::IDNAToAscii and BUrl::IDNAToUnicode would simply return B_ERROR all of the time in the build environment (eg; linux) artifacts?

A problem could arise where people are trying to assemble HPKRs on, for example, linux and they are using IDNA domain names, but maybe that problem can be addressed later because it is unlikely to arise short-term.

I will try to make that change when I get some time and see if I can get it to compile without ICU support.

comment:22 by pulkomandy, 8 years ago

Yes, this would be an option, or alternatively the methods could be completely removed. I don't know if the HPKR code would use these methods at all, it seems fine to leave the URL utf-8 encoded in the HPKR file.

As long as these methods aren't used, everything should be fine, and we only need to convert domain names when actually using the host part of the URL to connect to a server.

comment:23 by apl-haiku, 8 years ago

BUrl requires RegExp from libshared_build, but I am not too sure how to express that dependency between libbe_build and libshared_build. Looking in src/build/... I guess it would be added to the list of libs on BuildPlatformSharedLibrary in src/build/libbe/Jamfile?

comment:24 by bonefish, 8 years ago

Yes, you have to add the library to the library list. I believe a name map entry hasn't been defined for it, so you'll have to use the full name -- "libshared_build.a" -- instead of the abbreviation ("shared").

comment:25 by apl-haiku, 8 years ago

I am making some hard-won progress with this. I'm currently stuck on a build issue.

BUrl uses the RegEx (from libshared) which uses POSIX regex. On the build environment I'm using (MacOS-X) it seems to be linking to something that doesn't provide "regex groups" functionality for some reason. The compiled regex always reports that it has no groups in the expression. This means that the BUrl parse fails because it is trying to use the regex groups to pull apart the URL elements.

The build process does produce a libgnuregex_build.a and indeed, if I write a stand-alone test program, without libgnuregex_build.a there are no groups, but with libgnuregex_build.a, regex groups functionality seems to be present.

I presume this is my problem. My question is, how do I get that libgnuregex_build.a into the link process with the Jamfiles. I could imagine that it would make sense to include reference to libgnuregex_build.a in ../libbe/Jamfile, like this...

BuildPlatformSharedLibrary libbe_build.so :
        # no sources here
        :
        <libbe_build>app_kit.o
        <libbe_build>icon_kit.o
        <libbe_build>interface_kit.o
        <libbe_build>network_kit.o
        <libbe_build>storage_kit.o
        <libbe_build>support_kit.o

        libshared_build.a
        libgnuregex_build.a

        z $(HOST_LIBSUPC++) $(HOST_LIBSTDC++)
;

...but this doesn't seem to be fixing the issue. Am I going about this the right way or does it need to be bundled into libshared_build.a?

comment:26 by pulkomandy, 8 years ago

Since these are static libraries, the order in which they are added is important. libgnuregex should be listed before anything that uses regexp, otherwise the symbol won't be resolved and the host OS version will be used.

In this case, both libshared and libgnuregex should be listed before the object files, I think.

in reply to:  26 comment:27 by bonefish, 8 years ago

Replying to pulkomandy:

Since these are static libraries, the order in which they are added is important. libgnuregex should be listed before anything that uses regexp, otherwise the symbol won't be resolved and the host OS version will be used.

In this case, both libshared and libgnuregex should be listed before the object files, I think.

That doesn't sound right. Usually static libraries need to be listed after objects that require their symbols. AFAIK the linker's algorithm looks at static libraries only once: It tries to resolve undefined symbols from previously processed objects and discards any objects from the static library that aren't used afterwards.

To examine what the final link line looks like use jam -q -n libbe_build.so (remove the libbe_build.so file first or throw in the "-a" switch). If that looks OK, examine the symbols of the resulting libbe_build.so (don't know what the right tool under Mac OS X is -- Google knows). The respective regex symbols should not be appear as undefined.

comment:28 by apl-haiku, 8 years ago

Thanks for the pointer. It seems that Jam is receiving instruction to include "gnuregex" into the build from "BuildSetup";

# Unlike glibc FreeBSD's libc doesn't have built-in regex support.
if $(HOST_PLATFORM) = freebsd {
		HOST_LIBROOT += /usr/lib/libgnuregex.so ;
		HOST_STATIC_LIBROOT += /usr/lib/libgnuregex.so ;
} else if $(HOST_PLATFORM) = darwin {
		HOST_LIBROOT += libgnuregex_build.a ;
		HOST_STATIC_LIBROOT += libgnuregex_build.a ;
}

So this seems to insert the libgnuregex_build.a into the link before the object files. I also added it at the end as well (I think that makes more sense?) in this case below;

cc -lm -L/opt/local/lib -dynamic -dynamiclib -Xlinker -flat_namespace -o \
"/Volumes/HaikuSource/haiku/generated/objects/darwin/lib/libbe_build.so" \
"/Volumes/HaikuSource/haiku/generated/objects/darwin/x86_64/release/build/libroot/libroot_build_function_remapper.a" \
"/Volumes/HaikuSource/haiku/generated/objects/darwin/lib/libroot_build.so" \
"/Volumes/HaikuSource/haiku/generated/objects/darwin/x86_64/release/build/libgnuregex/libgnuregex_build.a" \
"/Volumes/HaikuSource/haiku/generated/objects/darwin/x86_64/release/build/libbe/app/app_kit.o" \
...SNIP...
"/Volumes/HaikuSource/haiku/generated/objects/darwin/x86_64/release/build/libshared/libshared_build.a" \
"/Volumes/HaikuSource/haiku/generated/objects/darwin/x86_64/release/build/libgnuregex/libgnuregex_build.a" \
-lz -lgcc_s.1 -lstdc++

Still the symbols remain undefined in the libbe_build.so;

                 U _regcomp
                 U _regexec
                 U _regfree

Rather strange. I will have to come back to it later.

comment:29 by bonefish, 8 years ago

Does libgnuregex_build.a define the symbols?

Having the static library that early on the link line indeed doesn't make too much sense. Unless the linker on Mac OS X works differently, that is. Jonathan changed the regex library handling in b55c918f579fb523946747cf26dde829fe7fe8c2. Before that it was using a shared library (but with hard-coded absolute path). It should be possible to build a shared library instead of a static one. At least in theory that shouldn't change anything with respect to your issue, though, given that it doesn't work with the repetition at the end either. OTOH sometimes static library code isn't suitable to be linked into shared libraries, but I would have thought that this would raise a link error instead.

comment:30 by apl-haiku, 8 years ago

The link for libbe_build.so looks something like this;

cc -lm -L/opt/local/lib -dynamic -dynamiclib -Xlinker -flat_namespace \
-o ".../objects/darwin/lib/libbe_build.so"
...
".../objects/darwin/x86_64/release/build/libshared/libshared_build.a" \
".../objects/darwin/x86_64/release/build/libgnuregex/libgnuregex_build.a" \
-lz -lgcc_s.1 -lstdc++

I've figured out that if I move the -lm into the end (directly after -lz) then the link process *does* bundle the symbols...

  • _regcomp
  • _regexec
  • _regfree

...into the linked libbe_build.so which is what I need. Actually, I can also omit the -lm and it also seems to be happy on the MacOS-X build host. I guess I need to figure out how to shift the -lm to the end where it should be.

I can see the -lm is added in BuildSetup:448 as variable HOST_LINKFLAGS and it seems to get used in the linking process at OverriddenJamRules:23.

comment:31 by bonefish, 8 years ago

Sounds like libm is either containing or otherwise pulling in the regex symbols from somewhere else. Moving the library to the end of the link line isn't easily done. A kludge could be added, but I'd rather avoid that. Instead I'd try and build a shared libgnuregex_build and see, if that fixes the issue.

comment:32 by apl-haiku, 8 years ago

I have modified the libgnuregex target to produce a shared library build product. This seems to then work fine. I added a brief ReadMe.md file to help anybody else who is caught out with this problem.

I then noticed there were some issues with the validation of the BUrl. For example, it seemed to be tolerant of trailing whitespace -- something I was trying to avoid. It also included the port in the host field in some cases which is also not quite right.

I've therefore modified the parsing logic in BUrl a bit to tighten it up. At the same time, I have added some automated testing to cover scenarios around those changes so that it is safer to make further changes in this class.

I was able to reproduce the problem Axel mentioned on Ubuntu; that should be sorted out now and has been tested on GCC 5.3.1.

On the MacOS-X platform, the operating system makes .DS_Store files all over the place so I added a .gitignore entry for those at the top level.

The commit message has been adjusted as suggested by Axel.

I hope it is good to go!

comment:33 by apl-haiku, 8 years ago

Hello; is it possible for somebody to review this change for me?

comment:34 by pulkomandy, 8 years ago

Looks mostly good, a minor problem is the unusual introduction of ReadMe.md directly in the source folder for libgnuregex, usually we try to keep docs in haiku/docs (http://cgit.haiku-os.org/haiku/tree/docs). I think http://cgit.haiku-os.org/haiku/tree/docs/develop/build could be a good place for this one?

comment:35 by axeld, 8 years ago

Thanks for the update, Andrew! I cannot comment about the host parsing changes to BUrl, but the rest looks good. Personally, I wouldn't mind the ReadMe.md in its current location.

I spotted a number of minor coding style issue, however (in Url.cpp):

  • line 22, 27-28: we usually indent #if/include/define like this (I believe this isn't mentioned in the coding style, though, so it might not be required):
    #if X
    #	include <y>
    #	define Z
    #endif
    
  • line 513: the parentheses are now superfluous
  • line 895: missing spaces around '='

The unit tests are greatly appreciated!

comment:36 by apl-haiku, 8 years ago

Hi Axel and @pulkomandy; Thank you for the reviews. I've moved that readme to /docs/develop/build/libgnuregex.md and I have made those code patches. Yes the unit tests will be helpful next time it needs changing. Please take a look and let me know if there are any further issues that need to be addressed.

comment:37 by apl-haiku, 8 years ago

Hi there; is it possible to get this reviewed and closed off?

comment:38 by waddlesplash, 8 years ago

Resolution: fixed
Status: newclosed

Applied in hrev50362. Thanks!

comment:39 by apl-haiku, 8 years ago

Thanks.

Note: See TracTickets for help on using tickets.