Opened 5 years ago

Closed 7 months ago

#11800 closed bug (invalid)

WebPositive inconveniently strict

Reported by: donn
Owned by: pulkomandy
Priority: normal
Milestone: R1
Component: Applications/WebPositive
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking: #12163
Has a Patch: no
Platform: All

Description

WebPositive won't accept invalid HTML, whereas Safari, for example, manages fine with the same file. Per a suggestion on the forum, I'm posting an example that I received in an email from Expedia. That's where I encounter this problem: when I invoke WebPositive on an HTML file on disk that I received via email. These files clearly aren't valid HTML (this one starts with a 1x1 img and then tries to follow that with a doctype), but like I say, Safari renders them perfectly without a whimper.

Because I have this problem only with email, I wondered if there might be some different standard for file: vs http:.

Attachments (1)

mail009.html (48.6 KB) - added by donn 5 years ago.
HTML mail from expedia.com


Change History (10)

by donn, 5 years ago

Attachment: mail009.html added

HTML mail from expedia.com

comment:1 by donn, 5 years ago

... and it turns out that the same file is handled fine when it's encountered online, so it appears that the problem does indeed involve different standards for file: vs. http: content. I run into it when I read my email on Haiku, but it would also affect HTML documentation, etc.

comment:2 by pulkomandy, 5 years ago

I think the main problem is that the file's doctype says it's XHTML. We use our MIME sniffing rules to set the MIME type to that, but it would seem the file is actually HTML (which is parsed in a more relaxed way). When getting files online, the server provides a MIME type, so we don't use the sniffing. It should be possible to get the MIME type from the e-mail header and use that instead of trying to guess. If the mail is set to use the more correct "HTML" MIME type, it would work better. You can also force it by editing the file attributes.

In any case, Web+ should probably fall back to HTML when parsing as XHTML fails.
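
The fallback suggested here can be sketched in a few lines. This is only an illustration in Python, using the standard library's strict XML parser and its lenient HTML parser; it is not how WebKit or Web+ implement parsing, and `parse_with_fallback` and `TagCollector` are made-up names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Lenient HTML parser that just records the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_with_fallback(markup):
    """Return (mode, tags): try strict XML first, then lenient HTML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # Not well-formed XML: retry with the forgiving HTML parser.
        collector = TagCollector()
        collector.feed(markup)
        collector.close()
        return "html", collector.tags
    # Strip any "{namespace}" prefix ElementTree adds to tag names.
    tags = [el.tag.rsplit("}", 1)[-1] for el in root.iter()]
    return "xhtml", tags
```

A document shaped like the attached mail (an img element followed by an XHTML doctype) is rejected by the strict parser and ends up in the "html" branch, while a well-formed XHTML document stays in the "xhtml" branch.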

comment:3 by donn, 5 years ago

I am pleased to confirm that I can get it to work by having my email program set file type to text/html.

comment:4 by donn, 5 years ago

I'd be tempted to change this to a bug against the mime sniffing process. Haiku might take a tip from the httpd server, which gets better results for everyone by classifying everything as "text/html". At least where the file name is .html. I guess in practice this is a broader category that includes the various specifications.

For example if people search for html files, how likely is it that they will care which specification the files claim to adhere to?

Conversely, files that really should have this distinct application/xhtml+xml type should probably have a different icon as well, right?

comment:5 by pulkomandy, 5 years ago

I introduced the separate MIME type because it is required by some of the tests in the WebKit test suite. Some other pages could be affected as well: the parsing rules for HTML and XHTML are different and incompatible. In your case it's HTML failing to render because it's parsed as XHTML, but it could be the reverse (because of extra features in XHTML, like the use of XML namespaces). So it is important to have both types for proper operation of the web browser.

Now, the "sniffing" detection for these is difficult because they can look quite close to each other (especially if people use the wrong doctype in their documents, as in the sample file here…). So the way to handle this is to try getting the information from elsewhere. When the file comes from an e-mail, the native Mail application could get it from the mail headers, which specify a content type. When the page comes from an HTTP server, Web+ will use the HTTP Content-Type header. When downloading a file, it should probably also store the content type in the MIME type attribute of the file. This way no sniffing would be needed.
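
To illustrate the order of preference described above, here is a rough Python sketch: a declared Content-Type header (from mail or HTTP) wins, then a type already stored with the file, and sniffing is only the last resort. The function names, the `stored_type` parameter and the `sniff` callback are hypothetical; this is not Haiku or Web+ code.

```python
from email.message import Message


def type_from_header(value):
    """Reduce a Content-Type header value (mail or HTTP) to a bare MIME type."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_type() drops parameters such as charset and lowercases.
    return msg.get_content_type()


def resolve_mime_type(header_value=None, stored_type=None, sniff=None):
    """Pick a MIME type, preferring declared information over guessing."""
    if header_value:
        return type_from_header(header_value)
    if stored_type:
        # Assumed to be a type saved earlier, e.g. in a file attribute.
        return stored_type
    if sniff is not None:
        # Content sniffing is consulted only when nothing was declared.
        return sniff()
    return "application/octet-stream"
```

With this ordering, a mail part declared as `text/html; charset=UTF-8` resolves to `text/html` no matter what the doctype inside the file claims.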

comment:6 by donn, 5 years ago

Keep both types, for sure. I'm just saying that the sniffing should always return "text/html", just like httpd always says "text/html" when it serves up a .html file, regardless of what's in it.

We can deal with it at the application level by applying the type we get from email or httpd, as you say. We're doing that because email & httpd have got it right. They haven't searched the contents and discovered evidence of xhtml etc. They call it all text/html because that's what we call an .html file. The sniffing detection is not only difficult, it's inherently wrong.
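
As a sketch of what is being proposed here, typing by file name alone (the way a web server typically does) could look like the following Python snippet. The table and the `type_from_name` helper are assumptions for illustration only; in particular, the `.xhtml` entry is not something discussed in this ticket.

```python
import os

# Hypothetical extension table; contents are never inspected.
EXTENSION_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".xhtml": "application/xhtml+xml",  # assumed mapping, for contrast
}


def type_from_name(filename, default="application/octet-stream"):
    """Return a MIME type based only on the file extension."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TYPES.get(ext, default)
```

Under this scheme `mail009.html` would be `text/html` whatever doctype it declares.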

comment:7 by axeld, 5 years ago

FWIW I tend to agree with Donn.

comment:8 by pulkomandy, 4 years ago

Blocking: 12163 added

comment:9 by waddlesplash, 7 months ago

Resolution: invalid
Status: new → closed

The file claims to be XHTML but is invalid as such; this isn't really our problem.

In fact, current Firefox (on Windows) refuses to open it and shows an XML error, so this seems to be fine behavior to me.
