Opened 11 years ago

Last modified 22 months ago

#10067 new enhancement

Extension should determine file type with same sniffer rule

Reported by: humdinger Owned by: bonefish
Priority: normal Milestone: R1
Component: Servers/registrar Version: R1/Package Management
Keywords: Cc:
Blocked By: Blocking: #10917, #18195
Platform: All

Description

This is hrev46173.

If you have several file types with the same sniffer rule, the extension should determine what mime type a file is assigned.

For example, RAR archives and CBR (a comic book archive, consisting of rar-archived images) both have the rule '("Rar!")'. Naturally, since they are both RAR archives. In this case, the file extension should decide if it's a .rar or a .cbr.
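
To illustrate the ambiguity, here is a minimal sketch in Python (not registrar code). The `TYPES` table, both helper functions, and the `application/x-rar` type name are assumptions made for this example only; the sketch shows two types matching the same content and the extension narrowing the choice.

```python
# Hypothetical, simplified type table: one magic prefix and a list of
# extensions per MIME type. Names and values are illustrative only.
TYPES = {
    "application/x-rar": {"magic": b"Rar!", "extensions": ["rar"]},
    "application/x-cbr": {"magic": b"Rar!", "extensions": ["cbr"]},
}


def candidates_by_content(data):
    """Return every type whose magic prefix matches the start of the data."""
    return sorted(
        mime for mime, info in TYPES.items() if data.startswith(info["magic"])
    )


def pick_by_extension(filename, candidates):
    """Among content-matched candidates, prefer the one listing the extension."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    for mime in candidates:
        if ext in TYPES[mime]["extensions"]:
            return mime
    # No extension hint: fall back to the first candidate, if there is one.
    return candidates[0] if candidates else None
```

With this table, content alone yields two candidates for any RAR file, and only the file name distinguishes `comic.cbr` from `archive.rar`.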

Change History (14)

comment:1 by axeld, 11 years ago

Summary: changed from "Extention should determine file type with same sniffer rule" to "Extension should determine file type with same sniffer rule"

comment:2 by diver, 10 years ago

Blocking: 10917 added

comment:3 by Giova84, 8 years ago

Hi, I have a suggestion for implementing this enhancement.

For those who aren't aware, there is a Haiku app called Filer (Humdinger certainly knows this app ;-) ): https://github.com/HaikuArchives/Filer

Well, let's look at the readme of this app:

Filer is an automatic file organizer. It takes the files it's opened with or that are dropped on it and moves, renames, copies or does all sorts of other things with them according to rules created by the user. Filer is accompanied by AutoFiler. Instead of working on a set of files provided by the user, it can be started (automatically with Haiku) to monitor certain folders and deal with new files appearing there according to the user-defined rules.


In practice, the AutoFiler component is able to monitor folders on the system and automatically perform a bunch of actions, including shell commands; in some respects it acts like a sort of registrar.

The following is an example on my system:

As Humdinger stated

RAR archives and CBR (a comic book archive, consisting of rar-archived images) both have the rule '("Rar!")'. Naturally, since they are both RAR archives. In this case, the file extension should decide if it's a .rar or a .cbr.

Well, on my system I manage *.cbr files (I can open them using a Qt *.cbr and *.cbz reader), and when I place them in certain folders on my system, AutoFiler, thanks to Filer, automatically runs the following commands on files that end with the *.cbz and *.cbr extensions:

settype -t application/x-cbr
settype -t application/x-cbz

http://s33.postimg.org/6fb28674v/Autofiler_CBR_Rule.png

In this way I finally avoid the confusion with *.zip and *.rar archives (because if I set the same sniffing rule for *.cbr and *.cbz, zip and rar files will be mistakenly marked as cbr and cbz).

In fact, I have now lowered the value of the sniffing rules:

Rule for *.cbr
0.10 ("Rar!")
Rule for *.cbz
0.10 ("PK")

But thanks to Filer/AutoFiler, these files are properly recognized even with the lower value "0.10".

So: would it make sense to borrow the code from Filer/AutoFiler and implement this feature inside the registrar?

Last edited 8 years ago by Giova84

comment:4 by humdinger, 8 years ago

You manually force-set the correct MIME type with the "settype" command according to the file extension. I'm afraid, AFAIKS, there is nothing in Filer that can be re-used for the registrar.

comment:5 by CodeforEvolution, 4 years ago

In reality, to make this much simpler, we can go the BeOS route and have the registrar check a file's extension first. This would avoid problems with rar and cbr files, as well as cases where source code files (which are just text files with no standard format) are mis-typed as plain text files.

According to https://birdhouse.org/beos/bible/exc_filetype.html:

Assigning a MIME Type Where There Is None

When a new file arrives on your system without a MIME type (as happens when bringing files over to BeOS from other operating systems), the Tracker and the Registrar work together to assign it one.

The Registrar's first recourse is to look for an extension on the end of the filename, like .jpg, .txt, or .html. If it finds one, it checks the FileTypes database to see whether you've connected that extension with any particular filetype. For example, you may have used the Extensions section of FileTypes to declare that files ending in .html were likely to be HTML documents, and that they should inherit the text/html filetype.

If no extension is found, the Tracker will actually read a small portion of the file with a "sniffer." If it encounters plain text, it will assume that this is a text document and give it the appropriate MIME type. A similar process occurs with GIFs, WAVs, and other common filetypes. Because the extensions technique is more likely to be accurate, it's run first. Assuming you've set up a few common extensions in your FileTypes database, BeOS can guess a file's type accurately the vast majority of the time, with the vast majority of files.
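
The order described in this excerpt (extension lookup first, content sniffing second) can be sketched roughly as follows. This is a Python illustration, not BeOS or Haiku code: `EXTENSION_MAP`, `sniff_content`, and `guess_type` are made-up names, and the content checks are deliberately crude. Unlike the excerpt, the sketch also sniffs when an extension is present but unknown.

```python
import os

# Hypothetical stand-in for the extensions stored in the FileTypes database.
EXTENSION_MAP = {
    "jpg": "image/jpeg",
    "txt": "text/plain",
    "html": "text/html",
}


def sniff_content(data):
    """Very rough content sniffing on the first bytes of a file."""
    if data.startswith(b"GIF8"):
        return "image/gif"
    if data.startswith(b"RIFF") and data[8:12] == b"WAVE":
        return "audio/x-wav"
    try:
        data.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


def guess_type(filename, data):
    """Try the extension first; sniff the content only if that fails."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext in EXTENSION_MAP:
        return EXTENSION_MAP[ext]
    return sniff_content(data)
```

Note that in this order a known extension always wins, even when the content says otherwise.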

comment:6 by nephele, 4 years ago

Hey, is there currently code to determine the MIME type by the extension, or was this never implemented in the first place?

I'm willing to work on this; it's somewhat annoying that source code keeps getting opened with StyledEdit if it doesn't start with a C-style comment :)

(text/x-source-code has .lua as an extension, but I don't get any files recognized as this)

comment:7 by pulkomandy, 4 years ago

You can improve the sniffing rule to detect more languages :)

Detecting require " near the start of the file should be enough?

In general, extensions are considered a legacy thing from DOS/Windows that we should not still be using in this millennium. Files should be created with a MIME type, by the text or code editor you use to create them, or by the tool you use to import them into Haiku (for example, WebPositive will use the MIME type specified in HTTP headers when downloading a file).

The sniffing rules are already a fallback and it seems a bit strange to need a fallback to the fallback.

comment:8 by nephele, 4 years ago

I don't know of any version control I could use that would preserve the MIME types. Not even our own source code has those assigned; .cpp files get opened by the code editor only if they start with /*. I would argue that detecting file extensions is still crucial even now.

None of my Lua files start with require :). The problem is more that the only real difference for unknown files is .lua vs .txt, and even though we have the rule to match this to text/x-source-code, it never gets applied.

comment:9 by pulkomandy, 4 years ago

Yes, extensions are not used by the code currently. That's the point of this ticket. We don't have a "rule", we have an informative list of extensions that's not used by anything.

comment:10 by nephele, 4 years ago

Where roughly would this have to be implemented to be used for sniffing? (Assuming the higher priority rules aren't decisive enough)

comment:11 by pulkomandy, 4 years ago

There isn't a "decisive enough" threshold. Each MIME sniffing rule comes with a score, the matching rule with the highest score wins.
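
As a rough illustration of "highest score wins" (a Python sketch, not the registrar's implementation): the `RULES` list and `best_match` are hypothetical, each rule is reduced to a magic prefix at offset 0, the two 0.10 values are taken from comment:3, and the 0.50 value and the `application/x-rar` name are assumptions.

```python
# Each hypothetical rule is (priority, magic prefix, MIME type).
RULES = [
    (0.50, b"Rar!", "application/x-rar"),  # assumed priority
    (0.10, b"Rar!", "application/x-cbr"),  # value from comment:3
    (0.10, b"PK", "application/x-cbz"),    # value from comment:3
]


def best_match(data):
    """Return the type of the matching rule with the highest priority."""
    matches = [
        (priority, mime)
        for priority, magic, mime in RULES
        if data.startswith(magic)
    ]
    if not matches:
        return None
    # max() keeps the first of several equal priorities, so a tie is
    # decided by rule order alone, not by anything about the file.
    return max(matches, key=lambda match: match[0])[1]
```

When two matching rules carry the same score, nothing in this scheme tells them apart, which is the situation this ticket is about.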

Interestingly, there is no rule for "text/plain", so I don't know how your files end up being opened by StyledEdit, unless something else sets the file type?

So, I see two possible ways to go about this:

  • Use the extensions only if no rule matched, as a fallback,
  • Factor in the file extensions as an extra hint for the sniffing rules, and somehow allow them to adjust their score according to the extension

The latter seems better for the original problem in this ticket, where we have multiple rules matching the content, and use the extension to decide between them. The former would be better in your case, where there is no matching rule at all and we would then rely only on the file extension.
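
The two options above can be contrasted in a small Python sketch. Everything here is hypothetical: the `RULES` and `EXTENSIONS` tables, the equal 0.5 priorities, the `EXTENSION_BONUS` value, and the function names are assumptions for illustration, not existing registrar code.

```python
import os

# Hypothetical rules: (priority, magic prefix, MIME type), with equal scores.
RULES = [
    (0.5, b"Rar!", "application/x-rar"),
    (0.5, b"Rar!", "application/x-cbr"),
]
# Hypothetical per-type extension lists.
EXTENSIONS = {
    "application/x-rar": {"rar"},
    "application/x-cbr": {"cbr"},
    "text/x-source-code": {"lua", "cpp"},
}
EXTENSION_BONUS = 0.05  # assumed value, small enough to act as a tie-breaker


def extension_of(filename):
    return os.path.splitext(filename)[1].lstrip(".").lower()


def sniff_with_fallback(filename, data):
    """Option 1: consult the extension only when no rule matched."""
    matches = [(p, mime) for p, magic, mime in RULES if data.startswith(magic)]
    if matches:
        return max(matches, key=lambda match: match[0])[1]
    ext = extension_of(filename)
    for mime, extensions in EXTENSIONS.items():
        if ext in extensions:
            return mime
    return None


def sniff_with_bonus(filename, data):
    """Option 2: a matching extension raises the score of a matching rule."""
    ext = extension_of(filename)
    scored = []
    for priority, magic, mime in RULES:
        if data.startswith(magic):
            bonus = EXTENSION_BONUS if ext in EXTENSIONS.get(mime, ()) else 0.0
            scored.append((priority + bonus, mime))
    if not scored:
        return None
    return max(scored, key=lambda match: match[0])[1]
```

In this sketch the fallback variant types `script.lua` correctly but still cannot separate `comic.cbr` from a plain RAR archive, while the bonus variant does the opposite, so covering both cases would likely need a combination of the two.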

comment:12 by CodeforEvolution, 4 years ago

The "text/plain", "text/x-source-code", and other similar mime types are assigned by a dedicated sniffer add-on: https://github.com/haiku/haiku/blob/master/src/kits/storage/mime/TextSnifferAddon.cpp

I can say that I've also experienced too many source code files just being MIME typed as "text/plain", so I'm in support of your endeavors nephele! :D

I agree with pulkomandy on the handling of file extensions as more of a "tie breaker" when deciding what MIME type to assign to a file. (That should be just enough to fix the problem with source code files not being identified properly too!)

comment:13 by X512, 4 years ago (in reply to: comment:7)

Replying to pulkomandy:

In general, extensions are considered a legacy thing from DOS/Windows that we should not still be using in this millennium.

Extensions are still actively used in all major systems. External storage (Git, other OS filesystems) often does not support MIME types, so the types are lost. One annoying example is Git operations on the Haiku repository, which destroy MIME types. Files do not exist only inside one OS. Files are meant to be exchanged.

File type detection by content is not possible in general and may be resource-consuming. Different file types may have the same format (ogg/ogv, mp4/m4a, zip/docx/jar, python/meson, etc.).

I think that file extensions should be used to detect file type if present.

Last edited 4 years ago by X512

comment:14 by pulkomandy, 22 months ago

Blocking: 18195 added