Wednesday, April 18, 2012

Apache Tika - File Content Analyser

If you work with files and documents day in, day out and need to detect their content type for whatever reason, Apache Tika can save your day.

One valid large-enterprise use case I can think of:

A system is required to read through a few hundred thousand documents of a few hundred distinct MIME types every day, validate them, process them, and push them to subscribed interfaces via various protocols.

From the architecture/design perspective, I would like my solution to have extensibility built in. That means a single file reader which can handle multiple file formats, so that I don't have to spend time and effort in IT on incorporating a new file format in the near future. It would also be prudent to validate the file formats before consuming them, by way of pre-processors responsible for detecting the content type before the file is read. Rejecting a mislabelled file up front saves the time that would otherwise be wasted trying to parse it, and improves overall performance.

Consider a scenario where a huge video file is disguised as a PDF just by altering its extension. It would not be wise for the IT solution to read only the file extension and presume the file format is valid.

Apache Tika does exactly that. It detects the MIME type of a file by inspecting the content itself, chiefly the signature ("magic") bytes at the start of the file, optionally helped by hints such as the file name, rather than trusting the extension. And remember, it is not very invasive: it doesn't need to read the entire content of a 5 GB file to work out the actual content type. In its own words, this open source project "detects and extracts metadata and structured text content from various documents using existing parser libraries". The downside is that it is a bit heavy (~25 MB library). The benefit is that it can read, detect and parse a wide variety of file contents and report a specific MIME type, which apparently is not the case with many of the Java APIs out there.
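To make the idea of content-based detection concrete, here is a minimal sketch in plain Java that looks only at the first few bytes of a file. It is not Tika's implementation: the class name `MagicSniffer` and its two hard-coded signatures are made up for illustration, and treating every `ftyp` header as `video/mp4` is a simplification (other ISO media formats share that marker). Tika itself exposes detection through its `Tika` facade class (`detect(...)`) and knows far more signatures than this.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, for illustration only (not part of Apache Tika).
final class MagicSniffer {
    // Only this many leading bytes are ever read, however large the file is.
    private static final int HEADER_LENGTH = 12;

    // Guess a MIME type from the leading bytes of the content.
    static String detect(byte[] header) {
        if (startsWith(header, 0, "%PDF-")) {
            return "application/pdf";
        }
        // ISO base media files carry "ftyp" at offset 4 (simplified to MP4 here).
        if (startsWith(header, 4, "ftyp")) {
            return "video/mp4";
        }
        return "application/octet-stream"; // unknown content
    }

    // Read just the header of the file and ignore its name and extension.
    static String detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[HEADER_LENGTH];
            int read = in.readNBytes(header, 0, HEADER_LENGTH); // Java 9+
            return detect(Arrays.copyOf(header, read));
        }
    }

    private static boolean startsWith(byte[] data, int offset, String signature) {
        byte[] sig = signature.getBytes(StandardCharsets.ISO_8859_1);
        if (data.length < offset + sig.length) {
            return false;
        }
        for (int i = 0; i < sig.length; i++) {
            if (data[offset + i] != sig[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MagicSniffer <file>");
            return;
        }
        System.out.println(detect(Paths.get(args[0])));
    }
}
```

With this sketch, a video file renamed to `report.pdf` is still reported as video, because the extension is never consulted. A pre-processor could compare that result with the type the extension claims and reject the file on a mismatch.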

More information on APIs and Documentation here


I happened to use this open source library in one of my implementations and found it really helpful. Thought it might help someone out there with similar requirements :)
