So what is metadata, anyway?

Writing about metadata is risky business, since every post and every tweet potentially starts the same discussion: what exactly is metadata, anyway? So here's my ambitious attempt to cut to the chase, and open the can of worms again.

Why would you care, anyway? Isn't this just some highly technical or theoretical debate? Well, to some extent it is, but the fact remains that for any content technology, metadata is essential. Metadata is what allows us to use a system to manage content in the first place. Even if you take the brute-force approach of using enterprise search, rather than meticulously tagging all your content with metadata, you'll find the results disappointing, at best. (In fact, if there's no useful metadata available, search engines will have to create it themselves.) Metadata is so important that we now even get court rulings to define it.

Of course, the essence is easily defined. Metadata is data about data. The problem is that, in the end, you can't really define the distinction between data and metadata.

The examples are abundant: a document's author, the date content was created or published, the name of a database column, even the filename is metadata. You can see it in any system dealing with content, and often, helpfully, it will actually be marked as "metadata." There are standards for what metadata you could have (like Dublin Core, or EXIF) or how to store it in a document itself (like XMP). If that's all you want to know, now might be a good time to stop reading. Because from there, it starts getting tricky.

Some argue that the concept of metadata is just not very intuitive, because it's artificial, something we're not used to "in real life." I doubt it. (You need to look no further than the cover of a book to understand why.) In fact, we're quite used to those meta-levels of looking at things. We need them to communicate. ("The color of my car is green.") We're so used to them, in fact, that you could argue that any kind of content is metadata, since it always describes something else. (Even a picture of a chair is not really a chair, just a reference to it; and this blog post is not just text -- it's about...)

In content management, we tend to define metadata by content's use or purpose, rather than its nature. Something is metadata, because we want to use it as metadata. A CMS will use that metadata as a "hook," to instigate an action, such as displaying content on a particular page in a certain way. A developer may want to sort based on date, an information architect or knowledge manager may want to display content based on how it's classified, or users need facets to refine the results in their search interface. Those uses are quite different, and sometimes at odds with each other.
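To make those three uses concrete, here's a minimal sketch in Python. The content items and their field names ("published", "topic", "type") are made up for illustration and don't come from any particular CMS; the point is only that the same metadata serves as a hook for sorting, for classification, and for facets.

```python
from collections import Counter
from datetime import date

# Hypothetical content items; the fields are illustrative assumptions.
items = [
    {"title": "Annual report", "published": date(2009, 3, 1), "topic": "finance", "type": "pdf"},
    {"title": "Launch video", "published": date(2009, 6, 15), "topic": "marketing", "type": "video"},
    {"title": "Budget memo", "published": date(2009, 1, 20), "topic": "finance", "type": "doc"},
]

def newest_first(items):
    """The developer's use: sort on the date field."""
    return sorted(items, key=lambda item: item["published"], reverse=True)

def by_classification(items, topic):
    """The information architect's use: select content by how it's classified."""
    return [item for item in items if item["topic"] == topic]

def facet_counts(items, field):
    """The search user's use: count the values of a field to offer as facets."""
    return Counter(item[field] for item in items)

print([item["title"] for item in newest_first(items)])
print([item["title"] for item in by_classification(items, "finance")])
print(facet_counts(items, "topic"))
```

Notice that nothing in the items themselves says which fields are "metadata." The title could just as easily be sorted on or faceted; it's the use that decides.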

Your records manager may want to keep all the metadata together with the data, as one "document." A developer would often prefer a system to treat metadata just as it does any data (because then it's accessible through the APIs in a uniform way, and the developer doesn't need to jump through hoops to get to it). On the other hand, for performance purposes, you might want to keep metadata and data separate (store the "about" stuff in the database, and the huge video itself on the filesystem -- as DAM systems often do). But a web editor will often wonder why some important fields (their distinction will often seem entirely arbitrary) are marked "metadata" and hidden two tabs and several clicks away.
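The "keep them separate" option can be sketched roughly like this, assuming SQLite for the "about" stuff and a plain directory for the files. The table name, columns, and helper functions are hypothetical, invented for this example rather than taken from any real DAM product.

```python
import sqlite3
import tempfile
from pathlib import Path

def store_asset(db, asset_dir, name, payload, author):
    """Write the (potentially huge) payload to disk; record only its metadata in the database."""
    path = Path(asset_dir) / name
    path.write_bytes(payload)
    db.execute(
        "INSERT INTO asset_metadata (name, author, size_bytes, path) VALUES (?, ?, ?, ?)",
        (name, author, len(payload), str(path)),
    )
    db.commit()
    return path

def find_by_author(db, author):
    """Query the metadata alone; the files themselves are never opened."""
    rows = db.execute(
        "SELECT name, path FROM asset_metadata WHERE author = ?", (author,)
    )
    return rows.fetchall()

# Hypothetical schema: one row of metadata per stored file.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE asset_metadata ("
    "name TEXT PRIMARY KEY, author TEXT, size_bytes INTEGER, path TEXT)"
)
asset_dir = tempfile.mkdtemp()

# A kilobyte of zeros stands in for the huge video.
store_asset(db, asset_dir, "launch.mp4", b"\x00" * 1024, "jane")
print(find_by_author(db, "jane"))
```

Searching stays fast because it only touches the small table, but the trade-off is exactly the one the records manager worries about: move or copy the file on its own, and the metadata no longer travels with it.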

You're unlikely to resolve those conflicts by arguing who's right. Some of these particular debates have been raging for thousands of years. Plato would say that you should consider metadata to be external to what it describes. Aristotle would tell you that these are inherent attributes of a file or record. It's a point excellently illustrated by Raphael's painting in the Vatican, with Plato, at left, pointing to The Cloud, obviously, and Aristotle controlling the files.

You may want to hire several expert philosophers to argue on your behalf, while you get on with the job of actually managing content. Because in the end, everybody is going to disagree on what metadata is, and nobody is going to be "right". For any content management project, you'll want to be clear on what everybody needs, and how the system needs to use content. That's what should define your metadata.

(And by the way, if you completely disagree with me on this -- have your philosopher contact my philosopher, and they can work out the epistemological and ontological fine print.)

Social Software-related blog posts from CMSWatch.com "Trends" blog.
