We’re working with CERN at the moment, and on a recent trip over to Geneva we had the opportunity to spend some time with their search department.
While we were chatting we got onto the subject of semantics. They have a lot of very old content spread across the CERN domain, from obscure personal pages put together by staff in a text editor to full-blown content-managed sites that support the experiments.
At the markup level, the depth of semantics varies so wildly across these sites that making any sense of it for search is a Herculean task. On top of that, meta tags within the pages are poorly used, sometimes describing content that doesn’t even exist on that page, making the lives of the poor machines crawling all that content pretty darn miserable.
We’re encouraged to write markup as semantically as possible to help machines get around this problem. Wrapping our content in more meaningful containers provides better context, helps us make more sense of it (if it’s done well), and, yes, promotes good use of standards. But it got me thinking: just how useful is that markup to machines, given that idiot humans are the ones writing it? Machines like order, and although we strive for it, there’s too much opportunity for mistakes and abuse. Microformats are a great addition to the semantic web, but they’re subject to the same issues.
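To make the trade-off concrete, here’s a minimal sketch of what a machine gets from well-formed semantic markup. It parses a tiny hCard-style microformat snippet (the snippet and property names are illustrative, not taken from any real CERN page) using only Python’s standard-library parser:

```python
from html.parser import HTMLParser

# A minimal hCard-style microformat snippet (invented for illustration).
HCARD = """
<div class="vcard">
  <span class="fn">Tim Berners-Lee</span>
  <span class="org">CERN</span>
</div>
"""

class HCardParser(HTMLParser):
    """Collects the text inside elements whose class names an hCard property."""
    PROPS = {"fn", "org", "url", "email"}

    def __init__(self):
        super().__init__()
        self._current = None   # property name we're currently inside, if any
        self.card = {}         # extracted property -> value

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        hit = self.PROPS.intersection(classes)
        if hit:
            self._current = hit.pop()

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()
            self._current = None

parser = HCardParser()
parser.feed(HCARD)
print(parser.card)  # {'fn': 'Tim Berners-Lee', 'org': 'CERN'}
```

When the class names are used correctly, extraction really is this easy; the fragility the post describes appears the moment an author misspells `fn` or hangs it on the wrong element, and the machine silently gets nothing.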
Either way, whether it’s legacy content or modern content written with (intentionally) great semantics, we still face the same problem: how to understand that content well enough to do more with it. And for that, we now have search. When all else fails, a broader picture tends to give a more rounded view of things, particularly when the specifics you have are so prone to error. Even with incredible tools like Lucene running internally, better, more relevant results and meaning come from external search engines like Google or Bing.
Search engines figured out long ago that meta data, keywords and suchlike were horrible ways to categorise the web. Disambiguation aside, the system was prone to gaming and abuse, so they canned it, looking instead to perceivably relevant content, relationships, references and recommendations (through inbound links), and so on. All of which suddenly started sounding very familiar when you compare them to how we intend semantic markup to be used.
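The shift from self-declared keywords to inbound links can be sketched as a toy link-graph ranking, in the spirit of (a heavily simplified) PageRank. The pages and links below are invented for illustration; real engines use far richer signals:

```python
# Toy link graph: each page lists the pages it links out to.
# "orphan" links out but receives no inbound links at all.
links = {
    "home":   ["about", "blog"],
    "about":  ["home"],
    "blog":   ["home", "about"],
    "orphan": ["home"],
}

damping = 0.85
ranks = {page: 1.0 / len(links) for page in links}

# Power iteration: a page's score is fed by the scores of pages linking to it,
# split across each linker's outbound links.
for _ in range(50):
    new = {}
    for page in links:
        inbound = sum(ranks[p] / len(outs)
                      for p, outs in links.items() if page in outs)
        new[page] = (1 - damping) / len(links) + damping * inbound
    ranks = new

# "orphan" ends up with the lowest score: nobody links to it,
# regardless of what its own markup or meta tags claim about itself.
print(sorted(ranks, key=ranks.get, reverse=True))
```

The point of the sketch is the last comment: the score comes from what other pages say through their links, a signal the page’s author can’t unilaterally game the way they could stuff a keywords meta tag.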
So I’m slightly confused. It’s certainly not news that this is the way things are. But with the context of a problem like CERN’s and extrapolating that across the web, I’m starting to question just how valuable semantics really are when search is getting so good.
It’d be great to hear your thoughts in the comments below.