Ghosts On The Internet

August 25th, 2009 | by admin |

By rights the internet should be full of poltergeists, poor rootless things looking for their real homes. Many events on the internet are not properly associated with their correct timeframe. I don’t mean a server set to the wrong time, though that happens too. Much of the content published on the internet is separated from any proper reference to its publication time. What does publication even mean? Let me tell you a story…

“It is 2019 and this is Kathy Clees reporting on the story of the moment, the shock purchase of Microsoft by Apple Inc. An Internet Explorer security scare story from 2008 was responsible (yes, from 11 years ago), accidentally promoted by an analyst who neglected to check the date of their sources.”

If you think this is fanciful nonsense, then cast your mind back to September 2008 and the story, reported in Wired and The Times (UK), of a huge United Airlines stock tumble. A Florida newspaper had an automated popular stories section. A single reader looking at a story about United’s 2002 bankruptcy proceedings was enough to put it in that section, where it was picked up by Google’s next visit to the South Florida Sun Sentinel’s news home page.

The story was undated, and Google’s news engine apparently gave it a 2008 date. An analyst picked it up and pushed it to Bloomberg, and within minutes the United stock was tumbling. The stock price dropped from $12 to $3, then recovered to $11 over the day: a net fall of about eight percent in the share price, all over a misconfigured date.

Completing this out-of-order Christmas Carol, let’s look at current practice and how dates are managed; we might even get to clank some chains. Publication date used to be inseparable from publication: the two were stamped on the same piece of paper. How can we determine when things have been published now?

Determining publication dates

Time as defined by http://www.w3.org/TR/NOTE-datetime is a profile of ISO 8601 that mandates a four-digit year value. This is pretty well defined: we can get very accurate timings, down to milliseconds, and Ruby and other languages can even handle calendar reformation. So accuracy is not the issue.
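To show how little work a well formed timestamp needs, here is a minimal Python sketch using only the standard library. The helper name parse_w3c_datetime is my own invention for illustration, and it only handles the complete date, time and numeric offset form, not the year-only or year-and-month forms the profile also allows.

```python
from datetime import datetime


def parse_w3c_datetime(value: str) -> datetime:
    """Parse a complete timestamp such as 2008-12-18T02:57:28-08:00.

    Hypothetical helper: relies on datetime.fromisoformat (Python 3.7+)
    and rejects values that carry no time zone offset.
    """
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError(f"no time zone offset in {value!r}")
    return parsed


stamp = parse_w3c_datetime("2008-12-18T02:57:28-08:00")
print(stamp.year, stamp.month, stamp.day)
```

Insisting on an offset is a design choice rather than a requirement of the format: a time without one cannot be placed on a shared timeline.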

One problem is that there are many dates which could be interpreted as the publication date. Publication can mean any of: date written or created; date placed on server; last modified date; or the current date from the web server. Created and modified have parallels with file systems, but the large number of database driven websites means that this parallel no longer holds much meaning, as there are no longer any files behind the pages.
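One way to keep these candidates apart is to store them separately rather than collapsing them into a single "date" field. The sketch below is only an illustration: the class name, the field names and the order of preference are all my assumptions, not an established scheme.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CandidateDates:
    """The dates that might each be read as 'the' publication date."""
    created: Optional[datetime] = None           # written or created
    placed_on_server: Optional[datetime] = None  # first put online
    last_modified: Optional[datetime] = None     # most recent edit
    served: Optional[datetime] = None            # current date from the web server


def best_publication_date(dates: CandidateDates) -> Optional[datetime]:
    """Return the first available date in an assumed order of preference."""
    for candidate in (dates.placed_on_server, dates.created, dates.last_modified):
        if candidate is not None:
            return candidate
    # The server's current time says nothing about publication, so it is never used.
    return None


uploaded = datetime(2008, 7, 29, tzinfo=timezone.utc)
print(best_publication_date(CandidateDates(placed_on_server=uploaded)))
```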

Checking the web server with an HTTP HEAD request may not help either: it might give the creation time of the HTML file you are viewing, or it might give the last modified time of a file on disk. It is too unreliable and lacking in context to be of real value. So if the web server will not help, how can we get the right timeframe for our content?
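For reference, this is roughly what asking the server looks like in Python. I am assuming the value of interest is the standard Last-Modified response header; the sample header value is made up, and the function that makes the request is not called here because it needs network access.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional
from urllib.request import Request, urlopen


def last_modified(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified header as a datetime, or None if it is absent."""
    raw = headers.get("Last-Modified")
    return parsedate_to_datetime(raw) if raw else None


def head_last_modified(url: str) -> Optional[datetime]:
    """Send a HEAD request and read Last-Modified from the response headers."""
    with urlopen(Request(url, method="HEAD")) as response:
        return last_modified(response.headers)


# Made-up header value, for illustration only.
example_headers = {"Last-Modified": "Thu, 18 Dec 2008 18:52:05 GMT"}
print(last_modified(example_headers))
```

Even when the header is present, nothing in it says whether it refers to the article, the template or a cached file, which is the lack of context described above.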

We are left with URLs and the actual page content.

Looking at Flickr, this picture (by Douglas County History Research Center) has four date values which can be associated with it. It was taken around 1900, scanned in 1992, placed on Flickr on July 29th, 2008, and replaced later that day. Which dates should be represented here?

This is a hard question to answer, but currently the date of upload to Flickr is the best represented, both in the date URL, /photos/douglascountyhistory/archives/date-posted/2008/07/29/, and in some Dublin Core RDF for the year. Flickr uses 2008 as the value for this image. Not accurate, but a reasonable compromise given the millions of other images on their site.

Flickr represents location much better than it represents time. For the most part this is fine, but once you go back to the 1800s the maps of the world start to change a lot, and you need to reference both time and place.

The Google timeline search offers another interesting window on the world, showing results organised by decade for any search term. Being able to jump to a specific occurrence of a term makes it easier to get primary results rather than later reporting.

The 1918 “Spanish flu” results jump out in this timeline.

Timeline search result from Google

Any major news event will have multiple analysis articles written after the event, so finding the original reporting of Hurricane Katrina is harder now. Many publishers are putting older content online, e.g. Harpers or Nature or The Times. Often these use good date based URLs; sometimes they use unhelpful database references. If this content is available for free, how much better would it be to provide good metadata on the date of publication?

Date based URLs

A quick word on date based URLs: they can be brilliant at capturing first published date. However, they can be hard to interpret. Is /03/04 a date in March or April? What about 08/03/04? Obviously 2008/03/04 is easier to understand; it is probably March 4th. Including a proper timestamp in the page content avoids this kind of guesswork.
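The guesswork is easy to demonstrate. This small Python sketch reads the same fragment under three common field orders; the three orders and the names used are my own choice for illustration, and other conventions exist.

```python
from datetime import datetime

# Three common field orders; the digits alone cannot tell us which one a site means.
FIELD_ORDERS = {
    "year/month/day": "%y/%m/%d",
    "day/month/year": "%d/%m/%y",
    "month/day/year": "%m/%d/%y",
}


def possible_dates(fragment: str) -> dict:
    """Return every calendar date the fragment could stand for, keyed by field order."""
    readings = {}
    for label, pattern in FIELD_ORDERS.items():
        try:
            readings[label] = datetime.strptime(fragment, pattern).date()
        except ValueError:
            pass  # not a valid date under this field order
    return readings


for label, value in possible_dates("08/03/04").items():
    print(label, value.isoformat())
```

All three readings of 08/03/04 are valid dates, in 2008 or 2004 depending on the order, so nothing short of knowing the site's convention settles it.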

Many sites represent the date as a plain text string; a few hook an HTML class of date around it; a very few provide an actual timestamp. Associating the date with the individual content makes it harder to get the date wrong.

Movable Type and TypePad are notable exceptions: they embed Dublin Core RDF to represent each posting, e.g. dc:date="2008-12-18T02:57:28-08:00". WordPress doesn’t support date markup out of the box, though there is a patch and a howto for hAtom available.

In terms of newspapers, the BBC use <meta name="OriginalPublicationDate" content="2008/12/18 18:52:05" /> along with opaque URLs such as http://news.bbc.co.uk/1/hi/technology/7787335.stm.
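A meta tag like this is straightforward to read back out. The sketch below uses Python's standard HTML parser; the class and function names are hypothetical, and the date format string is inferred from the single example above rather than from any BBC documentation. Note that the value carries no time zone.

```python
from datetime import datetime
from html.parser import HTMLParser
from typing import Optional


class PublicationMetaParser(HTMLParser):
    """Collect the content of <meta name="OriginalPublicationDate">."""

    def __init__(self) -> None:
        super().__init__()
        self.published: Optional[datetime] = None

    def handle_starttag(self, tag, attrs):
        fields = dict(attrs)
        if (tag == "meta"
                and fields.get("name") == "OriginalPublicationDate"
                and fields.get("content")):
            # Format inferred from the example; the time zone is not stated.
            self.published = datetime.strptime(fields["content"], "%Y/%m/%d %H:%M:%S")


def original_publication_date(html: str) -> Optional[datetime]:
    """Return the publication date found in the page, or None."""
    parser = PublicationMetaParser()
    parser.feed(html)
    return parser.published


page = '<head><meta name="OriginalPublicationDate" content="2008/12/18 18:52:05" /></head>'
print(original_publication_date(page))
```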

The Guardian use nice clear URLs http://www.guardian.co.uk/business/2008/dec/18/car-industry-recession but have no marked up date on the page.

The New York Times are similar to the Guardian, with nice URLs such as http://www.nytimes.com/2008/12/19/business/19markets.html, but again no timestamps. All of these papers have all the data available, but it is not marked up in a useful manner.
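Where the URL is the only clue, a date can sometimes be recovered from the path. This is a rough Python sketch that assumes the /YYYY/MM/DD/ and /YYYY/mon/DD/ shapes seen in the URLs above; the pattern and the function name are mine, and other sites will need other patterns.

```python
import re
from datetime import date
from typing import Optional

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Matches /YYYY/MM/DD/ or /YYYY/mon/DD/ anywhere in the path.
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2}|[a-z]{3})/(\d{2})/")


def date_from_url(url: str) -> Optional[date]:
    """Return the date embedded in the URL path, or None if there is none."""
    match = DATE_IN_PATH.search(url.lower())
    if match is None:
        return None
    year, month, day = match.groups()
    month_number = MONTHS.get(month) if month.isalpha() else int(month)
    if month_number is None:
        return None
    try:
        return date(int(year), month_number, int(day))
    except ValueError:
        return None  # e.g. month 13 or day 32


print(date_from_url("http://www.guardian.co.uk/business/2008/dec/18/car-industry-recession"))
print(date_from_url("http://news.bbc.co.uk/1/hi/technology/7787335.stm"))
```

The opaque BBC URL yields nothing, which is exactly why a date marked up in the page itself is more dependable than the address.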

Syndication formats

Syndication formats are better at supporting dates. RSS uses RFC 822 for dates, just like email, so dates such as Wed, 17 Dec 2008 12:52:40 GMT are valid, with all the white space issues that entails.
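Because the format is shared with email, Python's email utilities can read it. A minimal sketch, with a function name of my own choosing:

```python
from email.utils import parsedate_to_datetime


def rss_date_to_iso(pub_date: str) -> str:
    """Convert an RFC 822 style date, as used in RSS, to an ISO 8601 string."""
    return parsedate_to_datetime(pub_date).isoformat()


print(rss_date_to_iso("Wed, 17 Dec 2008 12:52:40 GMT"))
# Extra white space between the fields is tolerated by this parser.
print(rss_date_to_iso("Wed,  17 Dec 2008   12:52:40 GMT"))
```

Not every consumer of a feed is this forgiving about spacing, which is one reason the stricter Atom format is easier to get right.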

The Atom syndication format uses the much clearer RFC 3339 (http://tools.ietf.org/html/rfc3339), with timestamps of the form 1996-12-19T16:39:57-08:00. Both syndication formats encourage the use of last modified. This is understandable, but a pity, as published date is a very useful value. The Atom syndication format supports “published” and mandates “updated” as timestamps; see the Atom RFC 4287 for more detail.
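Reading both values from an Atom entry takes only a few lines of Python. The entry below is made up for illustration; only the element names and the namespace come from RFC 4287. One caveat: RFC 3339 also allows a trailing Z for UTC, which datetime.fromisoformat only accepts from Python 3.11, so this sketch sticks to numeric offsets.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up entry for illustration only.
ENTRY = """
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example post</title>
  <id>tag:example.org,2008:post-1</id>
  <published>2008-12-18T22:01:27+00:00</published>
  <updated>2008-12-19T09:30:00+00:00</updated>
</entry>
"""


def entry_dates(entry_xml: str) -> dict:
    """Return the published and updated timestamps of one Atom entry."""
    entry = ET.fromstring(entry_xml)
    dates = {}
    for name in ("published", "updated"):
        text = entry.findtext(ATOM + name)
        # "published" is optional in Atom, so it may be missing.
        dates[name] = datetime.fromisoformat(text.strip()) if text else None
    return dates


print(entry_dates(ENTRY))
```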

Marking up dates

However, the aim of this short article is to encourage you to use microformats or RDF to encode dates. A good example of this is Twitter, which uses hAtom for each individual entry. http://twitter.com/zzgavin/status/1065835819 contains the following markup, which represents a human and a machine readable version of the time of that tweet.

<span class="published" title="2008-12-18T22:01:27+00:00">about 3 hours ago</span>
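To show what the machine readable half buys you, here is a short Python sketch that pulls the timestamp back out of markup like this. The class and function names are hypothetical, and it assumes the title attribute always holds a full timestamp with a numeric offset, as in the example.

```python
from datetime import datetime
from html.parser import HTMLParser


class PublishedParser(HTMLParser):
    """Collect timestamps from elements whose class list includes "published"."""

    def __init__(self) -> None:
        super().__init__()
        self.timestamps = []

    def handle_starttag(self, tag, attrs):
        fields = dict(attrs)
        classes = (fields.get("class") or "").split()
        if "published" in classes and fields.get("title"):
            self.timestamps.append(datetime.fromisoformat(fields["title"]))


def published_timestamps(html: str) -> list:
    """Return every published timestamp found in the markup."""
    parser = PublishedParser()
    parser.feed(html)
    return parser.timestamps


markup = '<span class="published" title="2008-12-18T22:01:27+00:00">about 3 hours ago</span>'
print(published_timestamps(markup))
```

The visible text, "about 3 hours ago", is useless a day later; the title attribute stays correct forever.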

The spec for datetime is still a draft at the minute, and there is ongoing conversation around the right format and semantics for representing date and time in microformats; see the datetime design pattern for details.

The hAtom example page shows the minimal changes required to implement hAtom on well formed blog post content and on other, less well behaved content. You have the information already in your content publication systems; this is not some additional onerous content entry task, simply some template formatting.
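As a sketch of how small that template change can be, the following Python renders a stored timestamp into the same shape of markup as the Twitter example. The function name is my own, and a real publishing system would do this in its own template language rather than in Python.

```python
from datetime import datetime, timezone
from html import escape


def published_span(published: datetime, label: str) -> str:
    """Render a human readable label with the machine readable time in title."""
    if published.tzinfo is None:
        raise ValueError("published must carry a time zone offset")
    stamp = published.isoformat(timespec="seconds")
    return f'<span class="published" title="{escape(stamp)}">{escape(label)}</span>'


when = datetime(2008, 12, 18, 22, 1, 27, tzinfo=timezone.utc)
print(published_span(when, "about 3 hours ago"))
```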

I started to see this as a serious issue after reading Stewart Brand’s Clock of the Long Now about five years ago. Brand’s book explores the issues of short term thinking that permeate our society; thinking beyond the end of the financial year is a stretch for many people. The Long Now has a world view of a 10,000 year timeframe; see http://longnow.org/ for much more information. Freebase, from Long Now Board member Danny Hillis, supports dates quite well; see the entry for A Christmas Carol.

In conclusion

I feel we should be making it easier for people searching for our content in the future. We’ve moved through tagging content and on to geo-tagging content. Now it is time to get the timestamps right on our content. “How do I know when something happened, and how can I find other things that happened at the same time?” is a fair question. This should be something I can satisfy simply and easily. There is a range of tools available to us in either hAtom or RDF to specify time accurately alongside the content, so what is stopping you?

Thinking of the long term, it is hard for us to know now what will be of relevance for future generations, so we should aim to raise the floor for publishing tools so that all content has the right timeframe associated with it. We are moving from publishing words and pictures on the internet to being able to associate publication with an individual via XFN and OpenID. We can associate place quite well too; the last piece of useful metadata is timeframe.
