Web Sites Are Here Today, Gone Tomorrow


By Rick Weiss

It was in the mundane course of getting a scientific paper published that physician Robert Dellavalle came to the unsettling realization that the world was dissolving before his eyes.

The world, that is, of footnotes, references, and Web pages.

Dellavalle, a dermatologist with the Veterans Affairs Medical Center, in Denver, had co-written a research report featuring dozens of footnotes—many of which referred not to books or journal articles but, as is increasingly the case these days, to Web sites that he and his colleagues had used to substantiate their findings.


Problem was, it took about two years for the article to wind its way to publication. And by that time, many of the sites they had cited had moved to other locations on the Internet or disappeared altogether, rendering useless all those Web addresses—also known as uniform resource locators (URLs)—they had provided in their footnotes.

“Every time we checked, some were gone and others had moved,” said Dellavalle, who is on the faculty at the University of Colorado Health Sciences Center. “We thought, ‘This is an interesting phenomenon itself. We should look at this.’”

He and his co-workers have done just that, and what they have found is not reassuring to those who value having a permanent record of scientific progress. In research described [in Science], the team looked at footnotes from scientific articles in three major journals—the New England Journal of Medicine, Science, and Nature—at three months, 15 months, and 27 months after publication. The prevalence of inactive Internet references grew during those intervals from 3.8 percent to 10 percent to 13 percent.

“I think of it like the library burning in Alexandria,” Dellavalle said, referring to the 48 B.C. sacking of the ancient world’s greatest repository of knowledge. “We’ve had all these hundreds of years of stuff available by interlibrary loan, but now things just a few years old are disappearing right under our noses really quickly.”

Dellavalle’s concerns reflect those of a growing number of scientists and scholars who are nervous about their increasing reliance on a medium that is proving far more ephemeral than archival. In one recent study, one-fifth of the Internet addresses used in a Web-based high school science curriculum disappeared over 12 months.

Another study, published in January [2003], found that 40 to 50 percent of the URLs referenced in articles in two computing journals were inaccessible within four years.

“It’s a huge problem,” said Brewster Kahle, digital librarian at the Internet Archive, in San Francisco. “The average lifespan of a Web page today is 100 days. This is no way to run a culture.”

Of course, even conventional footnotes often lead to dead ends. Some experts have estimated that as many as 20 percent to 25 percent of all published footnotes have typographical errors, which can lead people to the wrong volume or issue of a sought-after reference, said Sheldon Kotzin, chief of bibliographic services at the National Library of Medicine, in Bethesda.

But the Web’s relentless morphing affects a lot more than footnotes. People are increasingly dependent on the Web to get information from companies, organizations, and governments. Yet, of the 2,483 British government Web sites, for example, 25 percent change their URL each year, said David Worlock of Electronic Publishing Services Ltd., in London.

That matters, in part, because some documents exist only as Web pages—for example, the British government’s dossier on Iraqi weapons. “It only appeared on the Web,” Worlock said. “There is no definitive reference where future historians might find it.”

Web sites become inaccessible for many reasons. In some cases, individuals or groups that launched them have moved on and have removed the material from the global network of computer systems that makes up the Web. In other cases, the sites’ handlers have moved the material to a different virtual address (the URL that users type in at the top of the browser page) without providing a direct link from the old address to the new one.

When computer users try to access a URL that has died or moved to a new location, they typically get what is called a “404 Not Found” message, which reads in part: “The page cannot be displayed. The page you are looking for is currently unavailable.”
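A reference checker could tell these cases apart from the numeric status code a Web server returns. The following is a minimal sketch only: the function name `classify_status` and the labels "live", "moved", "gone", and "unknown" are hypothetical, and a real checker would also have to handle timeouts and servers that answer with a normal page for missing content.

```python
# Illustrative sketch: sort an HTTP status code into the cases the
# article describes. Names and labels are assumptions, not a standard.

REDIRECT_CODES = {301, 302, 307, 308}  # page moved, new address supplied
MISSING_CODES = {404, 410}             # "Not Found" and "Gone"


def classify_status(code: int) -> str:
    """Return a rough label for a numeric HTTP status code."""
    if 200 <= code < 300:
        return "live"      # the page was served normally
    if code in REDIRECT_CODES:
        return "moved"     # the old address forwards to a new one
    if code in MISSING_CODES:
        return "gone"      # the reader sees an error page
    return "unknown"       # server errors and anything else


for code in (200, 301, 404):
    print(code, classify_status(code))
```

A site that moves its material and leaves a redirect behind would show up as "moved"; one that moves without leaving a link from the old address would show up as "gone", even though the material still exists somewhere.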

So common are such occurrences today, and so iconic has that message become in the Internet era, that at least one eclectic band has named itself “404 Not Found,” and humorists have launched countless knockoffs of the page—including www.mamselle.ca/error.html, which looks like a standard error page but scolds people for spending too much time on their computers (“This page cannot be displayed because you need some fresh air…”) and www.coxar.pwp.blueyonder.co.uk, which offers political commentary about the U.S. war in Iraq (“The weapons you are looking for are currently unavailable.”).

Not all apparently inaccessible Web sites are really beyond reach. Several organizations, including the popular search engine Google and Kahle’s Internet Archive (www.archive.org), are taking snapshots of Web pages and archiving them as fast as they can so they can be viewed even after they are pulled down from their sites. The Internet Archive already contains more than 200 terabytes of information—equivalent to about 200 million books. Every month it is adding 20 more terabytes, equivalent to the number of words in the entire Library of Congress.

“We’re trying to make sure there’s a good historical record of at least some subsets of the Web, and at least some record of other parts,” Kahle said. “We’re injecting the past into the present.”

But with an estimated seven million new pages added to the Web every day, archivists can do little more than play catch-up. So others are creating new indexing and retrieval systems that can find Web pages that have wandered to new addresses.

One such system labels each document with a digital object identifier, or DOI. Even if the page moves to a new URL, it can always be found via its unique DOI.

Standard browsers cannot by themselves find documents by their DOIs. For now, at least, users must use go-between “registration agencies”—such as one called CrossRef—and “handle servers,” which together work like digital switchboards to lead subscribers to the DOI-labeled pages they seek. A hodgepodge of other retrieval systems is cropping up, as well—all part of the increasingly desperate effort to keep the ballooning Web’s thoughts accessible.
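In practice, a DOI is often turned into an ordinary Web address by prefixing it with the address of a resolver that does the lookup. The sketch below assumes the public resolver at https://doi.org/, which the article does not name; the function `doi_to_url` and the sample identifier `10.1234/example-id` are hypothetical and used only for illustration.

```python
from urllib.parse import quote

# Assumption: the public DOI resolver, not mentioned in the article.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver address for a DOI such as '10.1234/example-id'."""
    doi = doi.strip()
    prefix, slash, suffix = doi.partition("/")
    # A DOI has a prefix beginning with "10." and a non-empty suffix.
    if not (prefix.startswith("10.") and slash and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL, keeping "/".
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.1234/example-id"))
```

The point of the design is that the citation stores only the identifier; the resolver, not the footnote, keeps track of where the page currently lives.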

If it all sounds complicated, it is. But consider the stakes: The Web contains unfathomably more information than did the Alexandria Library. If our culture ends up unable to retrieve and use that information, then all that knowledge will, in effect, have gone up in smoke.

“On the Web, Research Work Proves Ephemeral,” Washington Post, Nov. 24, 2003. Rick Weiss is a staff writer with the Washington Post. Research editor Margot Williams contributed to this report.