14 January 1994
LOC94\TxtSrch.MS
5,500 words
By Norman Bauman
The Bay Area Rapid Transit system pushed transportation technology to the limit in the 1970s. Trains didn't have conductors; they ran automatically, like self-service elevators.
Unfortunately, BART also pushed technology beyond the limit. A robot car came into the Fremont station at the end of the line, didn't stop, and kept going--through the station and into the parking lot. The lawsuits over this and other problems came to about $250 million.
The ensuing lawsuits pushed litigation support technology to the limit too. Litigators for Bechtel, one of the parties in the suits, fed every document into the IBM STAIRS system, a full-text document retrieval system.
"If you put the full text of the document on the computer, what more could you want?" That was the paradigm of the day, said David C. Blair, now Associate Professor of Computer and Information Systems, Graduate School of Business, at the University of Michigan. It would seem that, with the right search, you could find anything you want.
But STAIRS only retrieved 20% of the relevant documents in the 350,000-page database. Even worse, the attorneys thought they were getting most of the relevant documents when they weren't, Blair concluded in a frequently-cited paper, "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System," that he wrote with M.E. Maron, Professor (Emeritus) of Library and Information Sciences at the University of California, Berkeley (Communications of the ACM, March 1985, 28:3).
(When Blair and Maron wrote their article in Communications of the ACM, they were under a confidentiality agreement, and could not identify the BART case, or disclose specific details of the searches, which are revealed here for the first time.)
"People who were really angry about this situation," said Blair, "were very direct in the way they referred to the situation: the 'Fremont accident.'"
"People who felt vulnerable or culpable tended to refer to it in oblique or euphemistic terms: 'The incident of last Tuesday,' 'the unfortunate incident,' 'the unfortunate occurrence of last week.'" The minutes of one meeting didn't mention "Fremont" or "accident" or "train" at all. The opening statement of the minutes was, "We all know why we're here."
"How would you search on that?" asked Blair.
The much-cited 20% figure doesn't represent an inherent limit of free text search either, as attorneys sometimes claim. There were other problems of document management in the BART case that would have reduced the success of any search technique. For example, an Austin, Texas firm, Brown McCarroll & Oaks Hartline, which routinely handles similar cases, ran into the same problems and figured out how to deal with them.
"The real conclusion of that paper was that full text retrieval was not a simple solution to the problem, it was one of many tools, and had to be used very cautiously," said Blair.
That was the optimism of a decade ago. Since then, computers have become more powerful, lawsuits have been getting bigger, and we have the experience of the refined multi-million dollar commercial services like Westlaw and Lexis. Can't we just throw all our documents into a computer and search for the ones we want?
The most important lesson is that small databases work better than large ones, and irrelevant documents clutter your searches. So you shouldn't fill up your database with junk documents that you'll never use, merely because disk storage is relatively cheap.
The second lesson is that the better you understand the structure of your case, and the origin and significance of your documents, the more easily you can find them. You can search large full-text databases, provided you organize them well at the beginning.
"You don't get something for nothing," said John C. Tredennick, Jr., litigation partner at Holland & Hart, Denver, CO, and past chair of the American Bar Association's Litigation Interest Group. "No computer is going to do the work for you. You've got to study the documents and make decisions about what's relevant."
Beyond that, you can search for legal documents more easily than for discovery documents. You can search for your own documents more easily than for documents you have never seen before. You can search for names and dates more easily than concepts. Boolean searches are clever--but sometimes too clever, and they don't work as well as people thought they would.
First, one of the most effective applications seems to be document assembly. You can retrieve your own personal or firm documents for drafting new documents. Alternatively, you could use an electronic form book, but for many lawyers, it's easier to find previous language with a text search program and paste it into a new document.
Second, after you have organized your documents into a structured collection, such as a trial notebook, you can use text search programs as another tool to find things that you know are in there.
And third, attorneys usually have their computer and paper files in a reasonably well-organized filing system, but sometimes you still have trouble finding a document, and text searching gives you another way to look. If you remember a name, you can find the document.
A fourth application is more problematic and certainly more difficult: Scanning through discovery document collections that are too big to read, and finding useful information--maybe even a smoking gun. You can scan for names and standard phrases, but you can't reliably find concepts. Computers can help you skim, but they can't read for you.
In huge document cases, lawyers want to find every document that's relevant to every point in their case. Computers can't do that.
"There are two problems with that," said Blair.
First, the munchkin has to make decisions about evidence and relevance, so the lawyer is delegating legal judgment to non-lawyers.
Second, the lawyer is giving directions to the munchkin that are "basically the same kind of direction you would give to a full text retrieval system." So the coding and abstracting is subject to the same kind of limitations. And there are inherent limitations to indexing.
"It's very difficult to say precisely what you're looking for," said Blair. "You can't tell a munchkin, 'Look for a smoking gun, look for a letter that says they knew what they were doing.' Evidence is something that you can't often describe ahead of time. It has to be a recognition: 'Aha! I didn't know he said that.'"
"Performance will depend on the quality of the indexing, and the quality of the searches," said Turtle. "A really well-indexed document collection, being searched by someone who knows the vocabulary very well, will work much better than a novice on the same collection, or a searcher in a poorly indexed collection."
"Manual indexing is guaranteed to be hard to use for big collections," said Turtle, "and to some extent any kind of indexing is going to be hard to use for big collections."
"You can never be sure you've got it all," said Turtle. "You need to do your best to try to state your question in different ways."
"We handle pretty big cases," said Leslie Webb, Software Support Specialist. "They can be in the millions of documents," she said. For litigation support, they index their documents with standard fields in Paradox or Foxpro, and then import the data into Folio Views.
"For a while we were scanning everything," said Webb. An outside consultant scanned and imported the documents into Folio Views, which creates a searchable file called an Infobase. "If you have full text, you have everything you need," they assumed.
Or at least, that was the original idea. When the attorneys and paralegals searched, "they would get so much that they didn't know how to narrow it down," said Webb. "And there's a cost consideration." The consultant charges 75 cents to $1.50 per page for full text, depending on the quantity and number of fields.
Now, their scanning is selective. The full text of depositions goes into the Infobase, as do the Paradox and Foxpro indexes, and they image medical records, but they scan into full text only the documents that are particularly important.
Sometimes they can get word processing text directly during discovery. One of their discovery questions is, "Do you have this available on disk?" They don't get it very often, "but it has been done," said Webb.
"We usually sit down with the attorney and ask what they want," said Webb. "They give us an outline of the menu they want to have, and we put the document in wherever it fits."
Folio Views is easy to use, said Webb. "But to author a real good Infobase is difficult, because if you don't know what you're doing, if you don't know how to group things, how to link things, you'll never find what you're looking for. On the front end there's a ton of work involved. You're only going to get out of it what you put into it."
Typically the attorneys use a fairly easy search, and so they get bigger hits, said Webb. Attorneys might search medical records for "headache" or "head." Legal assistants "might use a bit more complicated search, to try to narrow it down," she said. "They'll search for names, phrases, and use proximity searches a lot."
"The most important thing for full-text searching is discovery responses and searching for a witness' name," said Webb. Attorneys also search for repeated themes in the depositions.
Some attorneys like to keep everything they can in an Infobase. But that takes a lot of work. "Usually they have a full-time person working on keeping everything grouped appropriately," said Webb.
"Text searching works well if you're looking for a document with a person's name," said Thede. "Then it works like a charm." It works well if you're looking for a subject matter where people tend to use the same terms consistently, such as terms of art in law, he said. "It works less well if you're looking for a concept that people express in lots of different ways."
Thede drew a distinction between, first, searching your own personal documents, and second, searching a database that is unfamiliar.
The "looking-for-something-I-forgot" search of your own work is much easier, said Thede. "Somebody asks you a question, and it rings a distant bell," he said. "You remember you did something on it a few years ago. You remember enough to search on a name, or a concept."
"In many cases," said Thede, "I can find the answer while I've got the person on the phone."
And indeed searching your own documents is one of the most popular and successful applications of text search programs in the law office.
"I don't know anything about computers, so I was the computer-illiterate tester," she said. "I don't even know DOS."
She would typically review an application for compliance with the law, and then write a memorandum to her supervising attorney, or to the Board, discussing any relevant legal issues that arose.
"Suppose the Community Reinvestment Act (CRA) were an issue in an application," she said. "Suppose a community group protested an application alleging that the Home Mortgage Disclosure Act (HMDA) data indicates discrimination against minorities." She would then write a memo addressing that issue.
"I would want to find the language I used in the last order on that CRA issue," she said. Not only is it easier, but the Fed wanted her to be "as conservative as possible and use as much precedent as possible," so she tried to use the same language as much as possible. Using dtSearch, she would search for:
(community reinvestment act or cra) and (home mortgage disclosure act or hmda)
"A little box pops up with search results sorted by name, date or number of hits," she said. "I would run it by reverse chronological order, so the newest ones were on top." She could view the documents in a window, copy the language from different files, and save the text to an ASCII file for import into WordPerfect.
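The logic of that query can be sketched in a few lines of Python. This is only an illustration of how an AND of two OR groups works, with the hits sorted newest first; it is not how dtSearch is implemented, and the file names, dates and memo texts below are invented for the example.

```python
import re
from datetime import date

# Hypothetical stand-ins for a folder of past orders and memos.
MEMOS = [
    {"name": "order_a.txt", "date": date(1992, 3, 1),
     "text": "The Community Reinvestment Act record and HMDA data were reviewed."},
    {"name": "order_b.txt", "date": date(1993, 6, 15),
     "text": "The protest cited the CRA and the Home Mortgage Disclosure Act."},
    {"name": "order_c.txt", "date": date(1993, 9, 9),
     "text": "This order concerns the CRA only."},
]

def contains(text, phrase):
    """Case-insensitive whole-word match, so 'cra' does not match 'scrap'."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def matches(text, groups):
    """AND across groups, OR within each group."""
    return all(any(contains(text, p) for p in group) for group in groups)

def search(memos, groups):
    """Return matching memos, newest first."""
    hits = [m for m in memos if matches(m["text"], groups)]
    return sorted(hits, key=lambda m: m["date"], reverse=True)

# (community reinvestment act or cra) and (home mortgage disclosure act or hmda)
QUERY = [
    ("community reinvestment act", "cra"),
    ("home mortgage disclosure act", "hmda"),
]

results = [m["name"] for m in search(MEMOS, QUERY)]
print(results)
```

The third memo mentions only the CRA, so the AND drops it; the other two come back with the 1993 order on top.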
This illustrates two of the ways text search programs work best: first, searching your personal documents, second, searching for concepts that can be defined by standard legal terms.
For example, Goudge wanted to look up a rule about prompt service to an employer and employee. One was promptly served; the other was not promptly served and is therefore not liable. Is the remaining party liable? So he searched for:
(diligence and dismissal) w/40 (employer or employee or principal agent or respondent superior) and w/40 (estoppel or res judicata)
"ZyIndex has a thesaurus," said Goudge. "You cursor on 'employee' and get a whole bunch of synonyms." In five minutes, he found a pleading on point, even though it was done four years ago by an attorney who has since left the firm. The answer: "Neither of the two can be liable," said Goudge.
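A proximity operator such as w/40 asks whether two sets of terms occur within a given number of words of each other. The sketch below shows one plausible reading of that idea in Python; the exact rules ZyIndex applies are not described here, so the function name, the word-counting rule and the sample sentences are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def within(text, left_terms, right_terms, max_gap):
    """True if any left term occurs within max_gap words of any right term."""
    words = tokenize(text)
    left_pos = [i for i, w in enumerate(words) if w in left_terms]
    right_pos = [i for i, w in enumerate(words) if w in right_terms]
    return any(abs(i - j) <= max_gap for i in left_pos for j in right_pos)

LEFT = {"diligence", "dismissal"}
RIGHT = {"employer", "employee"}

# Hypothetical sentences: one with the terms close together, one far apart.
near = "The dismissal of the employee for lack of diligence does not bar the claim."
far = "dismissal " + "filler " * 50 + "employer"

print(within(near, LEFT, RIGHT, 40))
print(within(far, LEFT, RIGHT, 40))
```

Widening or narrowing the gap trades hits for precision, which is why the same search can succeed at 40 words and fail at 4.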
"Lawyers tend to talk with certain buzzwords," said Goudge. "'Open and obvious.' 'Trespasser.' 'Licensee.' 'Invitee.' If you're talking about a bus, it's 'highest duty of care.' You hit these key words and it comes on out. I know what the jargon is that's used by lawyers that practice in this area, and by judges when they write opinions in this area."
This illustrates another way in which text searching works well: finding fact situations, like those associated with "swings."
"The classic example is the entire body of U.S. case law," said Thede. "It is just tremendous."
A common technique is to perform one search, retrieve some documents, see what phrases appear, and use those phrases for another search. This is another technique that looks easy, and works well on documents that are written by lawyers, but can break down in large discovery databases.
"The general concept is to look at the pattern of words in the document and turn that into a manageable thing that you can store and search on," said Thede. You examine the documents, and try to find words that are relatively frequent in those documents but relatively infrequent in the entire database, he explained. It seems possible to write an algorithm that would allow a computer to do that automatically, but that goal has been elusive.
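A crude word-counting version of that idea can be sketched as follows. It weights each word by how often it appears in the sample documents, discounted by how many documents in the whole collection contain it. The weighting formula, the function names and the five one-line documents are assumptions for illustration, not the method of any product mentioned here, and the sketch knows nothing about what the words mean.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def distinctive_terms(sample_docs, all_docs, top_n=3):
    """Rank words that are frequent in the sample but rare in the collection.

    Assumes every sample document is also part of all_docs.
    """
    sample_counts = Counter(w for d in sample_docs for w in tokenize(d))
    doc_freq = Counter(w for d in all_docs for w in set(tokenize(d)))
    n_docs = len(all_docs)
    scores = {
        w: count * math.log(n_docs / doc_freq[w])
        for w, count in sample_counts.items()
    }
    # Highest score first; ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_n]]

# Hypothetical miniature collection.
ALL_DOCS = [
    "the brake design was revised by the subcontractor",
    "the brake test report was sent to the contractor",
    "the meeting was held in the office",
    "the invoice was sent to the office",
    "the schedule was revised by the contractor",
]
SAMPLE = ALL_DOCS[:2]  # documents retrieved by a first search

terms = distinctive_terms(SAMPLE, ALL_DOCS)
print(terms)
```

Words like "the" appear in every document and score zero, while "brake" rises to the top. On real discovery documents the ranking is far noisier.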
"It takes more intelligence than computers have today to recognize those documents, because you have to know the English language," said Thede. "You have to know what the words mean. The next frontier in text retrieval is to actually have the computer understand, in some sense, the content of the document, rather than treating it as a bunch of words it doesn't understand at all."
That frontier is a long way off. But, said Thede, "I think there are intermediate points."
"We searched through our Agenda data base," said Beckman. "All kinds of things came up--interviews with the nurses, depositions, interviews with families. One of the nurses had to leave the room because she couldn't stomach to watch it." This material was very useful in negotiation.
"We came up with things that didn't have the word 'pins' in them," said Beckman. "I couldn't figure out how Agenda found them. The manual says that, if there are items out there that don't match the word 'pin,' but have a lot of other items you've selected, it'll pick that item." For example, there were names of instruments used in taking the pins out.
"Nowhere in this data base did we set up a keyword that said 'pin,'" said Beckman.
The documents in a large unstructured discovery database, like the one used in the BART litigation, are far more difficult to search, because engineering terms are not as standardized, and engineers don't write as redundantly as attorneys. Engineers will frequently write to each other about mutually-understood concepts which need not be explicitly mentioned at all.
Blair often found himself following "a trail of linguistic creativity through the database," as he described in his paper. In searching for documents discussing "trap correction," they discovered other documents referring to it as "wire warp." Other documents referred to it as the "shunt correction system." The inventor was named "Coxwell" and documents he had written were in the database, but Coxwell referred to it as the "Roman circle method." The system had been tested in another city, where it was referred to as an "air truck." Finally, after 40 hours of searching, with "no reason to believe that we had reached the end of the trail," they ran out of time and quit.
One allegation was that a construction company had ordered excessive quantities of steel, and reabsorbed the steel into their inventory, said Blair. The lawyers wanted to search all the documents for the key phrase, "steel quantity." But engineers don't use the term "steel quantity." They will refer to "girders," "beams," "braces," or "frames." You have to know that these are all steel, and you have to translate your requests into those terms, he said.
"OK," some lawyers say, "you need a thesaurus." Many text search programs will expand your search term with a default dictionary of synonyms, or let you build a custom dictionary. But that didn't work either with this large database.
STAIRS had a thesaurus, said Blair. "They hired an engineer to spend a year and a half," to manually decide that one word was related to another word. "He did a pretty good job." But even with the thesaurus, the system "was not able to retrieve a single document that was relevant that could not be retrieved in the other way."
"The problem was that the engineers and lawyers used somewhat different vocabularies to talk about the same things," said Blair. "A thesaurus based on the engineers' vocabulary missed many of the words and semantic relationships that existed in the lawyer's vocabulary," he said. The engineer linked the terms "girders," "beams," "braces," and "frames," but he didn't link it to "steel quantity." Engineers, said Blair, don't talk about "quantities of steel."
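The mismatch Blair describes is easy to reproduce in miniature. The sketch below expands a search term through a thesaurus that links only the engineers' words; the thesaurus contents, function names and sample documents are hypothetical and are not taken from STAIRS.

```python
# Hypothetical thesaurus built from the engineers' vocabulary only.
THESAURUS = {
    "girders": {"beams", "braces", "frames"},
}

def expand(term, thesaurus):
    """Return the term plus every word linked to it, in either direction."""
    related = {term}
    for key, synonyms in thesaurus.items():
        group = {key} | synonyms
        if term in group:
            related |= group
    return related

def search(term, docs, thesaurus):
    """Return documents containing the term or any linked word."""
    terms = expand(term, thesaurus)
    return [d for d in docs if any(t in d.lower() for t in terms)]

# Hypothetical engineering memos.
DOCS = [
    "Ship forty girders and twelve braces to the site.",
    "Beams delivered Tuesday.",
]

print(search("girders", DOCS, THESAURUS))
print(search("steel quantity", DOCS, THESAURUS))
```

The engineer's term finds both memos through the thesaurus. The lawyer's phrase finds nothing, because no one linked it to anything.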
It turned out that the documents relevant to the steel issue were the bills of materials, which are listings of the material delivered by a subcontractor. "What you didn't want was people just talking in general about steel," said Blair.
But, had they known, one search in the BART database would have been easy. People occasionally referred to a critical, embarrassing issue as a "smoking gun." So all you had to do was search for the key words "smoking gun," and you'd have it.
"When a lawsuit occurs revolving around the activities of people in a particular department, very often the law firm will say, 'Put everything that came out of that department during this time frame onto the computer system.' That's a bad move. Decide what's relevant, don't just gather everything," said Blair. "Everything can't be relevant."
If the documents start out with any kind of organization, keep that organization available, said Blair. For example, if you get documents in response to discovery questions, code them in a field to indicate the questions.
If they had kept the 13 issues distinct, then instead of searching the entire 40,000 documents, they could have directed their queries to a much smaller partition of documents, all relevant to, say, the Fremont accident, explained Blair. Searching the partition, "you can tolerate more ambiguity in the queries, and more sloshing around in the searching, because everything in the smaller database is related to an issue."
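Partitioning of this kind amounts to coding each document with an issue field and restricting the search to that field's value. A minimal sketch, assuming a simple list of records with made-up issue labels and texts:

```python
# Hypothetical records; the "issue" field is coded when the document comes in.
DOCS = [
    {"id": 1, "issue": "accident", "text": "We all know why we're here."},
    {"id": 2, "issue": "brakes", "text": "Brake test results attached."},
    {"id": 3, "issue": "accident", "text": "The unfortunate incident of last Tuesday."},
    {"id": 4, "issue": "steel", "text": "Incident report on late girder delivery."},
]

def search(docs, word, issue=None):
    """Search the whole collection, or only one issue's partition."""
    pool = [d for d in docs if issue is None or d["issue"] == issue]
    return [d["id"] for d in pool if word in d["text"].lower()]

def partition(docs, issue):
    """List every document coded to an issue, whatever words it uses."""
    return [d["id"] for d in docs if d["issue"] == issue]

print(search(DOCS, "incident"))
print(search(DOCS, "incident", issue="accident"))
print(partition(DOCS, "accident"))
```

The vague word "incident" pulls in an unrelated steel document when run against everything, but not when run against the accident partition. Listing the partition also surfaces document 1, which no keyword would have found.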
Documents are often clustered together, to perform a particular activity, such as a contract negotiation, said Blair. So you should be able to link those documents together, even though they may not have a distinctive key word in common. In the BART suit, for example, brake design was an issue, and the attorneys wanted to follow the negotiations between prime contractor and subcontractor over brake design. For the delivery of steel, the cluster of documents that recorded the quantity and price was the bills of materials.
If you submit a query to a collection of 1,000 documents, you might get 50 documents back. "That's not a problem," he said. "I'll just paw my way through." But if you submit that same query to a collection of 100,000 documents, you might get 5,000 documents back, which is useless. This problem of information retrieval is known as "output overload."
The software enables you to add restrictive Boolean terms to a query until it reduces the responses to a manageable number. But with each new term, relevant documents are excluded. Most people realize that they're sacrificing something, but when Blair worked out the mathematics, the extent was "quite startling." With five search terms, using some reasonable assumptions, a query should yield only 1 relevant document in 1,000, he calculated.
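The arithmetic behind that kind of result can be sketched under a simple assumption: if each search term independently appears in a given relevant document with probability p, then a query that ANDs n terms together matches that document with probability p to the power n. The value p = 0.25 below is an illustrative assumption chosen to reproduce the order of magnitude, not a figure from Blair's calculation.

```python
def expected_recall(p_term, n_terms):
    """Chance that a relevant document contains all n ANDed terms,
    assuming each term appears independently with probability p_term."""
    return p_term ** n_terms

P_TERM = 0.25  # assumed for illustration only

for n in range(1, 6):
    recall = expected_recall(P_TERM, n)
    print(f"{n} term(s): {recall:.6f}")
```

With five terms the figure is 0.25 to the fifth power, or 1 in 1,024, which is roughly 1 relevant document in 1,000.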
Goldner has used InMagic on a Gateway 2000 Nomad laptop at depositions for over a year now. He's been preparing his computer for trial, but so far the opponents have settled first.
The last one was the "music video case," Frank v. Whitesnake, in the Eastern District of Pennsylvania. Goldner's client was a photographer who claimed to have fallen on electrical cabling which was concealed beneath the stage during the shooting of a music video. She fractured her wrist, which did not heal properly and interfered with her work.
The key word was "cable," said Goldner. "When I typed in 'cable', I was overwhelmed with responses," he said. He narrowed it down by looking for references to the color of the cable. "The witness said the cable was orange. Everyone [on the defense] denied that there was anything but black cable."
"My opponents were aware of my clicking through the deposition digests, and, with my Sharp Wizard, and cellular phone, they regarded me as something of a nerd," said Goldner. They made jokes about what would happen if he were struck by lightning. But juries are prepared for computers, he thinks. "L.A. Law routinely shows lawyers at counsel table with laptops," he notes. When the judge said, "Let's pick a jury," he said, "Fine, I'm ready to go." The computer helped him call their bluff, and they settled, Goldner says.
Goldner keeps a log of requests and answers to interrogatories, and other trial details, in his InMagic database, so when someone claims he never sent an answer, he can point to the file. "A lot of insurance adjusters do that on their computers too," he said. "Occasionally we'll get into a log-reading war."
With the judge waiting, Tredennick searched for "patients" within 4 lines of "sent"--and found nothing. So he searched again, this time simply for "patients," and found interrogatory number 17, in which the plaintiff's president had sworn that they had only sent nine patients. He could have found the same interrogatory manually, overnight, but it had much more impact to point it out immediately, he felt. The judge was displeased at the contradiction. The plaintiffs lost and Tredennick's client won a $14 million counterclaim.
###