“Tell Me a Story” Issues on the Design of Document Retrieval Systems

Daniel Gonçalves, Joaquim Jorge
Computer Science Department, Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
djvg@gia.ist.utl.pt, jorgej@acm.org

Abstract. Despite the growing numbers and diversity of electronic documents, the ways in which they are cataloged and retrieved remain largely unchanged. Storing a document requires classifying it, usually into a hierarchic file system. Such classification schemes aren’t easy to use, causing undue cognitive loads. The shortcomings of current approaches are mostly felt when retrieving documents. Indeed, how a document was classified often provides the main clue to its whereabouts. However, place is seldom what is most readily remembered by users. We argue that the use of narratives, whereby users ‘tell the story’ of a document, not only in terms of previous interactions with the computer but also relating to a wider “real world” context, will allow for a more natural and efficient retrieval of documents. In support of this, we describe a study where 60 stories about documents were collected and analyzed. The most common narrative elements were identified (time, storage and purpose), and we gained insights on the elements themselves, discovering several probable transitions. From those results, we extract important guidelines for the design of narrative-based document retrieval interfaces. Those guidelines were then validated with the help of two low-fidelity prototypes designed from experimental data. This paper presents these guidelines whilst discussing their relevance to design issues.

1 Introduction

In recent years, computer hardware has become increasingly cheap. As a consequence, people tend to use computers not only at work, but also at home. Furthermore, PCs are losing their dominance and laptops or PDAs are ever more commonly used in all settings.
Moreover, the advent of ubiquitous, pervasive computing will only increase the number of devices available from which documents can be handled. Because of this trend, more and more often users edit and store related documents in different locations. Thus, new tools that allow users to more easily find a specific piece of information, regardless of where they are, or to visualize the Personal Document Space (PDS) as a whole will soon become imperative. One of the major challenges of HCI in the upcoming years will revolve around these issues, as pervasive computing becomes a reality [1] [2] [13].

The biggest problem with current hierarchic organization schemes is that they continuously require users to classify their documents, both when they are named and when they are saved somewhere in the file system. Such approaches force users to fit their documents into specific categories. Also, since users know that a good classification determines their ability to later retrieve the documents, classifying ever increasing numbers of documents becomes a painful task, causing undue cognitive loads while choosing the category in which each document should be placed. This was first recognized by Thomas Malone [12] in his groundbreaking work, where two main document organization strategies were identified: files and piles. In files, documents are classified according to some criteria, whereas piles are ad-hoc collections of documents. The latter were shown to be more common due to the difficulties inherent to the classification task. Nowadays, similar results are found not only for documents on computers but also for other applications in which hierarchic classification has become the primary information organization strategy. Such is the case of email, where it was found [4] that most users’ inboxes are often filled with large numbers of messages, given the difficulty and reluctance in classifying them into other folders.
However, despite the apparent lack of classification, the same study found that users think it easier to find email messages in the inbox than to find a document on the file system. This is because email messages are associated with useful information elements, ranging from the sender of a message to when it was sent and what messages were received at about the same time. This causes some people to overload their email tools to work as To Do lists or to maintain sets of unread documents [14]. Even considering that email tools were not designed with those ends in mind, the trade-off in relation to traditional applications seems to be positive.

This shows the importance of information other than a name or classification for retrieving documents. Users more readily remember other contextual, real world information than some arbitrary classification made months or years ago.

Several works try to make use of such additional information to help users retrieve their documents. One of the first was Gifford’s Semantic File Systems [7], where properties are associated with documents, either automatically inferred (from email headers, for instance) or explicitly created by users. Documents can then be found in ‘virtual folders’, whose contents are determined by queries on the defined properties. This work inspired others such as Dourish et al.’s Placeless Documents [4] and Baeza-Yates et al.’s PACO [3], where enhancements such as support for multiple document locations and management of shared documents can be found. Other works, such as Freeman and Gelernter’s Lifestreams [6], recognize the importance of temporal information, presenting all documents in an ordered stream.

Although they alleviate some of the problems users must face, those approaches introduce new problems. Property-based systems require users to handle (and remember) arbitrary sets of properties.
Furthermore, each property is an isolated piece of information with no apparent relation to the others. Temporal-based approaches disregard other kinds of information. An integration of the several relevant information elements that could help users in finding their documents is lacking. The most natural way in which users can convey that information to someone is in the form of stories or narratives.

Humans are natural-born storytellers. Stories have been told since early times, first in oral tradition and later in written form. Elements in a story do not appear separately but as part of a coherent whole. The relations between those elements make the story easier to remember. An interface that takes advantage of those abilities and allows users to tell a story describing a document in order to retrieve it will allow for a more natural and efficient interaction.

The design of such an interface should take into account not only the most common and expected elements in a narrative, but also how they inter-relate. This will allow it to know what shape the stories might have, what will come up next at any given point in the narrative, and what information users might remember even if it wasn’t volunteered in the first place, resulting in a dialogue that is natural, informative and not awkward. Thus, it is important to find out exactly what document-describing stories are like.

To correctly address the aforementioned challenges, we performed a set of interviews in which several stories describing documents were collected and analyzed. This allowed us to extract patterns for common narrative elements and the ways in which they are used. Some recurrent story structures were found. From those, we extracted a set of guidelines that systems for narrative-based document retrieval should follow to correctly address the users’ needs.
Ultimately, we envision the design of a system that continuously gathers information about the users’ interactions with their documents and whose narrative-based interface is able to extract vital information about the documents from the users, allowing the documents to be retrieved.

We’ll start by describing how the study was conducted. Next, we’ll analyze the results thus obtained. Then we will present the design guidelines, and how they were validated. Finally, we’ll discuss the main conclusions and possible future work in the area.

2 Procedure

With this study, we tried to answer two main research questions: (1) in document-describing stories, what are the most common elements? (2) how do they relate to form the story? To find the answers, we conducted 20 semi-structured interviews. The volunteers were interviewed at a time and place of their choice (previously arranged), often in their own offices or other familiar environments, to set them at ease. We asked for their consent to record the interviews.

Of the 20 subjects we interviewed, 55% were male and 45% female, with ages ranging from 24 to 56. Academic qualifications spanned all levels, from high school to PhD. Their professions were also fairly diversified: computer science engineers, high-school teachers, law students, an economist, a social sciences professor, etc. This accounts for the wide range of computer expertise we found, from programming skills to sporadic use of common applications (such as Microsoft Word). Overall, we feel we collected data from a diverse sample that won’t unduly bias the results.

After the study was explained to them, the subjects were asked to remember specific documents from three different classes and to tell stories describing them. Those classes were: Recent Documents, on which the user worked in the past few days or weeks; Old Documents, worked on at least a year ago; and Other Documents, not created by the user.
They were chosen to allow us to evaluate the effect that time might have on the nature and accuracy of the stories (regardless of their correctness, since real documents were not available to validate them), and to find if stories are remembered differently for documents not created by the users themselves, since their interaction with those documents was different.

We didn’t provide actual documents to be described because that would require the interviewer to have access to the subject’s computer in order to choose those documents. Previous experiments [8] showed that users are reluctant to allow that kind of intrusion. Also, preliminary test interviews showed computers to be distracting during the interviews, resulting in stories of poor quality. Furthermore, asking interviewees to remember the documents to be described better mimics the situations in which they might want to find a document in everyday life.

For each document, the interviewees were instructed to “tell the story of the document”, and to recall all information they remembered about it. We specifically stressed that information beyond that resulting from interaction with the computer itself was important. Additional questions regarding several expected elements were posed in the course of the interview. They were asked only when the interviewees seemed at a loss for anything else to say, to see if some other information could still be elicited from them, or whenever they had started talking about some unrelated subject and we needed to bring them back to describing the document at hand. Three test interviews were conducted to tune and validate this procedure.

Stories usually took five minutes to be told. Their transcripts averaged two to three plain text pages, although some users told longer stories.
A typical story might start like this translated excerpt from a real interview:

Interviewer: So, now that you have thought of a document, please tell me its story…
Interviewee: It’s a paper I had sent to my supervisor. We had sent it to a conference some time ago. It was rejected… meanwhile I had placed the document on my UNIX account…

3 Interview Analysis

All interviews were subjected to a Content Analysis [15]. We coded for several elements we expected to find in the stories (Table 1). New elements could be considered if required during the analysis process. As it turned out, no new elements were necessary after the initial encoding. Since the users were free to tell their stories as they chose, we’re fairly confident that we considered all relevant elements.

Table 1. Story Elements

Time         Place          Co-Authors    Purpose        Author    Subject
Other Docs.  Personal Life  World Events  Doc Exchanges  Doc Type  Tasks
Storage      Versions       Contents      Events         Name

Content analysis is often performed by defining a coding dictionary which contains, for each specific word or expression that might occur in the interviews, the class to which it belongs [11]. In our domain such a dictionary could contain an entry stating that the occurrence of the word “hours” is a reference to a “Time” element. This approach would allow the encoding to be made automatically. However, it requires the researcher to anticipate all relevant words or expressions that might appear. This was impossible in our experiment since the subjects were free to say whatever they chose about documents previously unknown to us. Hence, no coding dictionary was used. Instead, we conducted the coding manually with the help of a set of heuristic rules that clearly define what should belong to each category, considering not only specific words or expressions but also their meanings.
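
For illustration only, the dictionary-based coding described above (the approach that was set aside in this study) can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure we used: only the “hours” to “Time” entry comes from the text, while the remaining dictionary entries and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical coding dictionary: word -> story element.
# Only "hours" -> "Time" is taken from the text; the other
# entries are illustrative assumptions.
CODING_DICTIONARY = {
    "hours": "Time",
    "days": "Time",
    "folder": "Storage",
    "disk": "Storage",
}


def code_transcript(transcript):
    """Return the story elements matched in a transcript, in order."""
    words = re.findall(r"[a-z]+", transcript.lower())
    return [CODING_DICTIONARY[w] for w in words if w in CODING_DICTIONARY]


def element_frequencies(transcript):
    """Count how many times each story element was matched."""
    return Counter(code_transcript(transcript))


sample = "I worked on it for hours, two days ago, and saved it in a folder."
codes = code_transcript(sample)
frequencies = element_frequencies(sample)
```

Such a lookup only recognizes words that were anticipated in advance, which is exactly the limitation that led to manual coding here.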

We coded for frequency rather than for occurrence, since frequency can give us an estimate of the relative importance of the elements in terms of the amount of information of each kind in the stories. Also, we took notice of which elements were spontaneous (proposed by the interviewees) and which were induced (promptly remembered by the interviewee after a question or suggestion from the interviewer). We also considered that not knowing something is different from knowing something not to have happened. An element was recorded only in the latter case. For instance, some users remembered that a document had no co-authors, while others couldn’t remember if that was the case or not.

We also performed a Relational Analysis [15] to estimate how the several elements relate in the story. We considered the strength of all relationships to be the same. The direction of the relationships was given by the order in which the elements appear in the story. The sign of a relationship (whether two concepts reinforce or oppose each other) wasn’t considered since it isn’t relevant in this case. This allowed us to create a directed graph whose nodes are story elements, whose arcs represent the relationships between those elements, and whose arc labels contain the number of times the corresponding transition was found. No transition was considered when the destination element was induced, since in that case no real connection between the elements existed in the interviewee’s mind.

4 Results

Overall, we collected and analyzed 60 different stories, 20 for each document type. We produced not only quantitative results relating to the relative frequencies of the different story elements and transitions between those elements, but also qualitatively analyzed the stories’ contents. We took care to compare stories for different document kinds. Finally, we were able to infer archetypical stories about documents. Several statistical tests were used whenever relevant.
In what follows, all quantitative values are statistically significant at the 95% confidence level. More results can be found in the experiment’s technical report [9].
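
The directed graph built in the Relational Analysis of Section 3 can be sketched as follows. This is a minimal sketch under stated assumptions: each coded story is represented as an ordered list of (element, induced) pairs, a representation chosen here purely for illustration, and the function name and sample stories are hypothetical. All transitions carry the same strength, direction follows the order of the elements in the story, and transitions into induced elements are skipped.

```python
from collections import Counter


def transition_counts(stories):
    """Count element-to-element transitions over a set of coded stories.

    Each story is an ordered list of (element, induced) pairs, where
    `induced` is True when the element was prompted by the interviewer.
    The result maps (source, destination) arcs to the number of times
    that transition was found, i.e. the arc labels of the graph.
    """
    counts = Counter()
    for story in stories:
        for (src, _), (dst, dst_induced) in zip(story, story[1:]):
            if dst_induced:
                # No transition is considered when the destination is induced.
                continue
            # Assumption: an induced *source* still counts, since only the
            # destination is excluded in the description of the analysis.
            counts[(src, dst)] += 1
    return counts


# Made-up coded stories, for illustration only.
stories = [
    [("Time", False), ("Purpose", False), ("Co-Authors", True), ("Storage", False)],
    [("Time", False), ("Purpose", False), ("Storage", False)],
]
graph = transition_counts(stories)
```

Keeping the arcs in a counter keyed by (source, destination) makes it straightforward to list the most frequent transitions, which is the kind of information an interface would need to anticipate what comes next in a story.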