Evaluation of Visual Information Browsing Displays

by

Emile L. Morse

B.S., Duquesne University, Pittsburgh, PA, 1968

B.S., University of Pittsburgh, Pittsburgh, PA, 1985

M.S.I.S., University of Pittsburgh, Pittsburgh, PA, 1993

Submitted to the Graduate Faculty of

the Department of Information Science and Telecommunications

of the School of Information Sciences in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

University of Pittsburgh

1999

 

Evaluation of Visual Information Browsing Displays

Emile L. Morse, Ph.D.

University of Pittsburgh, 1999

ABSTRACT

Visual displays as components of information retrieval systems are widely deployed yet have had little testing with users. The testing that has been performed has struggled to separate effects that are due to the interface from effects that are attributable to the user's understanding of the visual display itself. The approach of this work is to test displays in isolation. The question being addressed is: what do people understand about a document set given a particular way of viewing the documents and the relations among them? The central idea behind the testing paradigm is to perform bottom-up layered testing.

Preliminary studies were conducted using Boolean data about a document collection. Prototype displays mimicked lists of text or words, lists of icons, numeric tables, x-y graphs, and a visual display (called a 'spring') that implemented the placement algorithm found in the VIBE visual information browsing system. Displays were presented in random order to 439 subjects who performed a series of Boolean tasks. Subjects also ranked the displays according to their personal preferences. The results showed that 1) all the displays were very learnable; 2) visual displays (icons and 'spring') were preferred; and 3) more difficult scenarios were handled most effectively with the 'spring' visual display.

The current studies extend the testing paradigm to data that represents documents as vectors. The aim of this study is to determine how well users can understand displays based on vector representations. The prototype displays were presented to 192 subjects using computer-mediation. A taxonomy of visual tasks guided development of task sets. Subjects performed visual tasks better with the prototype visual display than with any of the other display types and they preferred displays that supported their performance. In summary, the utility of a visual taxonomy to guide visual display and interface development is supported. In addition, the use of this bottom-up, layered approach has been validated in Boolean and vector conditions and at various levels of difficulty within these categories. This evaluation method might be extended in many directions, including 1) the testing of 3-dimensional prototypes, 2) testing more complex keyterm combinations, and 3) testing the successive application of interface features.

 

Dedication

This dissertation is dedicated to the memory of Robert R. Korfhage -- a kind and gentle man, an intelligent and insightful teacher and advisor, and a personal role model. He led his many students with grace, humor, and the strength of his convictions.

Acknowledgments

So many people support and guide a student through the process of obtaining a doctorate; it is impossible to mention them all, but I would like to extend personal thanks to at least the following people.

Mike Lewis for encouraging without nagging and for financial, emotional and intellectual support.

Kai Olsen for great debates about how to test visual interfaces, especially VIBE, and for supplying student subjects for the preliminary studies.

Edie Rasmussen for introducing me to the folks at Digital Libraries and for subtle mentoring especially with respect to the issues of women in science.

Louise Su for many discussions, especially those that happened early on. She never made me feel inadequate even when I hardly knew how to ask my questions.

Jim Williams for many years of collegial support, for encouraging me to apply to the Masters program and for allowing me to contribute to some joint publications.

Michael Spring for so many things -- convincing me that I could succeed in the doctoral program, for giving me opportunities to work on projects and allowing me the freedom to choose my own direction within them, and for long discussions.

Dr. Tony Debons for sharing

I am grateful to Jeff Campbell and Terri Lenox for reading, proofing, debating issues, and contributing many suggestions that have been incorporated into this paper. Swantje Willms has proved to be a most capable editor. Wasu Chaopanon and Guoray Cai have never let me down when I needed to bat around some ideas. Rick Brienzo, Oliver Chen, Tammy Datri, Dave Dubin, Jane Greenberg, Susan Hahn, Jeff Jacobsen, Cindy Martincic, Linda Roberts, Bryan and Molly Sorrows, and Bill Yurcik are my dear friends and have helped in innumerable ways to make this process tolerable and my years at SIS enjoyable.

Last, but certainly not least, my husband Russ, my most enthusiastic supporter. He has promised to show me that there is 'life after graduate school.'

TABLE OF CONTENTS

1 Introduction

2 Background Literature

2.1 Visual Interfaces for Information Retrieval and Browsing

2.1.1 Description of Selected Visualizations

2.1.2 Analysis of Reference Point Visualizations

2.2 Evaluation of Visual Interfaces for Information Retrieval and Browsing

2.3 Task Models

2.3.1 Domain-Dependent models

2.3.2 Domain-Independent Visual Taxonomies

2.4 Representation of Text as Document Vectors

2.4.1 Text Analysis-the basic method

2.4.2 Refinements of the Basic Method

2.4.3 Alternative Methods for Encoding Documents

3 Preliminary Studies

3.1 2-term Boolean Study

3.2 3-term Boolean Study

3.2.1 Methods

3.2.2 Results

3.2.3 Discussion

4 Research Design

4.1 Problem Statement

4.2 Definitions

4.2.1 Information Retrieval

4.2.2 Visualization types

4.2.3 Display types

4.3 Hypotheses

4.4 Limitations

5 Methodology

5.1 Document data generation

5.2 Displays

5.3 Tasks

5.4 Subjects

5.5 Administration of test

5.6 Statistical Considerations

6 Results

6.1 Subjects

6.1.1 Demographic Characteristics

6.1.2 Computer Experience

6.1.3 Equipment Used for Experiment

6.1.4 Factors Related to the Test Situation

6.2 Comparison of Overall Time and Correctness Measures

6.2.1 Correlation with Demographic Variables

6.3 Instruction Times

6.4 Performance with Respect to Displays

6.4.1 Time to completion

6.4.2 Correctness

6.4.3 Summary

6.5 Performance with Respect to Order of Presentation

6.5.1 Time to completion

6.5.2 Correctness

6.5.3 Summary

6.6 Performance with Respect to Task Types

6.6.1 Effect of question types on performance

6.6.2 Question type vs. display type

6.7 Preferences

6.7.1 Rankings

6.7.2 Qualitative Ratings

6.7.3 Comments

6.8 Summary of Results

7 Discussion and Conclusions

7.1 Demographics

7.2 Differences among Display Types

7.3 Effect of Order of Presentation

7.4 Effect of Using Tasks Created from a Visual Taxonomy

7.5 Preferences

8 Summary of Future Directions

BIBLIOGRAPHY

 

LIST OF TABLES

Table 2-1: Reference point visualizations

Table 2-2: Information seeking dimensions (Belkin et al 1995)

Table 2-3: Visual task taxonomy (Zhou & Feiner 1998)

Table 2-4: Visual implications and related elemental tasks

Table 3-1: Effect of display type on performance (mean + SE)

Table 3-2: Preference ratings of various displays

Table 3-3: Performance as a function of question type

Table 5-1: Comparison of taxonomic categories

Table 5-2: First level mapping from taxonomic categories to generalized task statements

Table 5-3: Second level mapping from generalized to specific task statements - 2-term questions

Table 5-4: Second level mapping from generalized to specific task statements - 3-term questions

Table 5-5: Parameter number for specific tasks

Table 5-6: Segment of data captured from single subject in 3-term study

Table 6-1: Summary of demographic characteristics

Table 6-2: Summary of computer-related personal characteristics

Table 6-3: Characteristics of computer equipment used for the study

Table 6-4: Hardware used by participants working outside the laboratory (n=120)

Table 6-5: Miscellaneous factors

Table 6-6: Time spent on instruction page (seconds; Mean ± S.E.M.)

Table 6-7: Within-subject contrasts for 2-term displays

Table 6-8: Pairwise comparisons for 2-term displays

Table 6-9: Within-subjects contrasts for 3-term displays

Table 6-10: Pairwise comparisons for 3-term displays

Table 6-11: Effect of display type on time to complete task set - 2-term vs. 3-term

Table 6-12: Within-subject contrasts for 2-term displays -- correct

Table 6-13: Pairwise comparisons for 2-term displays

Table 6-14: Within-subjects contrasts for 3-term displays

Table 6-15: Pairwise comparisons for 3-term displays

Table 6-16: Correct answers -- comparison of 2-term and 3-term studies

Table 6-17: Mean time and correctness for individual questions

Table 6-18: Results of Kruskal-Wallis analysis of ranking data with respect to study type

Table 6-19: Percentage of subjects categorizing display according to various criteria

Table 6-20: Comment types for 2-term and 3-term studies

Table 6-21: Summary of hypothesis testing outcomes

Table 7-1: Comparison of display effectiveness across all studies

Table 7-2: Subject preferences across studies (percent of subjects)

 

LIST OF FIGURES

Figure 2-1: Examples of the visual elements of the reference point systems

Figure 3-1: Samples of presentation types: Panel A is a 'text list'; B is an 'icon list'; C is a 'table'; D is a 'graph'; and E is a 'spring' display.

Figure 3-2: Effect of order of presentation of various displays on performance of OR task (Morse et al 1998)

Figure 3-3: Prototype display used in 3-term Boolean study

Figure 4-2: Multidimensional scaling results of Lohse et al (1990) showing visual display types.

Figure 5-1: Specialized WebVIBE interface used to select term sets

Figure 5-3: Example of each of the five display types used in the 2-term study

Figure 5-3: Examples of each of the 4 display types used in the 3-term study

Figure 6-1: Age distribution

Figure 6-2: Relationship of time to completion and correctness for 2-term study.

Figure 6-3: Relationship of time to completion and correctness for 3-term study.

Figure 6-4: Comparison of time to completion of 10-question task set vs. display type.

Figure 6-5: Mean score per 10-question task display type

Figure 6-6: Time to completion with respect to order of presentation for 2-term study

Figure 6-7: Time to completion with respect to order of presentation for 3-term study

Figure 6-8: Effect of order of presentation on number of correct answers for 2-term

Figure 6-9: Effect of order of presentation on number of correct answers for 3-term

Figure 6-11: Preference rankings for 2-term displays

Figure 6-12: Preference rankings for 3-term displays

1 Introduction

Researchers in information retrieval (IR) have long searched for ways to make their systems more accessible to end-users and to develop new ways for users to explore data. Visualization techniques (computer methods for displaying large quantities of information graphically) appear promising as a means for achieving both goals. Information visualization can make multidimensional relationships that are difficult to extract from tabular data apparent to a trained searcher. However, unlike scientific visualizations, which are largely developed and used within elite specialties, IR visualizations are targeted toward guiding the public through newly accessible oceans of online information as well. Users employ many strategies when engaged in information seeking including bibliographical search, analytical search, search by analogy, empirical search, browsing, and check routine (Pejtersen 1988). Each of these activities might be augmented by using visualizations but browsing and analytical search are the strategies cited most frequently as benefiting from visual support (Marchionini 1995). According to Lin (1997), browsing is a superior strategy when

Work on IR visualization systems is at a relatively early stage. In the past ten years or so, systems such as Bead (Chalmers 1996), InfoCrystal (Spoerri 1993), and LyberWorld (Hemmje 1994), have been developed as visual information exploration tools to aid in retrieval tasks. Researchers at the University of Pittsburgh have contributed to the development of information visualization systems with VIBE (Olsen et al 1993), GUIDO (Nuchprayoon 1996), and BIRD (Kim & Korfhage 1994).

Previous user study research into IR visualization systems has largely focused on formative usability evaluations to provide feedback for system enhancement. Newby (1992) tested subjects with the SPACE IR visualization system and a standard text-based system called Prism. His findings showed that subjects preferred the Prism system for many of the tasks performed. Spoerri (1993) compared InfoCrystal's interface with a standard Boolean interface. Koshman (1996) conducted extensive comparisons between VIBE and AskSAM, a commercially available text-based retrieval interface, using expert online searchers recruited from local libraries and novice searchers. Results showed that performance differences between AskSAM and VIBE were minimal. Novices and experts performed the test tasks with similar speed and accuracy. There was some evidence that particular tasks were better suited to performance in the graphical environment while others were more appropriate for the text-based tool.

Results of the preference measures were disturbing in that users in both groups rated the textual interface slightly higher, whereas the investigator states that she had expected a preference for the graphical interface whatever its effect on performance. She further suggests that the expanded capabilities offered by visual browsing are masked in comparisons of this sort, which limit tasks to those that can be performed using text-based interfaces. For tasks expressible as Boolean queries, a text-based interface may be more direct, less complicated, and less difficult to learn than a full-featured visual information retrieval interface (VIRI).

Fully-featured visualizations are complex to evaluate for several reasons. First, choosing a control system is difficult. Second, performance is related not only to the tasks that are performed but also to the mode of interaction and to the choice of feature for performance of the task. Dissecting out the critical factor is impossible. Third, the presence of many features requires considerable training so that subjects are fully capable of selecting the proper tool for a task. Fourth, tasks are normally devised to test features of the interface rather than to test tasks that occur in a natural information-seeking setting.

How could a study be designed to test visualizations in relative isolation? We first pursued the idea of 'defeaturing' interfaces so that users could learn the remaining core functions quickly (Morse & Lewis 1997). Our preliminary usability evaluations demonstrated that training problems, comprehension problems, performance problems and "ratings" problems seemed to be diminished for the defeatured interfaces.

An even more rigorous defeaturing is possible -- one in which just the visualization remains. This 'back to basics' approach guides the overall development of the current research, including the preliminary studies based on Boolean data. It is a bottom-up testing paradigm. If suitable sets of displays can be devised and if appropriate tasks can be developed, then it should be possible to test increasingly complex situations and compare the results within a level of difficulty as well as across levels. Therefore, a plan has been developed to examine simple Boolean displays followed by simple vector displays. Preliminary studies performed in the Boolean mode will be presented in Chapter 3. Both 2-term and 3-term conditions were tested in separate experiments and the results are supportive of the approach.

The next level of difficulty would be to move to a vector representation of documents. Weighted vectors underlie most modern retrieval systems. Documents in these systems are represented as vectors whose elements record term occurrences. Adjustments of various types are often applied to the vectors to account for document length or other factors, but each document is still characterized as a collection of numeric values. Text-based systems can compare these weights with user queries, but few systems support users in constructing weighted queries. Visual representations can show users all the documents that contain any of the terms, which documents contain which terms, and information about the weighting of the terms.
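As a minimal sketch of such a representation, the following Python fragment builds raw term-occurrence vectors over a fixed set of keyterms and then applies one possible adjustment for document length (occurrences divided by the number of tokens). The function names, the whitespace tokenizer, the sample text, and the choice of normalization are all assumptions made for illustration; they are not taken from any of the systems cited.

```python
from collections import Counter

def term_vector(text, keyterms):
    """Count how often each keyterm occurs in a document (raw weights)."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    counts = Counter(tokens)
    return [counts[t] for t in keyterms], len(tokens)

def length_normalized(vector, n_tokens):
    """One simple adjustment for document length: occurrences per token."""
    if n_tokens == 0:
        return [0.0 for _ in vector]
    return [v / n_tokens for v in vector]

# Illustrative (made-up) document and keyterms.
keyterms = ["visual", "retrieval"]
doc = "visual displays support visual browsing and retrieval"
raw, n = term_vector(doc, keyterms)
weights = length_normalized(raw, n)
```

Here `raw` holds the unadjusted counts and `weights` the length-adjusted values; a Boolean representation would correspond to replacing each nonzero count with 1.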

The results of the preliminary studies clarified the factors that influence the ability of subjects to make appropriate inferences from various presentations of document data. These factors include the type of display, the type of task, the order in which the display types are presented, and the number of keyterms embedded in the displays. Each of these elements forms the basis for the fundamental hypotheses formulated in this work. In addition to measuring the ability of subjects to perform in this test environment, these subjects were also probed for their preferences regarding the various displays.

The next chapter will discuss the salient literature that underlies the design of the current studies. Chapter 3 will present two major preliminary studies based on a Boolean representation of documents. Chapter 4 will outline the overall research design of the current investigations. In Chapter 5 the methodology regarding all aspects of implementing the study design will be discussed. The results of the studies will be presented in Chapter 6, followed by a chapter that discusses the major findings of this set of studies and compares and contrasts these results with those of the preliminary Boolean studies. Finally, Chapter 8 will present conclusions and directions for future study.

2 Background Literature

Previous work pertinent to this research falls into four major areas. First, it is important to know what types of visualizations have been developed for information retrieval and browsing. Section 2.1 provides a discussion of the various systems. Second, the extent to which evaluation has been applied to such interfaces needs to be determined. Although there are quite a few visual interfaces for IR, there have been few user studies. Section 2.2 discusses the details of the designs of the user studies performed to date. Third, visual task taxonomies and information retrieval taxonomies will be discussed. Task models can be constructed at many levels of granularity. The library literature is replete with high-level models of user strategies. Section 2.3.1 discusses some of these. It is difficult to determine from these models what elemental tasks might be selected for testing visual interfaces. However, it is clear from scanning the categories proposed that visual tools could be applied as alternatives to text-based presentations. At a much lower level of task specification, one finds visualization-specific task typologies. These are discussed in some detail in Section 2.3.2. Fourth, approaches to developing vector representations of documents are central to the current discussion, which focuses on rendering just this type of data. An overview will be presented in Section 2.4. The results of two studies of user performance and preferences in 2-term and 3-term Boolean test scenarios are presented separately in Chapter 3.

2.1 Visual Interfaces for Information Retrieval and Browsing

Visual interfaces for information retrieval and browsing, as discussed in Section 1.3.2, can take many forms, e.g., reference point systems, map displays, 3-dimensional systems. The focus of the current work is on low-dimensional reference point systems. There are many map-type visualizations including SPIRE (Wise et al 1995), Themescape (Wise et al 1995), self-organizing maps (Lin 1991, 1997), and BEAD (Chalmers 1996). These systems are relevant in the overall context of visualization of information but are outside the focus of this paper. The suggestion could be made that maps are quantitatively but not qualitatively different from reference point systems. This study, however, does not depend on the validity of this suggestion. Also excluded from review, although closely allied, are visualizations that are based on data derived from non-document sources, e.g., databases. Also missing from this analysis are 3-dimensional systems such as LyberWorld (Hemmje et al 1994), VR-VIBE (Benford et al 1995) and the set of systems developed at NIST and reported by Cugini et al (1996). The increased difficulty of interpreting these systems is outside the scope of the simple approach employed here. Future extension to such systems would need to be evaluated before applying the method.

Table 2-1 shows a list of pertinent low-dimensional document visualization systems. Each of these displays relies on representation of documents as vectors, although systems such as InfoCrystal contain Boolean vectors. The purpose of exploring these interfaces in the current context is to determine whether there are unifying features or common rendering techniques that map to the prototype displays planned for evaluation in this study. Such an analysis should provide an indication of the utility of the intended approach and of its potential for scalability.

Table 2-1: Reference point visualizations

Visualization System        Reference                 Tested?
Component Scale Drawing     Crouch & Korfhage 1990    yes
Cougar                      Hearst 1994               no
GUIDO                       Nuchprayoon 1996          yes
InfoCrystal                 Spoerri 1993              no
SIRRA                       Aalbersberg 1995          no
Space                       Newby 1992                yes
TileBars                    Hearst 1995               yes
VIBE                        Olsen et al 1992          yes
WebVIBE                     Morse & Lewis 1997        yes

2.1.1 Description of Selected Visualizations

Component Scale Drawing (Crouch & Korfhage 1990) is shown in Figure 2-1a. The graph shows query terms on the x-axis; the order of the terms is determined by the user's weighting. The y-axis indicates classes of term weights. Documents are represented as broken lines and the query itself is represented as a solid line. The purpose of the system is to assist users in determining the similarity of query and documents.

 

Figure 2-1 Examples of the visual elements of the reference point systems

Cougar (Hearst 1994), shown in Figure 2-1b, uses a Venn diagram to represent the relationship between documents and query terms. Each query term is mapped to a circular area of the display. Document identifiers are shown as icons in the list box in the appropriate sector of the graph.

GUIDO (Nuchprayoon 1996) uses a novel type of display to allow sophisticated mathematical manipulation of similarity metrics (Figure 2-1c). The display allows the selection of two reference points that are then shown as points on the x- and y-axes. The resulting document set is then displayed in the plank that is generated at a 45˚ angle in the graph. Various retrieval caps and metrics are provided to enhance selection of desirable subsets of documents.

InfoCrystal (Spoerri 1993) is another example of a reference point system that is based on the Venn diagram model. Figure 2-1d shows the results of a 4-term Boolean query. The query terms are indicated at the vertices and the resultant subsets are associated with the other shapes shown in the bounding box. The number of edges of an included shape indicates the number of query terms and the direction of the vertex points to the query term. InfoCrystal is useful for determining the distribution of documents in a document collection that satisfy each of the possible Boolean queries. Spoerri describes higher dimensional InfoCrystals. He also illustrates a version that allows the creation of weighted vector queries, although the display looks tremendously complex.
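The distribution that such a display summarizes can be sketched in a few lines of Python: each document is assigned to the exact subset of query terms it contains, and the subsets are counted. This is only an illustration of the underlying Boolean partition, not InfoCrystal's implementation; the function name `boolean_partition` and the sample documents are hypothetical.

```python
from itertools import combinations

def boolean_partition(doc_terms, query_terms):
    """Count documents by the exact subset of query terms they contain."""
    query = list(query_terms)
    # One bucket for every non-empty combination of query terms.
    counts = {
        frozenset(combo): 0
        for r in range(1, len(query) + 1)
        for combo in combinations(query, r)
    }
    for terms in doc_terms.values():
        present = frozenset(t for t in query if t in terms)
        if present:  # documents containing no query term are ignored
            counts[present] += 1
    return counts

# Illustrative (made-up) collection: document id -> set of index terms.
docs = {
    "d1": {"visual", "browsing"},
    "d2": {"visual", "retrieval"},
    "d3": {"retrieval"},
    "d4": {"interface"},
}
counts = boolean_partition(docs, ["visual", "retrieval"])
```

With n query terms there are 2^n - 1 non-empty buckets, which gives a sense of why higher-dimensional displays of this kind quickly become complex.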

An example of SIRRA (Aalbersberg 1995) is shown in Figure 2-1e. The visual interface incorporates a list of multicolor icons. Each query term is assigned a color and each icon represents a single document. Users can compare documents with respect to the strength of a query term within and across documents in a set.

Space (Newby 1992) is an IR system based on the principles of navigation, which Newby defines as human behavior to make sense of an information space. The example of Space shown in Figure 2-1f is a part of the interface termed the 'Navigation' window. Keywords and document identifiers float in the field. The placement of the documents with respect to the keyterms is based on the relative strength of attraction of the document for the term. Other windows in the interface include a map view and a key term list.

TileBars (Hearst 1995) is based on segmentation of the underlying full text into topics. Figure 2-1g shows the results of a 3-term query. Each large box represents a single document and the grayscale rectangular areas within them show the relative amount of the term in sequential fragments of the text. Each row within the large rectangle represents a query term or combination of terms. This visual method has been incorporated into the Scatter/Gather interface.

VIBE (Olsen et al 1992) is shown in Figure 2-1h. Query terms are shown at the vertices of the figure and the resulting retrieved sets of documents are shown as icons scattered around the enclosing triangle. VIBE can be used to represent the results of Boolean as well as vector queries. This visual is part of a fully featured interface that allows users to interact with term lists and moderately large document collections.
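The placement rule is not spelled out here. The sketch below assumes a weighted-centroid rule, in which each document icon is pulled toward each query term in proportion to the document's weight for that term. This is one simple way to obtain a VIBE-like layout and should not be read as the system's actual implementation; the function name, coordinates, and weights are hypothetical.

```python
def place_document(weights, poi_positions):
    """Place a document icon at the weighted centroid of the reference points."""
    total = sum(weights)
    if total == 0:
        return None  # document contains none of the query terms
    x = sum(w * px for w, (px, _) in zip(weights, poi_positions)) / total
    y = sum(w * py for w, (_, py) in zip(weights, poi_positions)) / total
    return (x, y)

# Three query terms at the vertices of a triangle (made-up coordinates).
pois = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
equal = place_document([1, 1, 1], pois)      # centre of the triangle
one_term = place_document([0, 3, 0], pois)   # sits on the second vertex
two_terms = place_document([1, 3, 0], pois)  # on the edge, nearer the second vertex
```

Under this assumption only the ratios of the weights matter, so documents with weights (1, 3, 0) and (2, 6, 0) would be drawn at the same position.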

WebVIBE (Morse & Lewis 1997) is an obvious relative of VIBE (Figure 2-1i). An attempt was made to reduce the number of features, which was hypothesized to affect the ability of people to use the interface effectively. Since the information source is not local, the vector representation of the document content must be determined 'on-the-fly'. The naive physics metaphor of magnetism was employed to encourage interaction and learnability.

2.1.2 Analysis of Reference Point Visualizations

Several types of visual approaches can be seen when the above examples are analyzed. Three major categories are Venn diagrams, icon lists, and spatial systems. Both Cougar (Hearst 1994) and InfoCrystal (Spoerri 1993) are based on the Venn diagram. SIRRA (Aalbersberg 1995) and TileBars (Hearst 1995) present icon lists. VIBE (Olsen et al 1992), WebVIBE (Morse & Lewis 1997) and Space (Newby 1992) employ a spatial method to render the relationship of documents and key terms. Although GUIDO (Nuchprayoon 1996) and Component Scale Drawing (Crouch & Korfhage 1990) are both graphical representations that show an x-y graph, there does not appear to be much that makes them similar. The latter uses a line graph and nominal scales while the former uses icons and continuous scales.

The prototypes described for testing in the present studies contain representatives of the icon list and spatial types of displays. The Venn diagram appears to be useful for displaying Boolean data but falls short of making compelling displays of vector data. The x-y graph type that is to be used in the 2-term prototype testing is clearly unrelated to either GUIDO or Component Scale Drawing, but has some characteristics that are related to BIRD (Kim & Korfhage, 1994).

2.2 Evaluation of Visual Interfaces for Information Retrieval and Browsing

Of the reference-point visualizations discussed in the previous section, only Component Scale Drawing, GUIDO, Space, TileBars and VIBE have been subjected to user studies. Each of the studies will be reviewed briefly here. The purpose in reviewing these evaluation methods is to determine what tasks were given to the subjects and also to determine which interfaces were subjected to usability evaluation as opposed to task performance evaluations. Other pertinent aspects of the studies, such as number of subjects used and characteristics of the user/subject populations, will be noted where such information exists.

Component Scale Drawing (Crouch & Korfhage 1990) was tested by presenting a user with a display based on each of 15 queries. The task of the user was to use the Component Scale Drawing tools to rank the documents with respect to their similarity to the underlying query. The rankings were then compared with the known relevance rankings. The results showed that there was a highly significant relationship between the user's rankings and the known rankings (Spearman coefficient 0.85 across queries). The number of users is not clear from the paper but may be limited to a single person. The task is clearly highly specific to the interface.
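For readers unfamiliar with the statistic, the following Python sketch computes a Spearman coefficient as the Pearson correlation of two rank vectors, with tied values receiving the average of their ranks. The two rankings are made-up illustrative data, not values from Crouch & Korfhage, and the function names are hypothetical.

```python
def ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(a, b):
    """Spearman coefficient: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical rankings of five documents; the first two are swapped.
user_ranking = [1, 2, 3, 4, 5]
known_ranking = [2, 1, 3, 4, 5]
rho = spearman(user_ranking, known_ranking)
```

A coefficient of 1 indicates identical orderings and -1 a fully reversed ordering, so a value such as 0.85 reflects close but imperfect agreement between the two rankings.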

GUIDO (Nuchprayoon 1996) was subjected to usability testing. Sixteen subjects were charged with performing nine information retrieval tasks. Tasks were graded as 'easy' or 'hard'; 'easy' tasks presented the test subject with pre-selected metrics, retrieval threshold, and points of interest (POIs), while 'hard' tasks required the subject to select each independently. The primary goal of each task was to choose the 8 'best' documents from the resultant display. The primary measure was the amount of time that it took the subjects to perform the document selection. The results showed that there were some interactions between the retrieval threshold and the metric. Subjects provided positive feedback on the GUIDO system.

Newby (1992) tested Space with 20 users. They were provided with a full system display which included the multi-window display and a mouse and PowerGlove. His primary goal was to test the ability of users to navigate abstract spaces. Users performed two information retrieval tasks: 1) a closed-ended question that was based on key-term synonymy and 2) an open-ended task based on a vague statement of information need. The 'Space' system was compared with a traditional IR system (Prism). Newby demonstrated considerable learnability of the Space system and high user ratings. Comparison with the more traditional system showed that users preferred the system with which they were already familiar.

TileBars (Hearst 1995) has not been subjected to the same type of user study as the systems mentioned thus far. The interface itself has not been user tested, but the algorithm underlying its segmentation of text into topics has been compared with human segmentation of the same text. High correlations were found between the two types of segment generators. This study, however, is not particularly relevant for the purposes of the present work.

VIBE has been subjected to user testing by Koshman (1996). She compared the performance of expert and novice searchers using VIBE or a conventional text-based interface (AskSAM). There were 15 novices, 12 online search experts and 4 subjects who had VIBE system expertise. Due to the small sample of VIBE experts, the study concentrates on the first two groups. This was a thorough usability study of the VIBE interface in that it sought to measure users' performance at tasks that required use of novel interface features. Subjects performed seven tasks that were chosen for their likelihood to represent 'normal' user IR tasks. The tasks were structured to cover a variety of 'information tasks' as opposed to 'navigation tasks' since many of the latter tasks could not be realized in the VIBE interface (p. 58). In general, the tasks have a Boolean flavor, e.g., how many documents contain (all, one, or two) terms. Scenarios were constructed to provide a naturalistic information seeking setting. Usability was assessed by measuring: 1) system familiarity time, 2) task performance speed, 3) frequency with which online help is accessed, 4) number of errors in task results, 5) subjective satisfaction, and 6) system feature retention. Familiarity time showed no difference for interface or for expertise level. Koshman showed that time to complete tasks was inversely related to expertise. Users preferred the familiar, text-based interface to the visual VIBE interface. She states that users retained what they learned from one session to the next but believes that this was due to increased 'familiarity with the kinds of task and the tools needed to perform the tasks'. It is reasonable to conclude that the Boolean nature of the tasks chosen for this study influenced the outcome, in that Boolean tasks are probably accomplished more effectively with Boolean systems such as AskSAM.

WebVIBE was subjected to usability testing (Morse & Lewis 1997). The overall aim of these studies was to determine whether defeaturing existing IR interfaces could produce interfaces that could be used successfully in 'walk-up' systems, especially on the Web. The results showed that users could indeed form correct inferences about retrieved documents and their relationship to the query terms without extensive training.

2.3 Task Models

Several frameworks for information visualization have been proposed (Kennedy et al. 1996, Rogowitz & Treinish 1993, Wehrend & Lewis 1990). Some of these structures include modeling of the user. Increasingly, user-centered design is being adopted. In this paradigm, explicit representation of the user is important. The user can be modeled in the system by assessing the user's goals and/or defining the tasks the user needs to perform.

This section will present several task models, some of which are domain-dependent and others that are independent of domain. The granularity of analysis runs the gamut from very fine-grained to very high level.

A classification scheme supports the development of task sets for system evaluation and lays the groundwork for the development of automatic visualization systems. By knowing the data that exist, the requirements of the interface, and the goals of the user, it becomes possible to ask how one might build visualizations automatically. The purpose of this paper is to discuss the issues that contribute to understanding how best to approach the evaluation of document visualization systems.

2.3.1 Domain-Dependent models

Modeling users in information retrieval situations has a long history in library science. Systems have changed from having only titles and minimal other metadata to having abstracts to the present situation in which most texts are available as full-texts. Systems have increased in capacity to accommodate the requirements of full-text storage and systems have taken advantage of increased computing power to perform searches. Where once an intermediary worked with a user to formulate a query which would be submitted in essentially batch mode, many current systems are used by the end user and searches are interactive. Only recently have visualizations been developed that might help satisfy some of the user's information needs. The models developed by library scientists have changed to accommodate evolving resources.

The following task models developed for use in library environments were chosen to show how varied the approaches are and to describe some models that may have utility in evaluating visual interfaces. Reviews of the historical evolution of information retrieval can be found in Spink (1997) and Bates (1989).

2.3.1.1 Marchionini

The breakdown of the information seeking provided by Marchionini (1992) describes a network of tasks that are performed in various, user-defined orders until the information-seeking problem is solved. Marchionini states clearly that there are two basic forms of information needs - fact knowledge and browsing. The subtasks that he provides are not different for the two types of needs and are:

This particular task list is relevant to interface design but provides little guidance on what subtasks might be. It is also highly grounded in the traditional information retrieval paradigm in that it relies on query formation and an iterative performance of steps to arrive at a satisfactory solution.

2.3.1.2 Bates

Bates (1989) describes a 'berrypicking' model of information retrieval, which she contrasts with the classical method. Her description of browsing in a world of text seems to offer similarities to visual representations. She presents a list of 6 tasks:

2.3.1.3 Belkin

Belkin et al (1995) propose that information seeking can be defined with respect to four dimensions as shown in the following table.

Table 2-2: Information seeking dimensions (Belkin et al 1995)

Searching as a method of interaction refers to trying to find some known item, while scanning refers to trying to find something interesting. The goal of interaction might be to learn something about an item or it might be to select the item. When looking for items, the user might specify what should be looked for or he might find it by recognizing it. The distinction between information and meta-information is the same distinction that has been made in this paper.

Belkin et al (1995) note that there are 16 possible information-seeking strategies if each of the components is viewed as a Boolean value. For instance, traditional information retrieval might be characterized as Selecting + Specification + Meta-information + any method of interaction, whereas information visualization would be described as Learning + Recognition + Information + any method of interaction, though most frequently Scanning.
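The count of 16 follows from treating each of the four dimensions as a two-valued choice (2^4 = 16). A minimal Python sketch of the enumeration is given here. The short dimension names used as dictionary keys are illustrative labels assumed for this example, since Table 2-2 is not reproduced; the values are those named in the text.

```python
from itertools import product

# Keys are assumed shorthand labels; the paired values are the ones named in the text.
DIMENSIONS = {
    "method": ("Scanning", "Searching"),
    "goal": ("Learning", "Selecting"),
    "mode": ("Recognition", "Specification"),
    "resource": ("Information", "Meta-information"),
}

def all_strategies():
    """Enumerate every combination of one value per dimension."""
    names = list(DIMENSIONS)
    return [dict(zip(names, combo)) for combo in product(*DIMENSIONS.values())]

strategies = all_strategies()
print(len(strategies))  # 16
```

Under this encoding, the characterization of information visualization given above corresponds to the entries with goal "Learning", mode "Recognition" and resource "Information", with either method of interaction.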

2.3.1.4 VIRI Research Group Tasks

The VIRI (Visual Information Retrieval Interface) research group led by Professor Korfhage at the University of Pittsburgh developed a set of tasks that we term 'tool-enabled' tasks. The idea behind the name is that visualizations provide ways of doing things that might not be possible or that might be much more difficult using less visual means. The list is as follows:

2.3.2 Domain-Independent Visual Taxonomies

2.3.2.1 Wehrend & Lewis

The task classification of Wehrend & Lewis (1990) is a low-level, domain-independent taxonomy of tasks that users might perform in a visual environment. Domain-independence allows generalizability. Wehrend & Lewis' classification consists of the following set of user actions.

2.3.2.2 Zhou & Feiner

Zhou & Feiner (1998) have developed a visual task taxonomy. This taxonomy extends that of Wehrend & Lewis (1990) by defining additional tasks, by parameterizing the tasks, and by developing a set of dimensions by which the tasks can be grouped. Table 2-3 shows a list of the elemental visual tasks together with task parameter lists (shown in angle brackets).

Table 2-3: Visual task taxonomy (Zhou & Feiner 1998)

[Table not shown due to poor conversion from WORD97.]

The major dimensions of visual tasks that they describe are visual accomplishments and visual implications:

"Visual accomplishments describe the type of presentation intents that a visual might help to achieve, while visual implications specify a particular type of visual action that a visual task may carry out." -- Zhou & Feiner (1998)

The structure that results from applying the visual accomplishments dimension is a hierarchy. The major branches describe tasks that 'Enable' and tasks that 'Inform'. The former are further decomposed into exploration tasks and compute tasks, while the latter are described as elaborate and summarize tasks. The breakdown along the lines of visual implications may be useful in developing domain-dependent tasks. Zhou & Feiner propose three types of implications: 1) visual organization, 2) visual signaling and 3) visual transformations. The overall structure of the implication dimension of the visual taxonomy is shown in Table 2-4.

Table 2-4: Visual implications and related elemental tasks

Implication    | Type               | Subtype    | Elemental tasks
---------------|--------------------|------------|----------------------------------------
Organization   | Visual grouping    | Proximity  | associate, cluster, locate
               |                    | Similarity | categorize, cluster, distinguish
               |                    | Continuity | associate, locate, reveal
               |                    | Closure    | cluster, locate, outline
               | Visual attention   |            | cluster, distinguish, emphasize, locate
               | Visual sequence    |            | emphasize, identify, rank
               | Visual composition |            | associate, correlate, identify, reveal
Signaling      | Structuring        |            | tabulate, plot, structure, trace, map
               | Encoding           |            | label, symbolize, portray, quantify
Transformation | Modification       |            | emphasize, generalize, reveal
               | Transition         |            | switch

2.3.3 Summary of Task Models

Each of the task models presented above presents a different view of the problem of evaluating visual information retrieval systems. Domain-dependent models are very high-level abstractions of user tasks. While it would be desirable to employ such a model, there is no obvious method for selecting possible types of tasks. The fact that the various typologies are presented at several different levels of granularity adds to the difficulty of determining which of the domain-dependent models to use. In addition, the studies done with library patrons focus on the tasks that the users seek to accomplish; these tasks are perhaps learned behaviors due to the users' prior knowledge of how libraries work, in that patrons tend to ask questions that they know can be answered. Visualizations might support a different way of asking questions and getting answers. The task list developed by the VIRI Research group is not structured as a typology at all. Even the more structured typologies do not answer the question of what fraction of an information browsing user's needs are contained in each taxonomic category. The difficulty in using the domain-independent models is the number of possible mappings to a particular domain.

2.4 Representation of Text as Document Vectors

The use of natural language processing to generate better document vectors has been the object of intense investigation for a long time. Methods for detecting phrases (Croft et al 1991) and for extracting names (Rau & Jacobs) and topics (Hahn 1990) have enriched the arsenal of information retrieval (IR) researchers. The advent of full-text rather than mere surrogates has opened the question of whether the old methods, which were developed to handle short text pieces, would scale up to handle full-text. The evidence shows that there is some degradation of processing effectiveness (Blair & Maron 1985). One of the possible factors that inhibit scalability is that long pieces of text are actually strings of related and dependent ideas whose major theme emerges from their juxtaposition. In order to capture the meaning of these longer texts there has been a considerable effort to detect and encode the content of subpassages of documents.

The purpose of this section is to guide the reader to an appreciation of the difficulty in producing the requisite data for visualizing text. The starting material for text characterization is usually full-text, but in some cases only surrogate documents comprised variously of title, authors, abstract, citation list are used. Methods for processing these text pieces are generally lexical in nature. Systems that are more ambitious employ syntactic and semantic parsing. There is some evidence that detection of phrases is useful in improving the effectiveness of retrieval. Other methods rely on neural networks to detect patterns in text. There is some intriguing evidence that the purely statistical methods and neural networks produce results that are highly similar (Schütze et al 1995). The problem with all these methods is similar to the problem of people trying to understand each other. The overriding hope is that the words that are spoken or written convey some meaning that is intended by the speaker/writer and understood by the listener/reader. To ask machines to do what people often fail to do is a big task. The goal of all the methods is to capture some core essence of a passage, document, or collection. The hope is that the content being examined is sufficiently clear, long enough, redundant on topic and sparsely populated by extraneous material. Two important criteria bear investigation:

Willett (1988), Schütze et al (1995) and Lewis & Sparck Jones (1996) have presented reviews of data generation methods. The essential thing to keep in mind when performing visualizations based on this type of data is that the data is fuzzy at best. The computer slogan 'garbage in, garbage out' serves as a warning to those who attempt making pictures of questionable data.

2.4.1 Text Analysis-the basic method

Regardless of whether a Boolean, extended Boolean, fuzzy Boolean, probabilistic, or vector model is used for information retrieval, the document is represented in the computer system as a vector of terms. In some cases, the vector contents are binary (0, 1) to represent the presence or absence of a term. Other systems use numeric values to indicate the strength of a relationship between a document and a term element. The permissible range of values is of little consequence; systems using values between zero and one are common as are those that use positive integers. The first step in processing any text collection is to count the frequencies of words in the texts. Usually one or more stop lists are employed at this stage in order to speed up processing and to generate more meaningful term sets. A generic stop list contains words that are too common in the language to allow reasonable retrieval characteristics, e.g., 'the', 'a', 'an', 'of', 'that.' Additional stop lists may be employed in a particular domain to prevent inclusion of words that are prevalent in the local environment, e.g., 'rock' in a geology textbase, or 'computer' in a computer and information science collection. In addition, words are usually stemmed by any of several methods (e.g., Lovins (1968), Porter (1980)) so that the set of potential keywords is compacted.
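The basic method described above (word counting, stop lists, stemming, and binary or numeric term vectors) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop lists are tiny samples, the function names are hypothetical, and `crude_stem` is a rough suffix stripper standing in for a real Lovins or Porter stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop lists (assumed); real systems use much longer ones.
GENERIC_STOP = {"the", "a", "an", "of", "that", "and", "in", "is"}
DOMAIN_STOP = {"rock"}  # e.g., for a geology textbase

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for Lovins or Porter stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_counts(text, stop_words=GENERIC_STOP | DOMAIN_STOP):
    """Count stemmed words in a text, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(w) for w in words if w not in stop_words)

def to_vector(counts, vocabulary, binary=False):
    """Turn term counts into a vector ordered by the vocabulary."""
    return [int(counts[t] > 0) if binary else counts[t] for t in vocabulary]

docs = ["The folding of rock layers", "Layers of folded sediment in the rock"]
counts = [term_counts(d) for d in docs]
vocabulary = sorted(set().union(*counts))
vectors = [to_vector(c, vocabulary) for c in counts]
print(vocabulary)  # ['fold', 'layer', 'sediment']
print(vectors)     # [[1, 1, 0], [1, 1, 1]]
```

Passing `binary=True` to `to_vector` yields the presence/absence form used by Boolean systems; the default keeps the raw counts that later weighting steps operate on.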

2.4.2 Refinements of the Basic Method

The resulting raw count data is subjected to further processing by several methods. Depending on the domain and size of the collection, the number of terms identified at this stage may range from several hundred to tens of thousands or more. Among the most common methods used at this point are: normalization for document length, application of a term discrimination method, term intercorrelation determination, and thesaurus expansion.

        1. Normalization for document length

          Collections can vary widely in the size of documents that they contain. A book and an abstract might both contain the same number of occurrences of a particular term. It is clear that, in this case, the term is probably a better descriptor of the shorter document. In order to control for document length, it is customary to normalize the term counts for document length. The necessity for this correction factor depends on the similarity measure chosen for subsequent calculations. If the cosine is used to determine similarity, then no correction need be applied. The process of weighting by frequency of occurrence in the total document collection is an attempt to normalize document representatives with respect to expected frequency distributions.

        2. Term discrimination value

          Thus far, a list of words or stems has been produced together with a frequency of occurrence of those elements in each text of a collection. The only adjustment has been for document length. One of the major purposes of a term list is to allow a user to appreciate differences and similarities among texts. Terms that appear in nearly every document are useless for this purpose, as are terms that occur rarely. The inverse document frequency in one of several forms is applied to normalize for term set size (Harman 1992). Alternatively, a commonly applied heuristic for the lower bound is that a term should appear in over 20% of all documents. Similarly, terms that appear in over 80% of all texts can be ignored. The term discrimination value is another method for determining which terms provide the best indexing terms for a collection of documents (Salton 1989). The hundreds or thousands of terms generated during the concordance phase of text processing can be viewed as a multidimensional term space within which the documents are suspended. It is theoretically possible to determine the effect of adding or removing a term on the placement of documents in the space. If adding (or deleting) a term causes a significant change in the shape of the space then the term is considered important. If adding (or deleting) a term produces little effect on document distribution then it could probably be ignored. The Exact method of Willett (1985) compares each multidimensional document descriptor with each other document vector using the cosine similarity measure. Terms that produce positive cosine values indicate 'good' discriminators; terms that produce negative cosines are useful for dissecting out regions of space that indicate 'not'; intermediate and zero values are neutral for the process of discriminating.

          The Exact method is an O(n^2) process. Even though the calculation of discrimination values is not performed dynamically during a browsing or retrieval session, the number of terms can lead to processing times on the order of tens of hours even on powerful processors. The method described by Salton (1989) proposes to calculate a centroid document, which is used for comparison with each document vector. This process is clearly O(n). A study by Crouch (1988) showed that the results of using this approximate method were as good as those of the exact method in terms of specificity of term identification, with the expected huge reductions in processing time.

        3. Intercorrelation determination

          The terms identified by either a pure concordance or those filtered by calculation of term discrimination value (TDV) are likely to be intercorrelated, i.e., different terms produce the exact same documents in response to a query. The implication is that the number of terms can be reduced without affecting the quality of the index terms. In addition, a reduction of correlating terms is indicated in the situation of vector model retrieval, in which a usual assumption is that the terms are pairwise orthogonal. Raghavan and Wong (1986) present a detailed description of the side effects of violating this assumption. They admit, however, that applications based on vectors as a notational convenience rather than a formal model of IR concepts have been successful. Clustering methods are frequently used to identify terms that co-occur. The review by Willett (1988) presents a lengthy discussion of the available methods and the advantages and disadvantages of each. Chen et al (1995) review various methods and present results derived using several different clustering algorithms.

        4. Thesaurus expansion

Thesauri can be applied to document collections to generate broader, narrower, synonymous and related terms. Research in this area comprises both creation of and use of thesauri. Chen et al (1995) describe a method for creating a thesaurus using multiple sources. In addition to using the methods described thus far in this paper (term frequency, document frequency, weighting for length, co-occurrence analysis), they subjected the term lists to one of two generative methods. In the first, they treated the terms as a single collection, regardless of source. In the other, they processed separately the terms from each of four different sources about the same topic. Their study concentrated on trying to determine if better methods could be devised for coping with the problems of information overload and language fluidity. This seems to be a major thrust of automatic thesaurus generation research: automating takes care of the 'overload' problem and creative indexing takes care of the 'fluidity' problem.

The work of Losee and Haas (1995) is a typical study in the field of thesaurus development. Their work concentrates on sublanguages, the languages used by people working in a particular field or discipline. This area is particularly concerned with language that is changing rapidly to accommodate advances in science. Although all languages undergo gradual change, the world of scientific endeavor experiences even more rapid turnover due to the introduction of new concepts that need to find expression. A related problem is the borrowing of terms from one discipline to cover the needs of another. For automatic indexing systems, it is a special problem to know what the introduction of new terms might imply.
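Several of the refinements listed in this section are easy to state in code. The following Python sketch illustrates three of them under stated assumptions: the cosine measure (which is insensitive to document length), one common form of inverse document frequency, log(N/df), and a centroid-based approximation of the term discrimination value in the spirit of the method attributed to Salton (1989). The function names and the toy collection are hypothetical, and the exact formulas used in the cited studies may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity; scaling either vector leaves the result unchanged."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def idf_weights(doc_vectors):
    """One common IDF form, log(N / df); terms found in every document get 0."""
    n_docs = len(doc_vectors)
    weights = []
    for t in range(len(doc_vectors[0])):
        df = sum(1 for d in doc_vectors if d[t] > 0)
        weights.append(math.log(n_docs / df) if df else 0.0)
    return weights

def space_density(doc_vectors):
    """Mean cosine between each document and the collection centroid (O(n))."""
    n = len(doc_vectors)
    centroid = [sum(col) / n for col in zip(*doc_vectors)]
    return sum(cosine(d, centroid) for d in doc_vectors) / n

def discrimination_value(doc_vectors, term_index):
    """Density without the term minus density with it.

    A positive value means the term spreads documents apart (a good discriminator).
    """
    without = [[w for i, w in enumerate(d) if i != term_index] for d in doc_vectors]
    return space_density(without) - space_density(doc_vectors)

# Toy collection (assumed counts): term 0 occurs equally in every document.
docs = [[3, 1, 0], [3, 0, 2], [3, 4, 0]]
print(idf_weights(docs)[0])                 # 0.0
print(discrimination_value(docs, 0) < 0)    # True: a poor discriminator
print(discrimination_value(docs, 2) > 0)    # True: a good discriminator
```

Because the centroid is computed once per term set, each discrimination value costs one pass over the documents, which is the source of the savings over pairwise comparison.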

2.4.3 Alternative Methods for Encoding Documents

Although keywords and vector representations are the most commonly encountered methods of representing text, especially in situations in which automatic encoding is desired, e.g., large on-line collections and/or the WWW, there are significant advantages to using different approaches to text processing. For instance, several of the projects from Xerox PARC employ citation tracing to support browsing of large information stores found in distributed sites (Mackinlay et al 1995). The researchers undertaking these projects cite the utility of using the built-in schemes of large IR suppliers such as DIALOG. One of the side effects is the ability of such systems to use querying based on relational databases. While it would be difficult to encode in vector form the information about the year of publication, the names of the authors, or similar demographic information, systems that rely on relational databases can use this information quite effectively. Several projects are attempting to merge the two approaches to characterizing text, the statistical and the relational database approach (Blair 1988, Croft & Parenty 1985, Lynch & Stonebraker 1988, McLeod & Crawford 1983). Considerable interest exists but there is also much dissension regarding the proper methods to use (DeFazio et al 1995). If the information sources that eventually become available include significant amounts of classical database material, then some of the methods that have been developed for visualizing databases will become immediately applicable to the visualization of document information.

The method called latent semantic indexing or LSI (Deerwester et al 1990) seeks to leverage the correlations among terms in documents to yield superior indexing parameters. The method reduces the dimensionality required to render a document space. LSI uses a singular-value decomposition (SVD) method. A term-document association matrix is constructed using at least 100 terms. Transformation using SVD produces a series of matrices that have reduced dimensionality. In fact, this method generates orthogonal variables, which as mentioned earlier are a requirement for implementation of formally correct vector models (Raghavan and Wong 1986). Deerwester et al (1990) showed that LSI was superior to several other methods with respect to both precision and recall. This method has been incorporated into other IR systems; e.g., Schütze et al (1995) have found that LSI provides superior pre-processing for neural network inputs.
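The dimensionality reduction at the heart of LSI can be illustrated with a truncated singular-value decomposition. The sketch below assumes NumPy is available and uses a toy 5-term by 4-document matrix with made-up counts, far smaller than the 100 or more terms mentioned above; the function name `lsi` and the choice of k = 2 are illustrative only, not the procedure of Deerwester et al.

```python
import numpy as np

def lsi(term_doc, k):
    """Project documents (columns of term_doc) into a k-dimensional latent space."""
    u, s, vt = np.linalg.svd(np.asarray(term_doc, dtype=float), full_matrices=False)
    doc_coords = (np.diag(s[:k]) @ vt[:k]).T  # one row per document
    return u[:, :k], s[:k], doc_coords

# Toy matrix (assumed counts): rows are terms, columns are documents.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [1, 1, 1, 1],
])
term_basis, singular_values, doc_coords = lsi(A, k=2)

print(doc_coords.shape)                                      # (4, 2)
print(np.allclose(term_basis.T @ term_basis, np.eye(2)))     # True
```

The second check confirms the property noted in the text: the retained latent dimensions are mutually orthogonal, regardless of how correlated the original terms were.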

Kohonen maps have also been used to characterize information spaces (Lin 1991, 1992, 1997). Lin has developed displays that can show both content and structure of a document space. He provides as inputs to his algorithm N-dimensional vectors. Through a series of iterations of weight adjustments, the system converges. Sample experiments are described in which input vectors consist of a hundred to more than a thousand elements. The outputs were mapped to grids that were either 10 by 14 or 14 by 14. The mapping that is produced has large areas for concepts that are focal in the collection and smaller areas for less well-mentioned topics. In the examples shown (Lin 1997) the reduction in dimensionality was in the order of 10:1 or greater.
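The iterative weight adjustment that drives a Kohonen map can be sketched generically as follows. This is a textbook-style self-organizing map written with NumPy, not Lin's implementation: the learning-rate and neighborhood schedules, the parameter names, and the random toy vectors are all assumptions made for illustration. Only the 10 by 14 grid size is taken from the text.

```python
import numpy as np

def train_som(vectors, rows=10, cols=14, epochs=10, lr0=0.5, sigma0=3.0, seed=0):
    """Fit a rows x cols grid of weight vectors to the input vectors."""
    rng = np.random.default_rng(seed)
    data = np.asarray(vectors, dtype=float)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every node, used by the neighborhood function.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays to zero
            sigma = sigma0 * (1.0 - frac) + 0.5     # neighborhood shrinks over time
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)  # best-matching node
            grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
            influence = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            weights += lr * influence[..., None] * (x - weights)
            step += 1
    return weights

def best_node(weights, x):
    """Grid cell whose weight vector is closest to x."""
    dists = np.linalg.norm(weights - np.asarray(x, dtype=float), axis=2)
    return np.unravel_index(np.argmin(dists), dists.shape)

toy_vectors = np.random.default_rng(1).random((6, 8))  # 6 documents, 8 terms (made up)
som = train_som(toy_vectors)
print(som.shape)  # (10, 14, 8)
```

Calling `best_node(som, toy_vectors[0])` returns the grid cell to which a document would be assigned; topics that attract many documents end up occupying larger regions of the grid.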

3 Preliminary Studies

Two preliminary studies were performed differing in the number of key terms used to order the displays. The first study used 2 terms, which makes this the simplest possible testing condition, and the next was based on 3 terms. The discussion of the 2-term study is presented briefly since this work has been published (Morse et al 1998). The 3-term study is given in more detail since there is no published source for an interested reader to consult for more information.

3.1 2-term Boolean Study

The first step in the proposed bottom-up testing plan involved presenting a variety of simple interfaces to groups of undergraduate students. The students were registered in programs at the University of Pittsburgh or Molde College, Norway. The simple interfaces are shown in Figure 3-1; they are labeled text, icon-list, table, graph and 'spring'. The last of these is based on the VIBE display model (Olsen et al 1992).

Figure 3-1: Samples of presentation types: Panel A is a 'text list'; B is an 'icon list'; C is a 'table'; D is a 'graph'; and E is a 'spring' display.

Subjects took the test as a paper-and-pencil exercise during a normal class session. Instructional materials were limited to a short presentation of the displays using dummy data. The order of presenting the displays was randomized. The total randomization entailed 120 different orderings. For each display the subjects were asked two questions: 1) Circle the item(s) that contain term X and Y, and 2) How many items contain the term X. After answering all the questions for each display the subjects were asked to rank the displays according to their personal preference. Demographic information was also collected at this time.
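The figure of 120 orderings is simply the number of permutations of the five display types (5! = 120). A short Python sketch makes this concrete. The random assignment function is an assumption added for illustration; how orderings were actually allocated to subjects is not described here.

```python
from itertools import permutations
import random

DISPLAYS = ["text", "icon-list", "table", "graph", "spring"]
orderings = list(permutations(DISPLAYS))
print(len(orderings))  # 120

def assign_orderings(n_subjects, seed=0):
    """Hypothetical allocation: draw one ordering at random for each subject."""
    rng = random.Random(seed)
    return [rng.choice(orderings) for _ in range(n_subjects)]
```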

Two hundred sixteen (216) subjects took part in the study. There were 121 men and 95 women. Seventy-five students were from Norway and the remainder took the test in Pittsburgh. Although the test was administered as part of the requirements of an undergraduate class, there were 9 graduate students enrolled in these classes. There were 71 freshmen, 19 sophomores, 43 juniors, and 72 seniors in the sample. One hundred eighteen students were under 23 years of age; 71 were between 23 and 30 and the remaining 21 were over 30. Of the students in the Pittsburgh sample, 29 reported that English was not their native language.

The performance of subjects was similar with respect to gender, age, and year in program. Initial analysis of the data showed a significantly better performance by subjects in the 'Norwegian' group compared with the 'American' group. However, all of the variation could be explained by the high number of subjects in the latter sample that spoke a language other than English as their native tongue. When native language was factored into the analysis, the discrepant performance was eliminated.

Performance was affected by question type, display type, and order of presentation. The first question ('Circle items about terms X and Y') was answered correctly more often than the OR question ('How many items contain the term X?') regardless of display type. This finding indicates that attention must be paid to the construction of probe questions for interface testing. Average performance for each display type showed that the text and icon lists were easiest to comprehend, followed by the table; the graph and 'spring' displays were most difficult. The order in which the displays were presented was also a significant predictor of successful performance. Figure 3-2 shows the results of this interaction. It shows that displays such as the 'spring', which were difficult to understand and use if presented first, were easier to use when presented later, simply because of the practice gained in answering similar questions during earlier display trials.

Figure 3-2: Effect of order of presentation of various displays on performance of OR task (Morse et al 1998)

The subjects' preferences showed that performance was not the best predictor of preference. Fully 47% of the subjects rated the text display as their least favorite despite superior performance with that display. Over 60% of the respondents indicated a preference for the visual displays, i.e., the icon list and 'spring' display. It is possible that there was a Hawthorne effect, since the subjects might easily have assumed that the investigators were testing the visual displays and might be looking to demonstrate their superiority. Users prefer visual displays even if these displays do not always provide the best environments for task performance.

This study serves as a baseline against which more complicated IR designs can be tested. It shows that subjects can learn novel interfaces with a minimum of effort, that they prefer visual interfaces and that there is sufficient sensitivity of the measures employed in this study to make clear observations of differences in both performance and preference. The results of this study are presented in Morse et al (1998).

3.2 3-term Boolean Study

This study builds on the previous one by making the displays more complex through the addition of another term to the query. This move from two to three query terms is a significant step, since the number of conceptual groupings that can be formed, i.e., the number of Boolean combinations, is a function of the number of key terms. The function that describes the cognitive load is 2^n - 1, where n is the number of key terms and 2^n is the number of possible Boolean combinations. The set of documents that is retrieved by the conjunction of NOTs is a set that contains none of the terms of interest. This set is not renderable under normal conditions and the formula reflects this fact by subtracting one. Two query terms involve inspection of three kinds of document representations and three query terms generate seven types. This shift encompasses the limits of short-term memory, which leads to a conjecture that visual methods will have constant difficulty while list-based display methods will suffer a performance decrement.
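The count 2^n - 1 can be checked by enumerating every presence/absence pattern over the key terms and discarding the single pattern in which no term is present. The following Python sketch does this for two and three terms; the function name and the term labels are illustrative.

```python
from itertools import product

def renderable_combinations(terms):
    """All presence/absence patterns over the terms, minus the all-absent one."""
    combos = []
    for flags in product((True, False), repeat=len(terms)):
        if any(flags):
            combos.append(tuple(t for t, present in zip(terms, flags) if present))
    return combos

print(len(renderable_combinations(["A", "B"])))       # 3
print(len(renderable_combinations(["A", "B", "C"])))  # 7
```

Each returned tuple names the terms a document contains, so the seven tuples for three terms correspond to the seven kinds of document representation a subject must distinguish.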

3.2.1 Methods

3.2.1.1 Subjects:

Students enrolled in a variety of undergraduate courses in the Department of Information Science and Telecommunications at the University of Pittsburgh and at Molde College in Molde, Norway were used as the subjects for this study. A paper-and-pencil version of the test was administered to 32 subjects and 191 performed the experiment on-line (http://vidcap.sis.pitt.edu/b3IntroForm.html for English and http://vidcap.sis.pitt.edu/nb3IntroForm.html for Norwegian). An additional seven subjects were videotaped, audiotaped and subjected to post-test debriefing.

3.2.1.2 Prototypes:

The displays that were selected for testing comprise the types of visual displays commonly employed in IR interfaces. Samples of each are shown in Figure 3-3.

Figure 3-3: Prototype displays used in 3-term Boolean study

The text-based display is similar to the lists provided by Internet search engines. Icon displays have been used in the Scatter/Gather system (Hearst 1995) and have been observed as a component of some search engine results (Aalbersberg 1995). Tables are another common method for presenting summaries of search results. The final display, which is labeled 'triangle' in Figure 3-3, is based on the VIBE display using its Boolean variant. Documents that are about a single term will appear as part of the count of the node labeled with that term; documents that are about two terms will contribute to the count of the node located at the midpoint of the line connecting the terms; and documents that contain all three terms referred to in a query will be summed in the node located in the center of the triangle.
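The node placement and counting rule of the 'triangle' display can be expressed compactly. In the Python sketch below, each node is placed at the mean of the vertex positions of the terms it represents, which yields the vertices for one term, the edge midpoints for two terms, and the centroid for three. Treating the "center of the triangle" as the centroid, the vertex coordinates, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter

def triangle_nodes(documents, terms, vertices):
    """Count documents per node; each node sits at the mean of its terms' vertices."""
    counts = Counter()
    for doc_terms in documents:
        present = tuple(t for t in terms if t in doc_terms)
        if present:  # documents containing none of the terms are not rendered
            counts[present] += 1
    nodes = {}
    for combo, n in counts.items():
        x = sum(vertices[t][0] for t in combo) / len(combo)
        y = sum(vertices[t][1] for t in combo) / len(combo)
        nodes[combo] = {"position": (x, y), "count": n}
    return nodes

TERMS = ("A", "B", "C")
VERTICES = {"A": (0.0, 0.0), "B": (6.0, 0.0), "C": (3.0, 6.0)}  # assumed layout
DOCS = [{"A"}, {"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"C"}, set()]
nodes = triangle_nodes(DOCS, TERMS, VERTICES)

print(nodes[("A", "B")])       # {'position': (3.0, 0.0), 'count': 2}
print(nodes[("A", "B", "C")])  # {'position': (3.0, 2.0), 'count': 1}
```

The empty document in the sample is dropped, mirroring the fact that the set containing none of the query terms is not rendered.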

3.2.1.3 Tasks:

The user's task was to answer questions related to the displays. The same questions were used for each display. The basic form of the questions conformed to a set of Boolean operations using the AND and OR connectives. For instance, the question "How many documents have all the three terms in them?" is equivalent to A AND B AND C.

3.2.1.4 Measures:

Performance was assessed as number of correct answers. Computer-mediated sessions were also assessed based on time-to-completion for each display. In addition, subjects were asked to rank the displays with respect to their preference for using them. A post-test questionnaire captured information about the subjects (age, gender, year in program, whether the experiment was performed in their native language), their computer and Internet experience, and, for the computer-mediated group, some specifics about their equipment configuration (modem speed, CPU speed, and monitor size).

3.2.2 Results

The results show that there was no significant difference between computer-mediated administration and paper-and-pencil. One hundred ninety-eight subjects performed the experiment in the computer-mediated mode. Timing data (Table 3-1) for these subjects showed highly significant differences among the displays when analyzed with repeated-measures ANOVA. The triangle (VIBE-style 'spring') and table displays were used more quickly than either the icon or text displays.

Table 3-1: Effect of display type on performance (mean ± SE)

Display Type    Time to answer set of 4 questions (seconds)    Number Correct
Text List       186 ± 9                                        3.56 ± 0.06
Icon List       175 ± 7                                        3.64 ± 0.04
Table           147 ± 7 *                                      3.63 ± 0.05
Triangle        145 ± 7 *                                      3.35 ± 0.06 *

* Indicates p<0.01 compared with non-marked categories in the same column.

The primary hypothesis tested in this experiment was that increasing the difficulty of the setting (from 2-term to 3-term Boolean) would reveal superior performance with visual displays. This immunity to performance decay would be accompanied by an increased preference of subjects for the visual displays.

Preference results are shown in Table 3-2. It is clear that the text display was not acceptable to the subjects, while the icon list and triangle display were considered very useful.

Table 3-2: Preference ratings of various displays

Display      Best    Second    Third    Worst
Text         18      17        53       95
Icon List    79      62        31       11
Table        22      57        75       29
Triangle     64      47        24       48
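One way to condense the counts in Table 3-2 into a single number per display is a mean rank, where Best = 1 and Worst = 4. The sketch below uses the counts exactly as they appear in the table; the mean-rank summary itself is an addition for illustration and is not a statistic reported in the study.

```python
# Counts copied from Table 3-2: subjects ranking each display
# Best (1), Second (2), Third (3), Worst (4).
preference_counts = {
    "Text": [18, 17, 53, 95],
    "Icon List": [79, 62, 31, 11],
    "Table": [22, 57, 75, 29],
    "Triangle": [64, 47, 24, 48],
}

def mean_rank(counts):
    """Average rank position, weighting each rank by the number of subjects who chose it."""
    total = sum(counts)
    return sum(rank * n for rank, n in enumerate(counts, start=1)) / total

mean_ranks = {display: mean_rank(c) for display, c in preference_counts.items()}
for display, value in sorted(mean_ranks.items(), key=lambda kv: kv[1]):
    print(f"{display}: {value:.2f}")
```

Each row of the table sums to 183 rankings. Lower values mean a display was preferred more often; on this summary the icon list has the lowest mean rank and the text display the highest, which agrees with the description of the preference results.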

The second variable that was tested in this study was question difficulty. Table 3-3 shows the average number of correct responses for each question independent of which display was used in generating an answer. There is a highly significant difference between the levels of difficulty, which is related to the number of ANDs and ORs that were required. As in the previous 2-term study, questions requiring the use of OR were more difficult. In this case, Question #2 was phrased so that it required the subject to use an OR.

 

Table 3-3: Performance as a function of question type

Question #    Composition      Correct answers (mean ± SEM)
1             3 AND            3.8 ± 0.3
2             3 AND + 1 OR     3.1 ± 0.2
3             4 AND            3.5 ± 0.3
4             3 AND            3.6 ± 0.3

Briefly, the other findings of this experiment were that there was no significant effect of any of the demographic parameters except for native language. All of the subjects who performed the testing in Norwegian spoke that language as their native tongue, while approximately 15% of the subjects who took the test in English spoke a language other than English as their first language. These non-native English speakers performed more slowly on each of the displays except for the table display. This difference in performance was also noted in the previous 2-term study (Morse et al 1998) where it was found to be stronger. None of the computer-related parameters showed any significance whether they were from the expertise or environment categories.

3.2.3 Discussion

The results of this study confirm in large part the study done previously using 2-term queries. The methodology has been refined so that question difficulty can be assessed more directly. It is satisfying that some unexpected findings from the original study, such as the native language and cross-cultural effects, have been replicated. These findings indicate that the method is robust enough to consider its further use.

The next step in a logical series of experiments requires consideration of data that are not Boolean in character. In the field of IR, there is extensive use of alternative models of retrieval, all of which are based on a vector representation of the documents in a collection. These vectors indicate not just whether or not a term occurs in a particular document but how frequently a term occurs and/or how discriminating a term is with respect to a whole collection. This issue takes on added importance when considering Internet collections, which undergo indexing by each search engine but in which the actual vector representation is hidden from direct inspection. We need to know what people see when they look at displays of weighted data. Perhaps the use of visual displays can assist users in figuring out what those lists of hits have been trying but failing to reveal.
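To make the idea of a weighted vector concrete, the sketch below builds one common kind of weighting, term frequency multiplied by an inverse-document-frequency factor. This scheme is chosen only as a familiar example; it is an assumption, not necessarily the weighting used in the experiments described later, and the toy collection and function name are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection; tokens are assumed to be already stemmed.
collection = {
    "doc1": "oil price oil texas".split(),
    "doc2": "texas govern office".split(),
    "doc3": "oil govern".split(),
}

def tf_idf_vectors(docs):
    """Map each document to {term: tf * log(N / df)}."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # how often each term occurs in this document
        vectors[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return vectors

vectors = tf_idf_vectors(collection)
for doc_id, vec in vectors.items():
    print(doc_id, {t: round(w, 3) for t, w in vec.items()})
```

A Boolean representation would record only whether 'oil' is present in doc1; the vector also records that it occurs twice and that it is less discriminating than a term found in a single document.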

 

4 Research Design

4.1 Problem Statement

Information retrieval has been an interesting topic since long before computers existed. After the advent of computers, IR was among the first areas to be explored that was not based solely on the number-crunching capabilities of the machine. As computers advanced in power and in ability to present graphics, IR systems incorporated graphic features. From the mid-80's, visualizations have been developed to assist the users of interactive IR systems to satisfy their information needs. The advent of the Internet has provided additional pressure to develop efficient retrieval engines as well as effective methods for interfacing these engines with users. It is pertinent to note that users of the Internet and its IR tools form a more diverse audience than users of library-based retrieval systems. In summary, the need for powerful IR systems that can assist users with different levels of expertise and diverse interests has increased. In addition, visualizations have been incorporated into many IR systems but whether they can be understood by users or whether they can lead to more effective IR interaction is an open question.

The approach taken in this proposal to address the question of understandability of IR visualizations involves the testing of prototype displays. Initial studies have been conducted using these displays coupled with Boolean data and tasks based on Boolean combinations. These results are discussed in Section 2.4. The first study tested two-term displays and the second used three-term displays. This extension involves the use of weighted vector data. Vector representations are the basis of most IR systems in use today. Vectors may be based on raw term frequencies, probabilities, or normalization. Clearly, the Boolean tasks that were used in these preliminary studies are not suitable for testing vector document representations. Task typologies for visual display are available and will be used to develop the tasks for this set of experiments. The aim of this study is to determine how well users can understand displays based on vector representations.

Five types of displays have been tested -- word list, icon list, table, graph, and a visual display based on the VIBE placement algorithm (Olsen et al 1993). The tasks that were applied to these displays were chosen from a set of domain-independent visual tasks.

4.2 Definitions

4.2.1 Information Retrieval

Information retrieval as a discipline concerns itself with the storage and access of information primarily in the form of text documents. The term information retrieval is also used in a more restrictive sense to talk about a particular kind of information seeking task, namely a combination of information access and document procurement. Other parallel activities have been variously categorized to include information organization, query formulation, query reformulation, and browsing. This list is not inclusive and evidence is presented in Section 2.3 of the background material showing that the types of activities that users engage in when seeking information need satisfaction may be categorized using a wide variety of schemes.

Visualizations used in information retrieval systems are usually thought to support browsing or inferencing or query reformulation. Rather than stating the list of potential activities each time that these terms might be used, this paper will often use the more general term, information retrieval.

4.2.2 Visualization types

Experimental information retrieval visualizations can be categorized in various ways. Lin (1997) suggests that there are four types: hierarchical, network, scatterplots, and maps. For the purposes of this proposal, a breakdown by dimensionality of the rendered data seems appropriate. In the simple visualizations proposed herein, it is most appropriate to imagine existing systems, such as VIBE (Olsen et al 1993), BIRD (Kim & Korfhage 1994), GUIDO (Nuchprayoon & Korfhage 1994, Nuchprayoon 1996), InfoCrystal (Spoerri 1993), and the series of displays developed at NIST (Cugini et al 1996). All of these systems attempt to render just a few of the dimensions of the underlying hyperdimensional vector. Other systems that attempt to show relationships among much larger parts of the document vectors resemble maps. Examples of these systems are Lin's self-organizing semantic maps (Lin 1991, 1997), SPIRE (Wise et al 1995), and BEAD (Chalmers 1996). Each of these interfaces is created by applying a dimension-reducing algorithm, such as simulated annealing, Kohonen maps, or latent semantic indexing, to the full document vectors. According to Korfhage (1997), low-dimensional systems are properly called reference-point systems. Their primary characteristic is that a few points of interest (POIs) are used as anchor points in the display. The POIs are frequently keywords or query terms selected by the user, but may also represent full documents, user profiles or sets of terms or phrases. Map displays, on the other hand, tend to be used to represent the entire collection or document set. Some of the map systems allow the user to see sequential snapshots (SPIRE: Wise et al 1995) or to change the granularity of the detail (Lin 1997). Each of the map systems is dependent on complicated algorithms that limit their creation in an interactive environment. No doubt as computer systems become faster, the use of maps in interactive IR will increase in utility and variety.
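A reference-point placement can be sketched as a weighted average of the POI positions. The text does not give a formula, so the rule below (each document sits at the centroid of the POIs, weighted by its score on each) is an assumption about how such a display could be computed; the POI coordinates and the function name are hypothetical.

```python
def place_document(weights, poi_positions):
    """Return (x, y) as the weight-proportional average of the POI positions.

    weights: {poi_name: non-negative score of the document on that POI}
    poi_positions: {poi_name: (x, y)}
    """
    total = sum(weights.values())
    if total == 0:
        return None  # unrelated to every POI, so there is nowhere to place it
    x = sum(w * poi_positions[p][0] for p, w in weights.items()) / total
    y = sum(w * poi_positions[p][1] for p, w in weights.items()) / total
    return (x, y)

# Hypothetical layout: three POIs at the corners of a triangle.
pois = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.5, 1.0)}

print(place_document({"A": 3, "B": 1, "C": 0}, pois))  # on the A-B edge, nearer to A
print(place_document({"A": 1, "B": 1, "C": 1}, pois))  # interior of the triangle
```

Under this rule a document with equal scores on two POIs lands at the midpoint of the line joining them, and a document with a score on only one POI lands on that POI.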

4.2.3 Display types

A distinction is made in this paper between various types of displays. The terms that are used are text-based, word-based, tabular, graphical, and visual. A clarification of the use of these terms is useful. Text-based presentations show words in their usual semantic context. This is the usual form for text lists returned by Internet search engines. Word-based displays show each occurrence of the query term in a document listing. Tables are defined here as two-dimensional listings in which the values of the elements are numeric. Graphical displays are defined as the set of usual graph types, e.g., pie chart, bar chart, histogram, and scatterplot. Visual displays, in contrast, are composed of icons and connecting lines that do not have the normal Cartesian coordinate interpretation.

Figure 4-2: Multidimensional scaling results of Lohse et al (1990) showing visual display types. [Figure not reproduced; axes: data type (discrete to continuous) and cognitive effort (low to high).]

Lohse et al (1990) investigated how visual displays are categorized by subjects who sorted 40 display instances. The results of hierarchical clustering analysis of the data showed five clusters: icons, maps, diagrams, network charts, and graphs and tables. Multidimensional scaling (MDS) of the same data revealed two dimensions, which Lohse et al (1990) named cognitive effort and discreteness of data. Both scales range from low to high. The types of displays chosen for investigation in the current proposal seem to correlate best with the tables, graphs, and network charts. Figure 4-2 shows the MDS of Lohse et al (1990) and has been annotated to show the region covered by the proposed display types.

4.3 Hypotheses

The development of hypotheses for this study entails consideration of each of the dependent and independent variables. The dependent measures of performance are number of correct answers and time-to-completion of a task set, where a set refers to all the tasks for a single display type. The measure of preference is the user's rankings of each display. The independent variables are display type, order of presentation, individual task, and scenario difficulty. Scenario difficulty is defined as the number of terms depicted in a display, i.e., 2-term or 3-term. Subjects will perform the experimental tasks with a single level of difficulty.

The null form of the main hypothesis regarding display type and understandability is:

H1: Performance of tasks will not differ significantly regardless of display used.

H1a: The number of correct answers will not differ significantly regardless of display used.

H1b: Time-to-completion will not differ significantly regardless of display used.

Rejection of this hypothesis will indicate that there are differences with respect to understandability and usability of the displays. If the hypothesis is rejected, post-hoc testing will be performed to determine the differences. The data for both number of correct answers as well as time to completion will be subjected to this analysis.

If users perform differently with different displays, it is possible that the difference is due to the order in which the displays were presented. The null hypothesis statement is:

H2: The order of presentation of a display does not affect the correctness of answers.

H2a: The number of correct answers will not differ significantly regardless of order of the display presentation.

H2b: Time to completion will not differ significantly regardless of the order of the display presentation.

If this hypothesis is rejected, it will provide evidence that users can learn how to perform tasks by using alternate display formats. Since the types of tasks being presented to users may not be the types of tasks that they are used to performing, it might be possible that there is some general trend to learning all displays. On the other hand, it might be the case that only displays that are difficult to use when presented early in the series are learned by using other displays, while some displays might be easy to use at first sight. Rejection of hypothesis H2 will allow subsequent analysis for these types of correlations.

To this point, the tasks for each display have been grouped to provide a total score. It is possible that there is a range of difficulty of tasks. The formulation of the domain-independent tasks includes parameter lists that range in number from one to three. It is possible that the number of parameters is predictive of the difficulty of the tasks. The null hypothesis to support testing of this idea is:

H3: Scores for individual subtasks are not significantly different from each other.

Rejection of this hypothesis allows exploration of a secondary hypothesis:

H3a: There is no correlation between the number of parameters for an individual subtask and users' performance with the task.

Taken together, rejection of these hypotheses allows the conclusion that test formats can be produced that range from very easy to very difficult depending on the cardinality of the parameter list of a task.
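A test of H3a needs some measure of association between parameter count and performance. The text does not name one, so the sketch below uses Spearman's rank correlation as an assumed choice that tolerates the many tied parameter counts. The parameter counts are the 2-term values from Table 5-5; the scores are invented placeholders included only so that the code runs, and they are not results from the study.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Parameter counts for the ten 2-term tasks, copied from Table 5-5.
param_counts = [2, 2, 2, 4, 5, 20, 2, 20, 2, 2]
# INVENTED placeholder proportions correct, for illustration only (not study results).
placeholder_scores = [0.9, 0.8, 0.85, 0.7, 0.6, 0.4, 0.9, 0.5, 0.8, 0.75]
print(round(spearman(param_counts, placeholder_scores), 2))
```

A strongly negative value would be evidence against H3a in the direction the text anticipates, with more parameters going with poorer performance.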

The final hypothesis related to performance measures is to determine the effect of scenario difficulty. In its null form, the hypothesis is:

H4: The number of terms depicted in a display does not affect either the time to perform a task set or the number of correct answers.

The initial studies showed that visual displays were used less well than the other prototype displays in 2-term situations but were advantageous in the more difficult, 3-term display series. The important sub-hypothesis related to this point would try to determine whether this trend persists in the vector condition.

H4a: Each three-term display is associated with poorer performance than the paired two-term display.

The final major hypothesis of the proposed research is related to the measure of user preferences.

H5: There is no significant difference among the displays with respect to user preference.

H5a: Users express no preference for displays with which they perform better.

H5b: Users express no preference for displays based on order of presentation.

H5c: Users express similar preference rankings regardless of scenario difficulty.

4.4 Limitations

An issue to consider is the range of display types that might be tested. In the preliminary study the displays that were tested were text-based (full-text and word lists), tabular, graphical (scatter plot), icon list, and a visual method based on the VIBE positioning algorithm. These display types are representative of two of the five types of displays described by Lohse et al (1990). The other types are diagrams, icons and maps. Perhaps a Venn diagram would be an appropriate instantiation of the diagram category. The 'icon' type does not have an obvious mapping into the IR domain. 'Maps' are only suitable for rendering higher dimensional data than the simple 2- and 3-term conditions being studied here. Since this study is based on a reference-point model of information retrieval and visualization, further investigation would be required to show whether other display types could be successfully tested with the proposed method.

The nature of the study might appear at first glance to be a 'strawman' situation since 'visual' tasks are being tested in a 'visual' environment. The argument could be made that it is only reasonable that such tasks would be performed better than non-visual tasks. However, until a taxonomy of these non-visual tasks exists, it is not possible to compare performance across task types. In addition, there is no evidence that these 'visual' tasks are performed better with a 'visual' interface. In fact, it is possible that all tasks are performed better with full-text and that only a subset of all tasks is suitable for visual presentations.

5 Methodology

The topics that will be discussed in this section include the selection and processing of the document data, selection and implementation of the various displays, recruitment of subjects, the procedure for administering the test, and statistical methods.

5.1 Document data generation

The document set that was used for creating the various displays used in this study was selected from the TREC collection. Specifically, the AP 1989 newswire collection was chosen because:

The documents were extracted and placed in individual files. Raw frequency counts of word stems were gathered, while using a 443-term stop-list. The embedded SGML markup tags were added to the stop-list. The resulting list of >18,000 terms was trimmed by removing words that occurred in >95% or <15% of the documents. The remaining terms were subjected to term discrimination value analysis (Willett 1985) using the cosine measure.
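The trimming step can be sketched as a filter on document frequency. The thresholds are the ones stated above (drop terms found in more than 95% or fewer than 15% of documents); the function name, the tiny stop-list, and the toy documents are hypothetical, and stemming is assumed to have been done already.

```python
from collections import Counter

def trim_vocabulary(doc_tokens, stop_list, low=0.15, high=0.95):
    """Keep non-stop-list terms whose document-frequency fraction is within [low, high]."""
    n_docs = len(doc_tokens)
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens) - stop_list)  # count each term once per document
    return {term for term, n in df.items() if low <= n / n_docs <= high}

# Hypothetical 10-document collection of already-stemmed tokens.
toy_docs = (
    [["the", "news", "oil"]] * 3
    + [["the", "news"]] * 6
    + [["news", "rare"]]
)
kept = trim_vocabulary(toy_docs, stop_list={"the"})
print(kept)
```

Here 'news' appears in every document and 'rare' in only one, so both fall outside the band, while 'the' is excluded by the stop-list.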

Figure 5-1: Specialized WebVIBE interface used to select term sets

The first display that was created was the 'spring' whether linear or triangular. The remaining displays were generated automatically from the data selected for inclusion in the respective 'spring' display. This visual method for creating displays guaranteed that features that were to be selected for in the task mapping activity would exist in a particular set of documents. The data set created from the AP newswire collection was loaded into a customized WebVIBE application (Figure 5-1). Terms were chosen from the term discrimination value list from the regions that exhibited very positive and/or very negative values. Positive values indicate terms that tend to pull documents apart in the hyperdimensional space created from the key terms, while negative values indicate terms that will tend to aggregate documents as a whole but that prevent particular documents from joining the clusters. Pairs of terms and triplets of terms were entered into the interface of WebVIBE and the threshold value was adjusted so that approximately 10 documents per keyterm remained in the display. Approximately 20 sets of terms, thresholds, and document vectors were collected for each of the 2-term and 3-term experiments.
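The sign convention for term discrimination values can be illustrated with a centroid-based formulation: compute the average cosine similarity of the documents to the collection centroid, remove one term, recompute, and take the difference. This is one commonly described formulation and is offered as an assumption; whether it matches the exact computation used to build the term list is not stated in the text. The toy vectors and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def density(vectors):
    """Average cosine similarity between each document and the collection centroid."""
    centroid = {}
    for vec in vectors:
        for t, w in vec.items():
            centroid[t] = centroid.get(t, 0.0) + w / len(vectors)
    return sum(cosine(vec, centroid) for vec in vectors) / len(vectors)

def discrimination_value(vectors, term):
    """Density without `term` minus density with it.

    Positive: removing the term packs documents closer, so the term spreads them apart.
    Negative: removing the term loosens the space, so the term pulls documents together.
    """
    without = [{t: w for t, w in vec.items() if t != term} for vec in vectors]
    return density(without) - density(vectors)

# Hypothetical raw-frequency vectors.
toy_vectors = [
    {"news": 1, "oil": 3},
    {"news": 1, "texas": 2},
    {"news": 1, "oil": 1, "texas": 1},
]
for term in ("news", "oil", "texas"):
    print(term, round(discrimination_value(toy_vectors, term), 3))
```

In this toy collection 'news' occurs in every document and receives a negative value, while 'oil' receives a positive one, matching the interpretation of positive and negative values given above.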

5.2 Displays

The displays that were used for this study were the same as those used in preliminary studies discussed earlier (Section 3), namely 'word', 'icon', 'table', 'graph' and 'spring' displays. As in the previous studies, the 'graph' display was only tested in the 2-term condition so that the problems of 3-D displays could be avoided. Figures 5-2 and 5-3 show typical examples of each of the chosen types for the 2-term and 3-term test, respectively. The full set of experimental displays can be found in Appendix B.

 

Figure 5-3: Examples of each of the 5 display types used in the 2-term study

 

Figure 5-4: Examples of each of the 4 display types used in the 3-term study

 

5.3 Tasks

Elemental visual tasks were chosen from the taxonomy of Zhou & Feiner (1998). The full taxonomy contains approximately 50 tasks. In order to create a test that could be taken within a target of one hour, it was necessary to prune the task tree. The rationale applied to the pruning was

The result of the selection process is shown in Table 5-1. The boldface entries are the tasks that were selected for inclusion. It should be noted that, in general, the super-categories were selected, but the actualization of the test questions often relied on a formulation that was captured as a sub-type of the selected category. For example, the Identify task has Name, Portray, Individualize, and Profile as subtasks (Table 2-3). The test question developed for this item was of type Profile.

Table 5-1: Comparison of taxonomic categories

Wehrend & Lewis (1990):
Associate; Correlate; Locate; Distinguish; Rank; Categorize; Cluster; Compare within entities; Compare between relations; Identify; Distribution

Zhou & Feiner (1998):
Associate<?x, ?y>; Correlate<?x1,..,?xn >; Locate<?x, ?locator>; Distinguish<?x, ?y>; Rank<?x1,..,?xn ,?attr>; Categorize<?x1,..,?xn>; Cluster<?cluster,..,?xn>; Compare<?x, ?y>; Identify<?x, ?identifier>; Encode<?x>; Background<?x, ?bkg>; Emphasize<?x,?x-part >; Reveal<?x,?x-part >; Generalize<?x1,..,?xn >; Switch<?x, ?y>

Table 5-2 demonstrates the first stage in mapping from the above taxonomic categories to actual question or task formulations. These statements are quite general since specific document data was not considered.

Table 5-2: First level mapping from taxonomic categories to generalized task statements

#     Type           Formulation
1     Compare        Which keyterm has the most documents about ONLY it?
2     Associate      Which key term is associated with more documents?
3     Distinguish    One of the documents is unlike any of the others. Can you identify it? What is different about it?
4     Rank           Rank document w, x, and y with respect to the amount of term B that they contain.
5     Cluster        Which of the following sets are similar? What is the basis for your judgment?
6     Correlate      What significance do you attach to the indicated region (or point in a list)? [Region is a gap where no documents are found?]
7     Locate         If a new document was discovered that had these characteristics (x, y), where would it be placed in the display? [between which two labeled documents]
8     Categorize     What general category would you place the indicated documents in? [show documents that are related to a single POI]
9     Identify       Find a document that is likely to be about both terms in equal proportion.
10    Compare        If you definitely wanted to read documents that had BOTH [ALL] terms in them, which documents would you ignore?

The following tables show the actual questions presented to the subjects. Table 5-3 refers to the 2-term test and Table 5-4 to the 3-term test.

Table 5-3: Second level mapping from generalized to specific task statements - 2-term questions

  1. Are there more documents that contain ONLY the term Romania or ONLY the term Czechoslovakia?
  2. Which is the most frequent keyterm in this set of documents? A. Oil; B. York
  3. One of the documents is unlike any of the others. Can you identify it? Place the document number in the text box.
  4. Rank documents A, B, and C with respect to the amount of term 'Soviet' that they contain.
  5. Which of the following documents are most similar with respect to the relative amount of the keyterms?
  6. Which of the following statements is true?
    A. There are no documents that contain roughly equal amounts for the two terms.
    B. If a document talks about 'Oil' then it also talks about 'Texas'.
    C. 'Texas' and 'Oil' are not very highly related.
    D. A and C
    E. All of the above
  7. Location
  8. Which of the following statements is true?
    A. The indicated documents contain 'Govern' but not 'Office'.
    B. The indicated documents contain 'Office' but not 'Govern'.
    C. The indicated documents contain either 'Office' OR 'Govern' but not both.
    D. The indicated documents contain 'Office' AND 'Govern'.
    E. None of the above.
  9. Find a document that is most nearly about both terms in equal proportion.
  10. If you definitely wanted to read documents that had BOTH terms in them, which documents would you choose? Select all that apply.

Table 5-4: Second level mapping from generalized to specific task statements - 3-term questions

  1. Are there more documents that contain ONLY the term 'earthquake' or ONLY the term 'California' or ONLY the term 'death'?
  2. Which is the most frequent keyterm in this set of documents? A. Vatican; B. Embassy; C. Noriega
  3. One of the documents is unlike any of the others. Can you identify it? Place the document number in the text box.
  4. Rank documents A, B, and C with respect to the amount of term 'Company' that they contain.
  5. Which of the following documents are most similar with respect to the relative amount of the keyterms?
  6. Which of the following statements is true?
    A. At least one document contains all three terms.
    B. At least one document contains the terms 'Arab' and 'bomb'.
    C. 'Vatican' and 'Arab' are not very highly related.
    D. B and C
    E. All of the above.
  7. Location
  8. Which of the following statements is true?
    A. The indicated documents contain 'Nicaragua' but not 'Bush' and not 'America'.
    B. The indicated documents contain 'Bush' but not 'Nicaragua' and not 'America'.
    C. The indicated documents contain 'America' but not 'Bush' and not 'Nicaragua'.
    D. None of the above but the indicated documents are about a single keyterm.
    E. None of the above.
  9. Find a document that is most nearly about all three terms in equal proportion.
  10. If you definitely wanted to read documents that had ALL three terms in them, which documents would you choose? Select all that apply.

The final step in creating and characterizing the tasks to be performed in these studies was to determine the number of parameters that were involved in each task instantiation. These data are presented in Table 5-5. The Compare, Associate, Distinguish, Locate, and Identify tasks require only two parameters as defined in Table 5-1. In each Ranking task, subjects were asked to rank three documents according to a single criterion -- a total of four parameters. Clusters contained three documents in the 2-term study and four documents in the 3-term study. Correlation and Categorization required judgments across the entire document set presented in the display, leading to parameter lists of the same cardinality as the size of the document set.

Table 5-5: Parameter number for specific tasks

                              Parameter Number
Taxonomic Category            2-term    3-term
Compare<?x, ?y>               2         2
Associate<?x, ?y>             2         2
Distinguish<?x, ?y>           2         2
Rank<?x1,..,?xn ,?attr>       4         4
Cluster<?cluster,..,?xn>      5         4
Correlate<?x1,..,?xn >        20        30
Locate<?x, ?locator>          2         2
Categorize<?x1,..,?xn>        20        30
Identify<?x, ?identifier>     2         2
Compare<?x, ?y>               2         2

 

5.4 Subjects

Subjects were recruited for the study by advertising in the University of Pittsburgh student newspaper (The Pitt News), by posting notices on bulletin boards in University buildings and at kiosks in city neighborhoods, and by distributing handbills at the entrance to the Information Sciences Building between 9 and 11 a.m. Samples of the recruitment documents can be found in Appendix A. The rationale for distributing handbills in mid-morning is that the Information Sciences building is used for general undergraduate classes during that time. Later distribution, when a high proportion of the classes would be from departments of the School of Information Sciences, would have skewed the population toward these students.

5.5 Administration of test

Subjects were instructed to access a particular URL, which served as the starting point of the test. The initial screen gave general information and was intended primarily for users who accessed the site from a remote location. When the subject selected 'Continue', he or she was randomly assigned to either the 2-term or the 3-term experimental condition. The program that performed the randomization can be found in Appendix C. Subjects then received overall instructions for the particular type of test that they would be performing.

Upon selecting 'Continue', the subject was given a unique identification number that was passed as a hidden field in all subsequent forms. This number indicated the order in which the various displays would be delivered. The program that managed the delivery of this and all subsequent pages can be found in Appendix C. At this point, the user received specific instructions for the first display type. This was followed by 10 pages, each of which contained a question, a 'Submit' button, and a display configuration. The process of instruction and 10 displays was repeated for each display to be evaluated, which means that 55 pages were delivered to 2-term study participants and 44 pages to those inducted into the 3-term study.
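The page counts follow from one instruction page plus ten question pages per display. The sketch below reproduces that arithmetic and also shows one way an identification number could carry the presentation order. The digit-per-panel encoding is an inference from the sample record in Table 5-6, where ID 2431 is followed by panel 2 on pass 0, panel 4 on pass 1, and panel 1 on pass 3; the actual scheme is in Appendix C and may differ, and all function names here are hypothetical.

```python
import random

QUESTIONS_PER_DISPLAY = 10   # stated in the text
INSTRUCTION_PAGES = 1        # one instruction page precedes each display's questions

def pages_delivered(n_displays):
    """Pages shown for the display portion of a session."""
    return n_displays * (INSTRUCTION_PAGES + QUESTIONS_PER_DISPLAY)

def make_subject_id(n_displays, rng=random):
    """Hypothetical ID scheme: a random permutation of panel numbers 1..n written as digits."""
    order = rng.sample(range(1, n_displays + 1), n_displays)
    return "".join(str(panel) for panel in order)

def display_order(subject_id):
    """Recover the presentation order from an ID built by make_subject_id."""
    return [int(ch) for ch in str(subject_id)]

print(pages_delivered(5), pages_delivered(4))  # 55 44
print(display_order("2431"))                   # [2, 4, 3, 1]
```

Encoding the order in the ID would let a stateless server decide which display to send next from the hidden field alone, which is consistent with the description of the ID being passed in every form.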

The next step was to deliver a post-test questionnaire. The data in the survey was of several types:

    1. Ranking of the displays indicating the user's preference.
    2. Demographic information about the user, the user's computer experience, and the computer equipment being used for the test session.
    3. User assessment of displays that were 'hard', 'easy', 'fun', 'annoying, boring or tedious'.
    4. Open text block for commenting.

Each time that a new page was delivered, a web-based server program collected any data sent by the user client. Table 5-6 shows a sample record from a single subject's session. Each element in the record was time-stamped. The timestamps allowed calculation of the elapsed time for each activity: answering a question, reading a set of instructions, or completing the post-test questionnaire.

The final screen presented the subject with instructions for receiving payment for participating in the study.

Table 5-6: Segment of data captured from single subject in 3-term study

Mon Nov 23 21:55:24 1998

REMOTE_HOST = 136.142.22.173

REMOTE_ADDR = 136.142.22.173

HTTP_USER_AGENT = Mozilla/4.04 [en] (Win95; U)

BROWSER = Netscape 4.04+%5Ben%5D+%28Win95%3B+U%29

21:58:12 ID=2431&panel=2&pass=0&question=-1&answer=

21:58:32 ID=2431&panel=2&pass=0&question=0&answer=b

21:59:03 ID=2431&panel=2&pass=0&question=1&answer=c

22:00:20 ID=2431&panel=2&pass=0&question=2&answer=13

22:01:40 ID=2431&panel=2&pass=0&question=3&answer=cba

22:03:10 ID=2431&panel=2&pass=0&question=4&answer=e

22:04:40 ID=2431&panel=2&pass=0&question=5&answer=d

22:05:36 ID=2431&panel=2&pass=0&question=6&answer=b21

22:07:07 ID=2431&panel=2&pass=0&question=7&answer=e

22:07:53 ID=2431&panel=2&pass=0&question=8&answer=22

22:08:45 ID=2431&panel=2&pass=0&question=9&answer=Anrsu

22:09:55 ID=2431&panel=4&pass=1&question=-1&answer=

...

22:30:58 ID=2431&panel=1&pass=3&question=7&answer=e

22:31:33 ID=2431&panel=1&pass=3&question=8&answer=

22:33:1 ID=2431&panel=1&pass=3&question=9&answer=Aacdh

22:37:24 ID=2431&pass=5&r1=table&r2=icon&r3=word&r4=spring&easy=i&hard=s&fun=n&annoy=w&distraction=N&sname=<name+removed+for+privacy>&age=22&year=staff&gender=female&area=1&major=Biology&esl=Y&netuse=Weekly&compuse=A&competence=average&speed=b&monitor=a&modem=g&comment=None
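Records in the form shown in Table 5-6 can be split into a timestamp and a set of named fields with the standard library. This is a minimal sketch of such a parser, not the program from Appendix C; the function name is hypothetical, and it assumes each record is one line with a single space between the time and the form data.

```python
from urllib.parse import parse_qsl

def parse_record(line):
    """Split 'HH:MM:SS key=value&key=value...' into (timestamp, dict of fields)."""
    timestamp, _, query = line.strip().partition(" ")
    # keep_blank_values=True preserves empty answers such as 'answer='.
    fields = dict(parse_qsl(query, keep_blank_values=True))
    return timestamp, fields

ts, fields = parse_record("22:01:40 ID=2431&panel=2&pass=0&question=3&answer=cba")
print(ts, fields)

# An instruction page (question=-1) carries an empty answer field.
ts0, fields0 = parse_record("21:58:12 ID=2431&panel=2&pass=0&question=-1&answer=")
print(ts0, fields0)
```

Field values come back as strings, so numeric fields such as `question` or `age` would need explicit conversion before analysis.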

5.6 Statistical Considerations

Once all the data from all subjects in both studies was collected, the records were parsed (see Appendix C) for preliminary analysis. Fields were created to designate whether the subject performed the experiment locally (i.e., in the Usability Lab) or remotely, and whether the subject received payment. All subjects tested locally were paid at the time of performing the test, but remote subjects were paid only if they appeared to request payment. The initial analysis flagged four subjects who appeared to have negative times for a single question. This aberration was found to be due to remote subjects performing the experiment around midnight: the rudimentary time calculation did not account for a session that spanned two calendar days.
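The negative times arise because the per-question records carry only a time of day. A minimal sketch of a calculation that tolerates a session crossing midnight is shown below; it assumes consecutive records are less than 24 hours apart, and the function name is hypothetical rather than taken from the parsing program in Appendix C.

```python
from datetime import datetime, timedelta

def elapsed_seconds(start, end):
    """Seconds between two 'HH:MM:SS' stamps, assuming less than 24 hours elapsed."""
    fmt = "%H:%M:%S"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:  # the clock passed midnight between the two stamps
        t1 += timedelta(days=1)
    return int((t1 - t0).total_seconds())

print(elapsed_seconds("21:58:12", "21:58:32"))  # same evening
print(elapsed_seconds("23:59:50", "00:00:10"))  # crosses midnight
```

A naive subtraction of the second pair would give a large negative number; adding one day when the end stamp precedes the start stamp yields the intended 20 seconds in both cases.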

At this point the data were transferred to SPSS for all subsequent analysis. Performance was calculated by two measures: time to complete an activity and correctness of the answer. Data based on time were shown to be suitable for analysis by parametric statistical methods. When answers were pooled to form total scores for a particular display (10 answers) or for a whole experimental session (40 or 50 answers), the criteria for using parametric methods were met. When individual answers were inspected, however, the value was binary (right or wrong); this situation called for non-parametric methods. A similar rationale was used in treating covariates. Age was a continuous variable, while all the other variables were categorical, having at most seven nominal values. Therefore, age was treated parametrically and the others were handled non-parametrically. Where a comparison involved data of both types, the analysis defaulted to a non-parametric treatment.

6 Results

6.1 Subjects

Advertising produced 195 subjects, who were randomly assigned to either the 2-term or the 3-term experimental study. Information about the subjects and their test sessions could be roughly broken down into four categories: demographics, users' computer experience, the computer hardware and software, and other factors related to the test situation. Each of these classes of data is discussed below, with special attention paid to the relative composition of the subgroups.

6.1.1 Demographic Characteristics

Demographic information collected in this study included gender, current educational level, area of study (broadly categorized as physical science, social science, or humanities), and native language. Table 6-1 shows the number of individuals in each category for each of the studies (2-term and 3-term). In addition, percentages are included to facilitate comparison across studies. There were no statistically significant differences between the studies for any of these variables.

 

Table 6-1: Summary of demographic characteristics

                                     2-term          3-term
                                     #       %       #       %
Gender            female             71      59      45      61
                  male               50      41      29      39
Status            Freshman           19      16      8       11
                  Sophomore          31      26      16      22
                  Junior             25      21      17      23
                  Senior             22      18      9       12
                  Masters            3       2       6       8
                  Ph.D.              4       3       5       7
                  staff              3       2       7       9
                  other              14      12      6       8
Area              Physical Science   31      26      17      23
                  Social Science     16      13      18      24
                  Humanities         26      21      10      14
                  Other/NA           48      40      29      39
Native English    No                 14      12      9       12
                  Yes                107     88      65      88

Figure 6-1 shows the distribution of age across the studies. The mean ages of subjects in the 2-term and 3-term studies were 23.2 and 23.6 years, respectively. The median ages were 20 and 21, respectively. Note that the data in the figure are frequencies rather than percentages.

Figure 6-1: Age distribution

The preliminary studies were performed in classroom settings within the Department of Information Science and Telecommunications or in a parallel program at Molde College. When the current data are compared to the data for subjects who participated in the preliminary studies, differences are found. For instance, the ratio of men to women is reversed in the current vector studies (40/60 vs. 60/40). There are significantly more subjects having advanced degrees. The proportion of subjects indicating that English is not their native language is smaller than in the previous studies. Each of these departures from the previous experiments is consistent with sampling from a broader University pool.

 

6.1.2 Computer Experience

This group of attributes includes indicators regarding the amount of time that the subject uses a computer currently, the length of time that he has used a computer and a self-assessment of his skill level. The data shown in Table 6-2 show that the 2-term and 3-term groups are well matched. There were no statistically significant differences between the groups. It is interesting to note that the average subject uses a computer on a daily basis and has been using computer technology for over five years. These subjects somewhat modestly assert a level of expertise that is only rated as 'average'.

Table 6-2: Summary of computer-related personal characteristics

                                 2-term               3-term
                                 Number    Percent    Number    Percent
Network Usage     Daily          66        55         42        57
                  Weekly         40        33         27        36
                  Monthly        5         4          3         4
                  <monthly       8         7          2         3
                  Never          2         2          0         0
Computer Usage    > 5 years      83        69         49        66
                  1-5 years      33        27         22        30
                  6-12 months    1         1          1         1
                  < 6 months     3         2          2         3
                  Never          1         1          0         0
Expertise         Expert         17        14         13        18
                  Average        78        64         53        72
                  Novice         18        15         7         9
                  None           8         7          1         1

These data indicate that subjects recruited from the University community are by-and-large computer-literate and, therefore, the on-line format of these studies should have been familiar to most of them. Analyses will be shown in later sections that attempt to correlate these factors to performance and preferences for the various displays.

 

6.1.3 Equipment Used for Experiment

Since some prior studies (Morse: unpublished results) had shown subjects' preferences to be influenced by the quality of the hardware used for testing, information was gathered about quantifiable attributes of the computers used in this study. The laboratory equipment consisted of 200 MHz Pentium 2 PCs with 21-inch monitors, connected via Ethernet. Both Netscape 4.05 and Internet Explorer 4.0 were used in the Usability Lab.

Table 6-3: Characteristics of computer equipment used for the study

                                              2-term          3-term
                                              #       %       #       %
Hardware
  CPU speed (MHz)         < 66                1       1       0       0
                          66-100              10      8       3       4
                          101-150             8       7       4       6
                          151-200             51      42      27      38
                          > 200               15      12      20      28
                          Unknown             36      30      20      28
  Monitor Size (inches)   <= 14               24      20      10      14
                          15-16               21      17      22      31
                          17-18               22      18      11      15
                          19-20               10      8       5       7
                          >= 21               44      36      26      36
  Modem Speed (kbps)      < 14.4              2       2       0       0
                          14.4                4       3       0       0
                          28.8                7       6       4       6
                          > 28.8              33      27      20      28
                          None                44      36      25      35
                          Unknown             31      26      25      35
Software
  Browser                 Netscape            95      79      61      85
                          Internet Explorer   26      21      13      18
  Version                 3                   13      11      6       8
                          4                   108     89      68      94

The data in Table 6-3 include information from subjects who performed the studies in the Usability Laboratory as well as those who found another site from which to work. In order to determine the type of equipment that is used by free-ranging subjects, the data has been reformulated in Table 6-4 to include only data from subjects who worked outside the laboratory. This data allows an estimate of the type of computing equipment used by individuals associated with the University community. The median subject in this sample has access to a computer that operates at >200 MHz via a modem that runs at > 28.8 kbps and views his work on a 15-16 inch monitor. In addition, it is interesting to note that 75% of the subjects used Netscape as their browser while the remaining 25% employed Internet Explorer.

Table 6-4: Hardware used by participants working outside the laboratory (n=120)

                                   Percentage
CPU speed (MHz)         < 66       1
                        66-100     10
                        101-150    10
                        151-200    16
                        > 200      23
                        Unknown    40
Monitor Size (inches)   <= 14      28
                        15-16      36
                        17-18      28
                        19-20      8
                        >= 21      2
Modem Speed (kbps)      < 14.4     2
                        14.4       3
                        28.8       8
                        > 28.8     43
                        None       3
                        Unknown    41

6.1.4 Factors Related to the Test Situation

Table 6-5 presents a summary of information about the actual test taking. Subjects who performed the study in the Usability Laboratory are recorded as 'local', while all others are tallied as 'remote'. No effort has been made to sub-categorize the remote locations, although IP addresses were stripped off and stored. All subjects who performed the study locally were paid upon completion. Subjects who worked from another location were instructed about how to be paid for their participation; 21 subjects (~10%) did not appear to request payment. The final information considered in this category is self-assessment of whether distractions occurred during the study. Of the 31 subjects reporting distractions, eight took the test locally and 23 at other sites.

Table 6-5: Miscellaneous factors

 

 

                         2-term    3-term
Location       Remote    74        46
               Local     47        28
Paid           No        14        7
               Yes       107       67
Distracted?    No        101       63
               Yes       20        11

Statistical analysis of this data showed that there were no significant differences between the 2-term and 3-term study populations. In addition, these factors were not correlated with either of the performance measures or with the preference measures.

6.2 Comparison of Overall Time and Correctness Measures

In order to determine whether the two performance measures employed in this study (time to completion and number of correct answers) were correlated, the data for overall test performance on both scales were analyzed visually and statistically. Figures 6-2 and 6-3 show the results for the 2-term and 3-term studies, respectively. Open squares indicate outliers. Diamonds show data for the remaining subjects. The trendline shows the association between measures for the diamonds. This comparison shows that the primary measures used in this study are not correlated. In other words, performance measured by time to complete a task is not predictive of the score that the subject is likely to achieve. Subjects who completed the total battery of tasks in a relatively short amount of time were no more likely to achieve a high score than subjects who took longer. Similarly, subjects who scored particularly well or particularly poorly were not associated with skewed performance times. The Pearson correlation coefficients were 0.038 and 0.177 for the 2-term and 3-term data, respectively; neither value was statistically significant.
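For readers who wish to reproduce this kind of check, the following is a minimal sketch of the Pearson correlation coefficient between per-subject total time and total score. The analysis reported here was run in SPSS; the function name `pearson_r` and the sample values are illustrative assumptions, not study data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical per-subject totals (seconds, number correct); not study data.
total_time = [1450.0, 1720.0, 1310.0, 2050.0, 1600.0]
total_score = [31, 29, 33, 30, 28]
print(round(pearson_r(total_time, total_score), 3))
```

A coefficient near zero, as reported for both studies, means that knowing how long a subject took gives essentially no linear information about how many answers were correct.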

Figure 6-2: Relationship of time to completion and correctness for 2-term study.

Figure 6-3: Relationship of time to completion and correctness for 3-term study.

Visual inspection of Figures 6-2 and 6-3 reveals three outlying points - one in the 2-term and two in the 3-term study. Linear regression analysis flagged the existence of these outliers. Data from these subjects has been excluded from subsequent analysis. It should be noted, however, that the data in Section 6.1 has not been adjusted for outlying data.

From the distribution of values, it appears that time exhibits a wider range of values while correctness is more constrained. An inference could be made that analyses in subsequent sections will show that time is a more sensitive measure. It might also be suggested that the type of test that was administered was quite easy and that subjects performed too well to allow correctness to be discriminating. A final observation regarding measures is that the Power Analysis that had been used to estimate sample size for these studies was based on timing data.

6.2.1 Correlation with Demographic Variables

Of all the subject characteristics collected, only age, status (year in school or highest degree completed), and English as a native language were correlated with performance. In both the 2-term and 3-term studies, age was positively correlated with performance measured by time to completion (p<0.001 for both study types) but not with total score. Educational status was similarly related (p<0.05). Subjects whose native language was not English also required more time to answer the question sets, but they achieved scores that were not significantly different from native English-speakers. There was a correlation among these factors. Non-native English speakers tended to be younger. People who had advanced degrees tended to be older and subjects who claimed 'Other' for this category were older. The important observation is that a few subject characteristics were associated with performance as measured by time, but no characteristic was correlated either positively or negatively with correctness of answers.

6.3 Instruction Times

Subjects were presented with a short description of an upcoming display type. The material consisted of an explanation of the key elements in the display and an example of how it could be interpreted. When the subject was finished using this information, he submitted a request for the first display of this type. The time elapsed was captured and labeled as instruction time. Table 6-6 shows a statistical summary of the data. On average, the instructional material was viewed for less than a minute. The amount of time spent learning about a display was similar for the word, icon, and table display and the triangular 'spring' display used in the 3-term study. In the 2-term study, both the graph and the linear 'spring' required significantly longer times.

Table 6-6: Time spent on instruction page (seconds; Mean ± S.E.M.)

          2-term (n=120)    3-term (n=72)
Word      35.35±2.91        39.30±4.39
Icon      35.17±2.46        32.93±3.05
Table     35.60±2.92        32.49±3.04
Graph     54.38±3.44*       NA
Spring    51.76±4.55*       35.14±3.14

*: p<0.05 compared with displays in the same column without an asterisk.

These longer times seem to indicate a degree of novelty of the displays. That the 3-term 'spring' was not accompanied by a longer instructional period was unexpected. It might be conjectured that the 'triangle' was less confusing than its 'linear' counterpart, but no data was gathered that could support or refute this idea.

6.4 Performance with Respect to Displays

The first hypothesis proposed in this study was:

H1: Performance of tasks will not differ significantly regardless of display used.

H1a: The number of correct answers will not differ significantly regardless of display used.

H1b: Time to completion will not differ significantly regardless of display used.

As shown in Section 6.2, correctness and time to completion are not correlated measures and, therefore, should be treated separately. The data for timing will be presented in Section 6.4.1 and for correctness in Section 6.4.2.

Although this section is primarily intended to address H1, it is convenient to compare 2-term and 3-term situations. Therefore, in each of the major sections, beginning with this one on display comparisons, there will be an attempt to address issues that pertain to evaluation of H4. As presented in Section 4.3,

H4: The number of terms depicted in a display affects neither the time to perform a task set nor the number of correct answers.

H4a: Each three-term display is associated with poorer performance than the paired two-term display.

6.4.1 Time to completion

The results of the analysis of time to completion with respect to display type are shown graphically in Figure 6-4. Several important observations can be made upon inspecting the data. First, for the 2-term study, there are significant differences among the displays with respect to performance times. Analysis of variance showed a p value < 0.001 for this comparison. Within-subjects contrasts are summarized in Table 6-7; using the 'spring' as the pivot case, all of the other display types are shown to take a significantly longer time. Another analysis was performed (data not shown) using the Word display as the pivot group; it showed that all of the other displays supported faster performance. Pairwise comparisons of the data are summarized in Table 6-8. Inspection of the table confirms the findings of the previous analysis. Generally, for the 2-term displays, the Word display is slowest, the 'spring' is fastest, and the other displays are intermediate.

Figure 6-4: Comparison of time to completion of 10-question task set vs. display type.

Table 6-7: Within-subject contrasts for 2-term displays

Contrasted Sets     F         Sig.
Word vs. Spring     37.800    .000
Icon vs. Spring     14.733    .000
Table vs. Spring    23.724    .000
Graph vs. Spring    8.386     .005

 

Table 6-8: Pairwise comparisons for 2-term displays

Base Display    Comparison Display    Mean Difference    Std. Error    Sig.
Word            Icon                  75.767             26.661        .053
                Table                 54.158             23.926        .254
                Graph                 92.033             22.422        .001
                Spring                142.625            23.198        .000
Icon            Word                  -75.767            26.661        .053
                Table                 -21.608            22.648        1.000
                Graph                 16.267             20.702        1.000
                Spring                66.858             17.418        .002
Table           Word                  -54.158            23.926        .254
                Icon                  21.608             22.648        1.000
                Graph                 37.875             18.856        .468
                Spring                88.467             18.163        .000
Graph           Word                  -92.033            22.422        .001
                Icon                  -16.267            20.702        1.000
                Table                 -37.875            18.856        .468
                Spring                50.592             17.470        .045
Spring          Word                  -142.625           23.198        .000
                Icon                  -66.858            17.418        .002
                Table                 -88.467            18.163        .000
                Graph                 -50.592            17.470        .045

The second major point to be made in this section pertains to the 3-term displays. The four displays were shown by ANOVA to be significantly different (p<0.001). Within-subject contrasts, using the 'spring' display as the pivot case, showed it to be highly different from each of the other displays (Table 6-9). Further analysis by pairwise contrasts (Table 6-10) showed that the Word and Table displays were roughly equivalent in terms of speed of performance, while the icon display was faster and the 'spring', once again, was the fastest.

Table 6-9: Within-subjects contrasts for 3-term displays

Contrasted Sets     F         Sig.
Word vs. Spring     50.374    .000
Icon vs. Spring     18.356    .000
Table vs. Spring    34.234    .000

 

Table 6-10: Pairwise comparisons for 3-term displays

Base Display    Comparison Display    Mean Difference    Std. Error    Sig.
Word            Icon                  117.611            39.516        .024
                Table                 74.611             40.060        .400
                Spring                223.347            31.469        .000
Icon            Word                  -117.611           39.516        .024
                Table                 -43.000            28.868        .845
                Spring                105.736            24.679        .000
Table           Word                  -74.611            40.060        .400
                Icon                  43.000             28.868        .845
                Spring                148.736            25.421        .000
Spring          Word                  -223.347           31.469        .000
                Icon                  -105.736           24.679        .000
                Table                 -148.736           25.421        .000

The final major observation for the data is a comparison across study types, which is focussed on evaluating H4. The data are shown in Figure 6-4 and the statistical analysis in Table 6-11. The data were analyzed by repeated-measures ANOVA using study type as the Between-subjects factor. For the word, icon, and table displays, the 3-term condition required more time for the subjects to complete than the corresponding 2-term condition. The results for the 'spring' display, however, did not achieve significance (p=0.086).

Table 6-11: Effect of display type on time to complete task set - 2-term vs. 3-term

Display type    F         Significance
Word            8.643     .004
Icon            6.581     .011
Table           10.126    .002
Spring          2.979     .086

H4a states that "Each three-term display is associated with poorer performance than the paired two-term display". Because this formulation is directional, it allows the use of a one-tailed test rather than a two-tailed test. Application of the one-tailed significance test yields p<0.05 for the 'spring' in the comparison of 2-term vs. 3-term displays.
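The arithmetic behind the one-tailed result can be sketched as follows. The sketch assumes that each between-study comparison has a single numerator degree of freedom, so that the F test is equivalent to a two-tailed t test and the one-tailed p value is half the two-tailed value when the observed difference lies in the predicted direction. The function name `one_tailed_p` is hypothetical.

```python
def one_tailed_p(two_tailed_p, effect_in_predicted_direction):
    """Convert a two-tailed p value from a symmetric test to a one-tailed one.

    Assumes a symmetric test statistic (e.g. t, or F with one numerator
    degree of freedom); this is an assumption, not a reported detail.
    """
    if not 0.0 <= two_tailed_p <= 1.0:
        raise ValueError("p value must lie between 0 and 1")
    half = two_tailed_p / 2.0
    # If the effect runs against the prediction, the tested tail holds
    # the remaining probability mass.
    return half if effect_in_predicted_direction else 1.0 - half


# 'Spring' time to completion, Table 6-11: two-tailed p = .086, and the
# 3-term display was slower, as H4a predicts.
print(round(one_tailed_p(0.086, True), 3))   # 0.043

# 'Spring' correctness, Table 6-16: two-tailed p = 0.074, but the 3-term
# display scored marginally better, opposite to the H4a prediction.
print(round(one_tailed_p(0.074, False), 3))  # 0.963
```

The same conversion shows why a one-tailed test cannot rescue a difference that runs against the stated direction: the p value moves toward 1 rather than toward 0.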

6.4.2 Correctness

Analysis of the second method of assessing performance, correctness of answers, is shown in Figure 6-5 and Tables 6-12 to 6-16.

Figure 6-5: Mean score per 10-question task set vs. display type

Table 6-12: Within-subject contrasts for 2-term displays -- correct

Display Comparisons    F         Sig.
Word vs. Spring        30.829    .000
Icon vs. Spring        .542      .463
Table vs. Spring       1.714     .193
Graph vs. Spring       19.429    .000

Table 6-13: Pairwise comparisons for 2-term displays

Base Display    Comparison Display    Mean Difference    Std. Error    Sig.
Word            Icon                  -1.042             .127          .000
                Table                 -1.125             .111          .000
                Graph                 -1.592             .149          .000
                Spring                -.925              .167          .000
Icon            Word                  1.042              .127          .000
                Table                 -.083              .127          1.000
                Graph                 -.550              .151          .004
                Spring                .117               .158          1.000
Table           Word                  1.125              .111          .000
                Icon                  .083               .127          1.000
                Graph                 -.467              .137          .009
                Spring                .200               .153          1.000
Graph           Word                  1.592              .149          .000
                Icon                  .550               .151          .004
                Table                 .467               .137          .009
                Spring                .667               .151          .000
Spring          Word                  .925               .167          .000
                Icon                  -.117              .158          1.000
                Table                 -.200              .153          1.000
                Graph                 -.667              .151          .000

* The mean difference is significant at the .05 level.

Table 6-14: Within-subjects contrasts for 3-term displays

Display Comparisons    F         Sig.
Word vs. Spring        33.659    .000
Icon vs. Spring        3.924     .051
Table vs. Spring       5.970     .017

 

Table 6-15: Pairwise comparisons for 3-term displays

Base Display    Comparison Display    Mean Difference    Std. Error    Sig.
Word            Icon                  -.931              .190          .000
                Table                 -.875              .202          .000
                Spring                -1.333             .230          .000
Icon            Word                  .931               .190          .000
                Table                 .056               .181          1.000
                Spring                -.403              .203          .309
Table           Word                  .875               .202          .000
                Icon                  -.056              .181          1.000
                Spring                -.458              .188          .102
Spring          Word                  1.333              .230          .000
                Icon                  .403               .203          .309
                Table                 .458               .188          .102

The final analysis in this section is a comparison of the 2-term and 3-term displays. The results are shown in Table 6-16 and Figure 6-5. The statistical results confirm the visual impression that there is no difference in number of correct answers between the two types of studies (repeated measures display vs. correct answers: F=0.236, p=NS). As mentioned in the previous section on time-based performance, H4a describes a one-tailed testing scenario. If applied in the case of correctness, however, the result is that the 'spring' display is still not different. The tail to be tested is that the 2-term display is better than the 3-term display, whereas the data show a marginal superiority of the 3-term 'spring'.

Table 6-16: Correct answers -- comparison of 2-term and 3-term studies

Display Type    F value    Significance
Word            0.126      0.724
Icon List       0.011      0.917
Table           0.475      0.491
Spring          3.238      0.074

6.4.3 Summary

The displays used in this current investigation have been shown to be significantly different in terms of users' performance whether measured by time to completion of a set of tasks or by total correct answers to that battery of tasks. The two measures provided similar information; time to completion, however, appeared to be more sensitive. The variation in timing data, assessed as the standard error of the mean, was larger than the relative standard error for correctness data.

6.5 Performance with Respect to Order of Presentation

This section presents the data that were collected and analyzed to test Hypothesis 2.

H2: The order of presentation of a display does not affect the correctness of answers.

H2a: The number of correct answers will not differ significantly regardless of order of the display presentation.

H2b: Time to completion will not differ significantly regardless of the order of the display presentation.

The order of presentation was randomized. A complete block for the 2-term display series comprised 120 different orderings. The 3-term display series required only 24 orderings; each was administered three times. This design permitted the analysis of display differences presented in the previous section without concern for order effects. Order of presentation was analyzed in the preliminary studies, and the current design supports dissection of this aspect of the overall plan of testing. Once again, data will be presented for each of the measures employed: time in Section 6.5.1 and correctness in Section 6.5.2.
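The block sizes follow from the number of permutations of the displays: 5! = 120 for the five 2-term displays and 4! = 24 for the four 3-term displays. A minimal sketch of enumerating such a complete block is given below; the display labels and the helper names `complete_block` and `position_counts` are illustrative and do not describe the software actually used to assign orderings.

```python
from itertools import permutations

TWO_TERM_DISPLAYS = ("word", "icon", "table", "graph", "spring")
THREE_TERM_DISPLAYS = ("word", "icon", "table", "spring")


def complete_block(displays):
    """Every possible presentation order of the given displays."""
    return list(permutations(displays))


def position_counts(orderings, display):
    """How many orderings place `display` in each serial position."""
    counts = [0] * len(orderings[0])
    for ordering in orderings:
        counts[ordering.index(display)] += 1
    return counts


two_term_block = complete_block(TWO_TERM_DISPLAYS)
# Each 3-term ordering was administered three times.
three_term_sessions = complete_block(THREE_TERM_DISPLAYS) * 3

print(len(two_term_block), position_counts(two_term_block, "spring"))
print(len(three_term_sessions), position_counts(three_term_sessions, "spring"))
```

In a complete block every display occupies every serial position equally often (24 times per position in the 2-term block), which is what allows display effects and order effects to be examined separately.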

If this hypothesis is rejected, it will provide evidence that users can learn how to perform tasks by using alternate display formats. Since the types of tasks being presented to users may not be the types of tasks that they are used to performing, it might be possible that there is some general trend to learning all displays. On the other hand, it might be the case that only displays that are difficult to use when presented early in the series are learned by using other displays, while some displays might be easy to use at first sight. Rejection of hypothesis H2 will allow subsequent analysis for these types of correlations.

6.5.1 Time to completion

Performance as a function of time for the 2-term and 3-term studies is shown in Figures 6-6 and 6-7, respectively. It is clear from Figure 6-6 that each display type was associated with poorer performance when it was presented first in the series. There were progressive decreases in the time that it took the subjects to answer the full set of questions associated with each display. Statistical analysis of the effect of ordering showed that the first point was different from the others, but that subsequent presentations were not different from each other. This finding may seem contrary to the visual appearance of the figure; the later points appear to be steadily decreasing, albeit at a slower rate than between the first and second points. It should be noted that the number of observations at each point is 24 rather than the full 120. That is, each display was seen in each of the five serial positions by 24 subjects. The standard error of the mean of these values was on the order of 10% of the mean. Such variation prevents detection of changes among the later data points.

Figure 6-6: Time to completion with respect to order of presentation for 2-term study

Time data for the 3-term study (Figure 6-7) bear a striking resemblance to the 2-term data. The slopes of the lines, however, are initially steeper. For the word display, the time in the 2-term condition for a display seen first in sequence is 629 seconds, which decreases to about 320 seconds if seen fourth; the corresponding values for the 3-term study are 825 and 330, respectively. The 'spring' display curve appears flatter than the other curves in the 3-term study (490 to 270 vs. 410 to 280). Statistical analysis using multivariate ANOVA showed that the Word, Icon and Table displays were significantly different between the 2-term and 3-term studies, while there was no difference between ordering effects for the 'spring' displays. This suggests that increasingly complex data might be more amenable to visual treatment.

Figure 6-7: Time to completion with respect to order of presentation for 3-term study

 

6.5.2 Correctness

Figures 6-8 and 6-9 show the effect of order of presentation on performance as measured by number of correct answers. The visual appearance was confirmed statistically; there is no effect of ordering on this measure.

Figure 6-8: Effect of order of presentation on number of correct answers for 2-term

Figure 6-9: Effect of order of presentation on number of correct answers for 3-term

This result might seem at variance with the results of the preliminary studies. In the 2-term Boolean study, performance was measured by correctness only, since the test was administered using paper-and-pencil (Morse et al. 1998). In the 3-term Boolean study, where both time and correctness were used to measure performance, both variables showed a significant difference across display types (Section 3.2.2). There are significant differences in the study design between the preliminary and current studies that may give rise to such an apparent discrepancy.

6.5.3 Summary

The order of presentation has a notable effect on time-to-completion but none on number of correct answers. The key observations regarding the time effect are: 1) there is a steep drop in time required between the first and second display regardless of which displays are seen in these slots; and 2) the 'spring' display is handled extremely rapidly in the 3-term condition; the 'spring' display is the only display that is not influenced by the increased complexity of the 3-term condition when compared with the paired 2-term display.

6.6 Performance with Respect to Task Types

The third hypothesis that was proposed for testing in this study is:

H3: Scores for individual subtasks are not significantly different from each other.

H3a: There is no correlation between the number of parameters for an individual subtask and users' performance with the task.

The role of individual tasks chosen from a visual taxonomy and implemented with a known number of parameters will be discussed in this section. Section 6.6.1 will discuss question types independently of display type. Then the results of individual tasks vs. the different displays will be presented in Section 6.6.2.

6.6.1 Effect of question types on performance

Analysis of the overall study design using a repeated measures analysis showed highly significant differences (p< 0.001) between subject performance and question type. Paired contrasts were performed to determine the source of these flagged differences. Arbitrarily, the first question was used as the pivot group. Both performance measures showed a significant difference for each pair of values, except for the 'Distinguish' question for time and the 'Rank' question with respect to correctness. This analysis was run with data pooled from both the 2-term and 3-term studies.

Figure 6-10: Relationship of time to number of correct answers.

Figure 6-10 shows the relationship between time and correctness for each question type; the question types are labeled at each data point. The data for this comparison ignore the values for the graph format since it was available only in the 2-term condition. Therefore, the maximum possible value on the y-axis is four (4). The error bars represent the standard error of the mean. For both the 2-term and 3-term studies, there is an inverse relationship between these measures. In general, questions that are answered quickly are also answered correctly and vice versa. On average across displays, each question takes longer to answer in the 3-term condition than in the 2-term one. On the other hand, the average number of correct answers is not significantly different between the two studies.

The data from which Figure 6-10 was drawn are presented in Table 6-17. Variation is shown in the figure but has been omitted from the table to clarify the presentation and support comparisons across columns. The standard error of both measures averaged 5% of the mean with a maximum departure of 10%. Inspection of Table 6-17 reveals that the Associate, Identify and Rank tasks were performed in very short time periods and were associated with a very high fraction of correct answers. The Cluster, Locate, and some of the Compare tasks were prone to error and took significantly longer to perform.

Table 6-17: Mean time and correctness for individual questions

               2-term                                  3-term
Task Type      Parameter #    Time (sec)    Correct    Parameter #    Time (sec)    Correct
Compare1       2              127.5         3.87       2              216.3         2.90
Associate      2              66.8          3.88       2              126.0         3.72
Distinguish    2              123.5         3.56       2              204.3         2.81
Rank           4              161.5         3.30       4              149.9         3.76
Cluster        5              213.9         0.88       4              212.1         2.07
Correlate      20             139.9         2.04       30             169.6         2.90
Locate         2              200.1         1.81       2              248.3         1.85
Categorize     20             137.6         2.31       30             169.5         1.94
Identify       2              96.8          3.64       2              117.2         3.47
Compare2       2              195.4         2.58       2              182.4         2.82

In order to test the hypothesis that the number of parameters that specify a question determines its complexity, it is necessary to compare the rankings of the measures in Table 6-17 to the parameter number. The information from Table 5-5 is duplicated here to aid in the comparison. Clearly, there is no relationship between these pieces of data. This leads to acceptance of H3a, which asserted that there would be no correlation between parameter number and subjects' performance.
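One way to make the comparison of rankings concrete is a rank correlation between parameter number and a performance measure. The sketch below computes Spearman's rho with average ranks for ties, using the 2-term columns of Table 6-17. The choice of statistic is an assumption made for illustration; the text does not state which procedure, if any, was applied to the rankings, and with only ten tasks any such coefficient has little statistical power.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    ss_x = sum((a - mean_x) ** 2 for a in rx)
    ss_y = sum((b - mean_y) ** 2 for b in ry)
    return cross / sqrt(ss_x * ss_y)


# 2-term columns of Table 6-17, in task order Compare1 ... Compare2.
parameters = [2, 2, 2, 4, 5, 20, 2, 20, 2, 2]
time_sec = [127.5, 66.8, 123.5, 161.5, 213.9, 139.9, 200.1, 137.6, 96.8, 195.4]
correct = [3.87, 3.88, 3.56, 3.30, 0.88, 2.04, 1.81, 2.31, 3.64, 2.58]

print(spearman_rho(parameters, time_sec), spearman_rho(parameters, correct))
```

Because six of the ten tasks share the same parameter number, the ranks are heavily tied, which is a further reason to treat the output as descriptive only.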

6.6.2 Question type vs. display type

Repeated measures analysis of variance of the overall study design showed that there were significant differences between question types depending not only on type of study (2-term or 3-term) but also on individual display types. In order to dissect this difference, the analysis was repeated using a question X display factor ordering instead of the display X question ordering used previously. This manipulation forces generation of the desired within-subjects contrasts. The observed power for this set of comparisons was lower than for the results presented to this point, which had been >0.9.

The results showed that the Icon and/or 'spring' displays were accompanied by significantly faster performance times than the base case (word) display. This difference was found for the Rank, Cluster, Correlate, and Locate tasks (p<0.05). The table, on the other hand, was relatively ineffective at producing fast response times and often was slower than the base case.

The format of a question could have had a significant impact on difficulty over and above the number of parameters. Questions 6, 8, and 10 (Correlate, Categorize and Compare2) were posed in such a way that an answer could have been partly right in many ways. The scoring used for the previous analysis was an 'all-or-nothing' scheme. The correct answer was known and a particular answer could be assigned unambiguously as right or wrong. In the case of questions that had more complex formats, this might have led to underestimating a subject's performance. For instance, the correct answer for Question #6 is 'A and C'. If a subject answered 'A' or 'C', the strict grading process would have marked the answer wrong. Similarly, Question #10 asked subjects to select as many items as were applicable from a set of 21 items. Only a single set of selections is correct, but there are many answers that could be called partly right. A more lenient grading criterion was implemented and tested to determine whether it produced significantly different interpretations of the correctness measure. It is unfortunate that the questions with the highest number of parameters are found in this ambiguous category of question formats. Even when grading was changed to the more lenient scheme, there was no effect on the analysis of the relationship of parameter number and performance.
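The contrast between the two grading schemes can be sketched in a few lines of Python. The strict rule follows the 'all-or-nothing' description above. The lenient rule shown here, set overlap divided by set union, is only an assumed stand-in, since the text does not specify the lenient criterion that was actually used; the function names are hypothetical.

```python
def strict_score(selected, correct):
    """All-or-nothing: full credit only when the selection matches the key exactly."""
    return 1.0 if set(selected) == set(correct) else 0.0


def lenient_score(selected, correct):
    """Partial credit as overlap / union of the two sets (an assumed rule,
    not the criterion used in the study)."""
    selected, correct = set(selected), set(correct)
    if not selected and not correct:
        return 1.0
    return len(selected & correct) / len(selected | correct)


# Question #6: the correct answer is 'A and C'.
key = {"A", "C"}
for answer in ({"A"}, {"C"}, {"A", "C"}, {"B"}):
    print(sorted(answer), strict_score(answer, key), lenient_score(answer, key))
```

Under the strict rule a subject answering only 'A' receives no credit, while the assumed lenient rule awards half credit; both rules agree on fully correct and fully incorrect answers.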

6.7 Preferences

The fifth hypothesis of the current study is:

H5: There is no significant difference among the displays with respect to user preference.

H5a: Users express no preference for displays with which they perform better.

H5b: Users express no preference for displays based on order of presentation.

H5c: Users express similar preference rankings regardless of scenario difficulty.

After performing tasks with each of the display types, subjects were asked to rank the displays. In addition, they were given a free choice area in which they could assign zero or more displays to categories such as 'hard', 'easy', etc. The rankings will be discussed in the next section and the categorization information will be presented and discussed in section 6.7.2. Finally, an analysis of the subjects' free-form comments will be presented.

6.7.1 Rankings

Subjects ranked the displays after using all of them. The results are summarized in Figures 6-11 and 6-12 for the 2-term and 3-term studies, respectively. Analysis showed that there was no relationship between these preference rankings and subject performance, when measured by time to completion. There was, however, a correlation between rankings and correctness for both the 2-term and 3-term groups. In each case, the 'spring' display was preferred by subjects who received high scores when using it. In the 2-term study, the same observation was made for Graph. Finding this relationship between performance and preference leads to rejection of H5a.

Figure 6-11: Preference rankings for 2-term displays

Figure 6-12: Preference rankings for 3-term displays

The rankings were tested for a relationship to the order in which the subject encountered the display type. Non-parametric analysis was used and the results showed no correlation between the position in which any display was seen and any positional ranking assigned by the subjects in either the 2-term or 3-term study. This finding leads to acceptance of H5b.

The final sub-hypothesis related to preferences is the cross-study comparison. In order to compare the studies, the data were adjusted by removing references to the Graph presentation in the 2-term study. The Kruskal-Wallis test was applied to the resultant data and it showed that the rankings for best and for worst display were significantly different (Table 6-18). The inference that can be drawn from these data is that the 'spring' display was preferred more often in the more difficult 3-term study than in the easier 2-term condition.

Table 6-18: Results of Kruskal-Wallis analysis of ranking data with respect to study type

 

               Best     Second   Third    Worst
Chi-Square     6.308    1.389    2.187    26.746
Significance   .012     .239     .139     .000
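The significance values in Table 6-18 can be checked against the chi-square statistics. The sketch below assumes one degree of freedom, which is what a Kruskal-Wallis comparison of two groups (the 2-term and 3-term studies) would give; the degrees of freedom are not stated in the table, so this is an assumption. For one degree of freedom the upper-tail probability reduces to the complementary error function, so only the standard library is needed. The function name is hypothetical.

```python
import math


def chi2_upper_tail_df1(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom.

    With 1 df the variate is a squared standard normal, so
    P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Chi-square values copied from Table 6-18.
chi_square = {"Best": 6.308, "Second": 1.389, "Third": 2.187, "Worst": 26.746}

# Assumed: df = 1 (two study groups compared).
p_values = {rank: chi2_upper_tail_df1(x) for rank, x in chi_square.items()}

for rank, p in p_values.items():
    print(f"{rank:6s}  chi-square = {chi_square[rank]:6.3f}  p = {p:.3f}")
```

Under this assumption the computed probabilities agree with the Significance row of the table to within rounding, with only the Best and Worst rankings falling below 0.05.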

6.7.2 Qualitative Ratings

In addition to ranking the displays, the subjects were given the opportunity to rate the displays as 'Easy', 'Hard', 'Fun', and/or 'Annoying'. Every subject voted in at least one category and many people selected more than one display as exhibiting a certain characteristic. The data are shown in Table 6-19 and are given as percentages to facilitate comparisons between the studies. The differences between the 2-term and 3-term studies were assessed using non-parametric statistics.

Table 6-19: Percentage of subjects categorizing display according to various criteria

                   Easy    Hard    Fun     Annoying
2-term (n=120)
  Word             10      50      8       74
  Icon             51      9       34      6
  Table            39      7       7       21
  Graph            48      13      30      17
  Spring           10      55      15      50
3-term (n=72)
  Word             3       78**    4       89*
  Icon             56      3       17*     4
  Table            33      7       7       17
  Spring           29**    24**    47**    21**

*: p<0.05; **: p<0.01

These data confirm the results of the rankings. As the difficulty of the scenario increased, i.e., 2-term to 3-term condition, the Word display became significantly more difficult to use, while the 'spring' display became more useful. The 'spring' display was perceived in the harder environment to be easier and more fun to use.

6.7.3 Comments

The survey provided a block in which subjects could enter any comments they wished to provide. Analysis of these comments showed that there were basically four types of responses - no comment, explanations for periods of distraction, observations related to the test itself and other types of observations. The following table shows the frequencies for each type of comment and for each test type.

Table 6-20: Comment types for 2-term and 3-term studies

                             2-term   3-term
None                         68       44
Explanation of distraction   17       8
Test-related                 28       17
Other                        8        5

Similar proportions of subjects in each study volunteered each type of comment. Over half the subjects in each group provided no feedback. When distractions were explained, no other type of information was ever added. The items in the 'other' category were mainly statements of overall appreciation for having done the study, for being paid for it, and words of encouragement for the researcher.

Test-related comments deserve closer attention. Once more, several categories could be identified. Several comments were extensive and addressed many issues while most were 1-3 short sentences. The short comments usually identified one or more of the displays by name and attributed positive or negative value to them. This type of information mirrors the data captured in other parts of the survey. Subjects appear to have used it to make very clear their distaste for or pleasure with particular displays. In this context, the word display was noted as a particularly poor display method five times in the 2-term study and three times in the 3-term study. The icon was mentioned as superior once (3-term) and the table once (2-term). The graph display used in the 2-term study was given good marks by one subject and was cited as being poor by two. The 'spring' engendered the most comments - both good and bad. Six subjects in the 2-term study disliked it and one thought it was clearly the best. In the 3-term study, there were 3 favorable reports about the 'spring' compared with 4 negative ones. This more positive response to the 'spring' parallels the observations that were made in the previous section regarding preferences.

The substantive comments in the test-related group addressed issues related to the design of the study. The subjects who wrote these comments were not significantly different from non-commenters with respect to age, area of study, level of education, computer experience or expertise, test performance, or personal display preferences. Their remarks were professional in tone and provided valuable responses about format of questions, sources of confusion and other topics related to research design and pedagogy. The content of these notes will be useful when the current studies are extended in the future.

6.8 Summary of Results

This section is intended to show in a condensed form the results presented on prototype display evaluation. Hypotheses are listed and a notation indicates whether a particular hypothesis or sub-hypothesis was accepted or rejected based on the analysis of the data.

H1: Performance of tasks will not differ significantly regardless of display used.

H1a: The number of correct answers will not differ significantly regardless of display used.

H1b: Time to completion will not differ significantly regardless of display used.

H2: The order of presentation of a display does not affect the correctness of answers.

H2a: The number of correct answers will not differ significantly regardless of order of the display presentation.

H2b: Time to completion will not differ significantly regardless of the order of the display presentation.

H3: Scores for individual subtasks are not significantly different from each other.

H3a: There is no correlation between the number of parameters for an individual subtask and users' performance with the task.

H4: The number of terms depicted in a display does not affect either time to perform a task set nor the number of correct answers.

H4a: Each three-term display is associated with poorer performance than the paired two-term display.

H5: There is no significant difference among the displays with respect to user preference.

H5a: Users express no preference for displays with which they perform better.

H5b: Users express no preference for displays based on order of presentation.

H5c: Users express similar preference rankings regardless of scenario difficulty.

Table 6-21: Summary of hypothesis testing outcomes

 

2-term

3-term

2- vs. 3-term

H1a: Display differences: Correctness

Reject

Reject

H1b: Display differences: Time to completion

Reject

Reject

H2: Presentation Order

Reject

H2a: Correctness

(Accept)

(Accept)

H2b: Time to completion

Reject

Reject

H3: Correctness for subtasks

Reject

Reject

H3a: Correlation with parameter number

(Accept)

(Accept)

H4: Cross-study comparison of display types

Reject

H4a: 3-term inferior to 2-term

Reject

H5a: Preferences vs. Performance

Reject

Reject

H5b: Preferences vs. Order of presentation

(Accept)

(Accept)

H5c: Preferences vs. Scenario difficulty

Reject

Table 6-21 presents a statistical view of the data. By way of interpretation, the data show that

7 Discussion and Conclusions

At this point in the development of this report, the key literature related to evaluation of information retrieval visualizations, especially those based on vector representations of documents, has been reviewed. This exposition of the literature led to the conclusion that the approach of all prior evaluations was flawed in that each study attempted to evaluate an interface rather than a particular display representation. The question that needed to be answered is: what do people understand about a document set given a particular way of viewing the documents and relations among them? The preliminary studies reported in chapter 3 used Boolean representations of documents and tested subjects using Boolean test questions. These studies constitute the foundation of a bottom-up testing paradigm that has been extended in the studies carried out in this dissertation work. The rationale for this extension was presented in chapter 4, the methodology in chapter 5 and the results in chapter 6. This discussion will attempt to compare and contrast the current set of results with those of the preliminary studies and, to the extent possible, with the pre-existing literature discussed in chapter 2.

7.1 Demographics

Before proceeding to compare the study results, it is important to describe any differences in the subject populations that were drawn on for these experiments. The data shown in Section 6.1 show that the groups inducted into the 2-term and 3-term vector studies were equivalent in composition in terms of all the variables examined. This distribution information validates the randomization procedure that was used. Comparisons between these sets of vector studies can, therefore, be pursued without misgivings about the appropriateness of doing so.

The subjects recruited for the Boolean studies were quite different in many respects from the subjects of the vector studies. The subjects for the preliminary studies performed the paper-and-pencil tests during class periods of various undergraduate courses of the Department of Information Science & Telecommunications and Molde College. This mechanism for testing subjects in what were termed 'pilot' studies was useful since it allowed access to large numbers of students who by-and-large were novices to the visualizations being investigated. There was a conscious effort to expand the pool of subjects when the vector studies were being planned. The preliminary studies had shown that differences between subjects were small compared with differences among the test variables. This fact allowed consideration of broadening the study population without much risk to the sensitivity of the measurements that would be made.

The key differences between the Boolean and vector subjects are related to age and level of education. Both of these factors are related to the exclusive use of undergraduate students in the preliminary studies and a wider range of people in the later studies. The age range for the 2-term Boolean subjects was 32 years. The 3-term Boolean study group age range was 35 years. The 2- and 3-term vector groups had age ranges of 41 and 51 years, respectively.

Other differences include a lower proportion of subjects reporting that English was their second language in the vector studies than in the Boolean studies (13% vs. 21%). The smaller number of these individuals makes comparison with the previous such subjects inappropriate. There had been a large contingent of students from Molde College inducted into the Boolean experiments. Since the subjects were to be paid for their participation in the vector studies, a decision was made to exclude subjects who could not arrange to come to collect payment. Since the performance of Norwegian subjects was not different from the rest of the original test groups, their absence from the vector study test groups would not be expected to make a difference.

The following conclusions can be made regarding the demographic composition of the subject groups used:

7.2 Differences among Display Types

The preliminary Boolean studies showed that prototype displays had varying abilities to support user task performance. The prototypes were based on plain text, tables, lists of iconic representations, and, for 2-term studies, a graphical display. In addition, a display was created based on the VIBE algorithm (Olsen 1992, 1993) for portraying a set of documents in a space defined by a set of keyterms. Since VIBE positioning is based on the characteristics of the physics of a spring, the prototype has been referred to as a 'spring' display. When two keyterms are chosen for display, the picture shows a line; when three keyterms are chosen, it shows a triangle. These elemental types can be used with minor modification to visualize data that has weighted or vector characteristics.
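The spring analogy can be illustrated with a short Python sketch. It assumes that each keyterm is a fixed anchor point and that a document comes to rest where springs of zero rest length, with stiffness proportional to the document's weight on each keyterm, balance; that equilibrium is the weight-averaged position of the anchors. Whether the prototype used exactly this rule is not stated here, so the function name, anchor coordinates and weights below are illustrative assumptions only.

```python
def spring_position(weights, anchors):
    """Weight-averaged position of the keyterm anchors (assumed placement rule).

    weights: one non-negative weight per keyterm for a single document
    anchors: one (x, y) screen position per keyterm
    """
    total = sum(weights)
    if total == 0:
        return None  # the document matches none of the displayed keyterms
    x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
    y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
    return (x, y)


# Two keyterms: documents fall on the line between the two anchors.
line = [(0.0, 0.0), (1.0, 0.0)]
print(spring_position([1, 1], line))   # equal weights: the midpoint
print(spring_position([3, 1], line))   # pulled toward the first keyterm

# Three keyterms: documents fall inside the triangle of anchors.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
print(spring_position([1, 1, 1], triangle))
```

With Boolean data every weight is 0 or 1, so documents can occupy only a few distinct positions; with vector data the weights vary continuously and documents spread along the line or across the triangle.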

In the multidimensional scaling done by Lohse (1990) and discussed earlier, these displays could be ranked along the 'data type' dimension as: Text -> icons -> graph -> table -> 'spring'. Using the 'cognitive effort' dimension, the ordering would be: graph -> text -> table -> icons -> 'spring'. A third ranking of the prototypes used in these studies could be developed by rating them according to the type of symbols they contain; this contention would generate the following order: text -> tables -> graphs ~ icons -> 'spring'. Such an ordering captures an increasing amount of visual information as opposed to reading information. The symbols in text are the letters and words. In tabular presentations, numbers stand as surrogates for the words. Graphs use positioning as a substitute and icons use color and shape. The 'spring' display uses size, shape, position, and color.

The preliminary studies provided evidence that users could indeed use each of these display types effectively, albeit with differing degrees of success. The measure of performance in these early studies was based on how many correct answers the subjects produced. This constraint was due to the nature of the testing situation, which was paper-and-pencil. Data on time to completion would have required individual observation. The idea behind the strategy for collecting data was to sample broadly from large numbers of subjects rather than deeply from a necessarily smaller group. The use of a large population enables robust predictions, if differences can be detected at all.

Later series of measurements added the ability to measure time by using computer delivery of test materials to the subjects and computer collection of data from them. This methodology overcame the problems of paper-based tests without sacrificing the power that comes from using a large population for testing.

A short period of interim testing using about 60 subjects was used to validate the shift from paper to computer. Subjects in those studies showed similar performance based on number of correct answers. Additionally, the timing data that were available for the on-line group showed highly significant differences among the displays.

The current studies use this computer-based method only. This choice is somewhat fortuitous in that the correctness of answers has been less sensitive to differences between displays. This is true for the 3-term Boolean displays as well as for both the 2- and 3-term vector-based displays. The 3-term Boolean tests presented users with a single display about which four questions were to be answered. All four questions were presented simultaneously, which allowed collection of only aggregate timing data. The timing of these tests showed that the four questions were answered in approximately 150 seconds for the 'spring' and table displays while the icon and word lists required an additional 30 seconds. On average, subjects needed 45 seconds to answer a single question.

The vector studies required a different document set for each question. For example, if the task was to find an outlier in the document set, then there had to be a definite outlier in the set. This condition would have been impossible to achieve if the same display had to support unambiguous determination of clustering. Therefore, display-question pairs were developed. Each subject had to view 10 different patterns for each display type and had to answer a single question about each one. Subjects in these studies needed about 300 seconds for the 'spring' display and up to about 600 seconds for the other displays. On average, the amount of time per question is very similar to that found in the Boolean study - 45 seconds in each case.

Differences in the various study designs preclude a statistical analysis of the data across studies. The design differences include the measures used for assessing performance, the number of different displays shown per question, and the number and type of tasks. Nevertheless, such a comparison, if only in qualitative terms, could serve as a key to the kinds of results that have been amassed using this bottom-up layered testing paradigm. It could also serve as a basis for determining the validity of this testing model, showing strengths and weaknesses. The following table shows such a summary.

Table 7-1: Comparison of display effectiveness across all studies

# terms   Data type   Measure       Ranking (Best to Worst)
2         Boolean     Correctness   Icons > Text > Spring = Table = Graph
3         Boolean     Correctness   Icons > Table = Word > Spring
                      Time          Spring > Table > Icons > Word
2         Vector      Correctness   Graph > Spring = Icons = Table > Word
                      Time          Spring > Graph > Icons > Table > Word
3         Vector      Correctness   Spring = Icons = Table > Word
                      Time          Spring > Icons > Table > Word

Clearly, this comparison shows that Word and Text displays were always associated with poor performance when time to perform a series of tasks was the measure. Just as clearly, the 'spring' display is superior in producing quick responses. It is important to note that these findings hold regardless of the difficulty of the test or the type of data being rendered. Performance across trials with respect to correctness presents a less clear picture. In all cases, it appears that the questions being posed were of insufficient difficulty to elicit clear-cut performance differences among the groups of subjects. The mean score for all studies was about 85%, which is very high. In such situations, it is possible that the 15% error rate could be due to unintentional causes.

The fact that similar error rates were encountered with drastically different performance times indicates that the 'spring', an instance of visual display, supported rapid extraction of information contained in that display at an optimal level of accuracy. If the 'spring' display had been only superior with respect to speed but had led to more wrong answers, then the question would still be open about whether visual displays are superior. The only stronger case would have been if both measures had shown a positive effect.

Conclusions that might be drawn from analyzing the effect of display type on performance include:

7.3 Effect of Order of Presentation

All the studies, current and preliminary, used enough subjects that display presentation order could be controlled through randomization. Enough subjects were tested, moreover, to allow the effect of presentation order on performance to be examined. The data were presented in Section 6.5. In each of the studies, there was a clear effect of order of presentation on performance using whatever the primary measure was. The 2-term Boolean study showed that each of the displays was difficult to use if it was the first display that was seen by a subject. The data showed that each display became easier to use the later it was presented in the series; there was no significant difference between the fourth and fifth positions, indicating that a plateau was reached. The 3-term Boolean study showed that the first display that a subject saw was accompanied by diminished performance, while later displays were used at a plateau level of performance. This conclusion was based on the number of correct answers.

Neither of the vector studies showed an effect on number of correct answers but instead did show differences with respect to the amount of time that it took to answer a battery of test questions. The subjects who performed the 3-term Boolean study in the computer-based condition also showed evidence of a time effect, which paralleled the effect seen for this group with respect to number of right answers - a plateau was reached after the second display regardless of which type was seen.

There appears to be a discrepancy among all these data. What could cause 'correctness' to be sensitive to differences in the 2-term Boolean study, less so in the 3-term Boolean study and not at all in the vector experiments? Part of the answer to this question could be related to the design of the individual experiments, especially with respect to the number of tasks that were associated with each display. In the 2-term Boolean study, there were 5 different displays and two questions about each of them, making a total of 10 questions. The 3-term Boolean study had only 4 display types (removal of graph from series), but four tasks were presented for each of them - 16 questions. The vector studies presented the same number of display types as in the paired Boolean study, 5 and 4, respectively. There were 10 questions associated with each of these display types and, additionally, separate actual display variants were shown for each question. A qualitative analysis of these features of the different studies shows that a plateau developed after about eight questions were answered in the 2-term Boolean study. In the 3-term Boolean study, the plateau occurred after the second display, i.e., after eight questions. In the vector experiments, there was no discernable effect of ordering on correctness. If, however, one assumes that a subject becomes immune to excess errors after seeing and answering eight questions, then the possibility exists that there was a correctness effect but it was totally within the first display type!
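The arithmetic behind this argument can be laid out explicitly. The sketch below takes the questions-per-display counts given above and the eight-question plateau suggested by the qualitative analysis, and reports how many display types a subject would have seen by the time the plateau was reached. The variable names are hypothetical.

```python
import math

PLATEAU_QUESTIONS = 8  # plateau point suggested by the qualitative analysis

# Questions asked per display type in each study, as described in the text.
questions_per_display = {
    "2-term Boolean": 2,
    "3-term Boolean": 4,
    "2-term vector": 10,
    "3-term vector": 10,
}

# Number of display types seen by the time eight questions have been answered.
displays_to_plateau = {
    study: math.ceil(PLATEAU_QUESTIONS / per_display)
    for study, per_display in questions_per_display.items()
}

for study, n in displays_to_plateau.items():
    print(f"{study}: plateau reached within display {n}")
```

The counts are 4, 2, 1 and 1, respectively, which matches the plateau after the fourth display in the 2-term Boolean study and after the second display in the 3-term Boolean study. In the vector studies the plateau would fall inside the first display type, where an order-of-presentation comparison between displays cannot detect it.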

In order to explore this possibility, an attempt was made to correlate the probability of a right answer for each individual question with the time that it took a subject to answer this question. The data, however, were too coarse-grained to allow a conclusion. Another confounding factor in this analytical approach is that the questions had been purposely chosen to be different from each other in type. This arrangement of questions produced a situation in which the probability of a right answer was essentially unrelated to order of presentation. That is, hard questions and easy questions were intermixed so that trying to overlay them with another ordering was impossible.

The conclusion that can be drawn from cases in which correctness was a suitable measure is that subjects learned from one display to another regardless of the display type. This finding reduces to saying that the subjects learned how to answer questions just by answering questions; the tool that they used to do this made no difference.

A similar inference can be made regarding the effect of time to completion in the cases where this type of data is available. There is a single exception to this generalization and it concerns the 'spring' display in the vector studies. In general, subjects became quicker at using a display if it came later in the series when compared with early presentations. The 'spring' display, when shown in the 3-term vector study, required significantly less time to use even when shown to a subject as a first display type. Subjects performed at almost peak with respect to the 'spring' display. This is a finding that confirms a conjecture of the Boolean studies. As the testing situation becomes more difficult, i.e., more terms and more documents, visual methods are easier to use.

Some conclusions that may be drawn from the study of presentation order include:

7.4 Effect of Using Tasks Created from a Visual Taxonomy

The tasks performed by subjects in the Boolean studies were implemented as Boolean tasks, that is, they were constructed as sequences of AND and OR segments. The vector studies presented a challenge, since Boolean tasks were not appropriate when the underlying data were not intrinsically Boolean. A visual taxonomy developed by Wehrend & Lewis (1990) and refined and parameterized by Zhou & Feiner (1998) was employed in developing a set of tasks for the 2-term and 3-term vector studies. This decision, in retrospect, was a good one although there are some aspects of the implementation that deserve discussion.

The method used to map from the taxonomic categories to a test task set was a 3-step mapping. First, generalized questions that conformed to the descriptions of a taxonomic category were generated for the IR domain. For instance, a Locate task in the IR domain was interpreted to mean 'where would a document with certain characteristics be found in the display'. This process continued for each category. The second step matched particular 'spring' displays with each generalized statement. And the final step was to create an actual test question based on the keyterms used in the display and taking into account the features of the display, such as clusters, outliers, or gaps.

At each stage in this process, the possibility existed of producing a task that might be ambiguous or misleading. The fact that the subjects performed with a very high degree of accuracy tends to suggest that the questions were well-formed by-and-large. The results also showed that there were very sharp differences among the various tasks with respect to the amount of time that it took to answer the questions. This difference in time is indicative of a level of difficulty.

A confounding factor that was not anticipated was the interaction with the actual format of a question. As alluded to several times in the Results section, a question that is posed to determine the level of a subject's knowledge may be asked in many different ways. The notorious questions found on standardized exams that have answers such as 'A', 'B', 'C', 'A and B', 'A and C', or 'All of the above' are very convoluted. In fact, a student who misses a question in this format may well possess sufficient knowledge to answer many more straightforward questions about the same material. The relevance to the current situation is that some of the final mappings produced tasks that were of a somewhat involved nature. These questions were less frequently answered correctly. Unfortunately, these particular tasks were those that were associated with a greater number of parameters, which led to an inability to test hypothesis H3a properly. H3a was set up to correlate the number of parameters with performance.

Although the use of the visual taxonomy failed at the lowest level, it succeeded in producing task sets that supported superior performance when visual displays were used. This finding differs from that of the preliminary Boolean studies, in which the 'spring' display was associated with poorer performance than most of the other displays. That visual tasks are performed better with visual displays has not been demonstrated previously, making it novel information. When this statement is viewed in light of the conclusion regarding learnability of displays, it seems that the challenge lies in the development of interfaces that encourage subjects to ask questions in a visual manner. Before Boolean retrieval systems existed, people would have had a hard time phrasing Boolean queries. Even today Boolean systems are difficult for many people to use. Systems that support visual inquiry will need to allow users to phrase questions visually. In developing such systems, the contention could be made that a visual taxonomy would provide a useful, if not necessary, set of guiding principles.

The major conclusions regarding the utility of a visual taxonomy are:

7.5 Preferences

Of all the data gathered in these experiments, preference information is the most consistent. In all studies, subjects were asked to rank the displays that they had used. In the vector studies, subjects were also asked to rate the displays according to several qualitative categories - 'hard', 'easy', 'fun' and 'annoying'. A third kind of preference information was collected from the optional comments that subjects could provide on the posttest survey form found in the vector format only. The three methods for gathering information in the vector studies produced the same results.

 

Table 7-2: Subject preferences across studies (percent of subjects)

 

 

 

                            Word   Icons   Table   Graph   Spring
Boolean   2-term   Best     14     33      15      9       29
                   Worst    47     8       15      20      9
          3-term   Best     10     43      12      -       35
                   Worst    52     6       16      -       26
Vector    2-term   Best     7      35      25      32      2
                   Worst    45     2       2       9       42
          3-term   Best     3      51      26      -       19
                   Worst    74     3       28      -       22

At first glance, these data might seem to indicate that the visual prototype used in this study was highly distasteful to the subjects. Closer examination shows clearly that the 'spring' display was significantly more appreciated in the more complex 3-term situation than in the easier 2-term paired condition. Subjects not only ranked it less often as the 'worst' display but also ranked it significantly more often as the 'best' display. This observation is especially noteworthy when viewed in the context of how subjects performed on the tests in which they preferred the visual display. In these studies, there was a positive correlation between performance and preference. It appears, therefore, that subjects like to use things that make them successful. In the context of developing interfaces to assist users in exploring document spaces, it seems that making interfaces that can be used successfully will be met with acceptance by those users.

In each test scenario, icon displays were very highly valued by the subjects. It is interesting to note that several of the interfaces that have been developed for IR systems use displays that incorporate icons of the sort embodied by the icon prototype of this study. TileBars (Hearst 1995) and SIRRA (Aalbersberg 1995) are clearly relatives of the prototype icon display. In addition, unattributed instances appear in the reports of various Web search engines. These visual displays are based on bar graphs and/or histograms and are very familiar to the average user of systems, which contributes to their utility.

The most notable observation from the preference data is that 'text' and 'word' displays were strongly disliked, regardless of performance. In the 2-term studies, performance with this display type was very good, yet user acceptance was very low. This matches the common anecdotal experience of users of Internet search engines and of library searchers. Text alone may be sufficient to allow users to solve problems when browsing, but text alone is unsatisfying.

Conclusions regarding preference data include indications for interface design:

8. Summary of Future Directions

The continued utility of a bottom-up, layered testing paradigm has been demonstrated. The method has been employed in Boolean and vector scenarios, and both were tested at two levels of difficulty, i.e., numbers of keyterms. With this framework in place, there are many directions in which research could proceed.

Considering that the method is layered, the first possible avenue for future development would be to increase the number of keyterms in the displays. The increase to four terms would tax the power of the VIBE-based 'spring' display. Koshman's evaluation of the VIBE interface (Koshman 1996), however, used displays of this level of difficulty with some success. The fact that visual displays offer the most support when the number of items to be displayed is large seems to imply that this is an area that must be investigated.

Secondly, 3-dimensional displays were purposely ignored in the current testing. The rationale for not creating a 3-dimensional graph display was that interfaces that attempt to render 3-D information are inherently complex and much less well understood than 2-D spatial displays. Many recent IR visualizations rely on volume displays (Cugini 1996; Benford et al. 1995; Hemmje, Kunkel & Willet 1994). It is important that a rational evaluation plan be used to determine whether these or other potential 3-D candidates are truly useful.

A third area to which this testing plan could be applied involves moving these displays back into interfaces. The approach could accommodate feature-by-feature addition. Performance enhancements or degradations could be monitored so that optimal interfaces could be produced. Each feature or mode would have a reason for being included; these interfaces would be economical in terms of implementation and maintenance time from a developer's point of view and should enhance rapid learning from a user's point of view.

The case has been made here for the utility of visualizations for supporting information retrieval activities. However, it is clear that not all tasks that an information seeker might need to perform can be satisfied with visual methods. The use of a visual taxonomy in these studies provided a way to deal with the complexity of visual tasks. It would be very desirable to have a parallel series of text-based tasks. With these two categorizations -- visual and text-based tasks -- work could proceed to delineate the characteristics of full IR systems. The integration of visual and non-visual components could be structured rather than being a matter of happenstance.

The application of the visual taxonomy described in this paper should be tested more rigorously. This plan should include attention to details of question formats. In fact, it would be highly desirable to perform this activity as the focus of a whole study rather than as a part of any larger work. The reasons for this suggestion include the breadth of the taxonomy itself. It is important to test as many of the groupings as possible. In addition, internal validity needs to be assessed by replicating question types with varying formats.

Another interesting study to contemplate is a longitudinal investigation of users of visual systems and their ability to use these systems. Just as people have grown accustomed to using Boolean queries and just as generations of children have been taught to interpret tabular and graphical information, perhaps as visual systems proliferate there will be a concomitant increase in people's fluency in asking visually-based questions. Such a study needs to be started soon if we expect to find users who have little visual information experience.

Finally, this study has demonstrated the feasibility of administering tests in an on-line format. All aspects of the generation of test materials and collection of data were automated. Almost 200 subjects were tested in less than 2 weeks. The randomization procedure produced precisely the desired groupings. No data was lost, corrupted, or otherwise compromised. Of course, not all testing can or should be done without ever observing individual subjects at work. An interesting final proposal for testing involves merging the current on-line approach, which gathers information about large numbers of users, with an observational methodology. The data from observational studies could be used to build models of user search behaviors and the online administration could be used to validate these models. The combined method might be better than either of them alone.

BIBLIOGRAPHY

Aalbersberg, I.J. 1995. Personal communication in Nuchprayoon (1996).

Bates, M. 1989. A 'berrypicking' model of information retrieval. Online Review 13(5): 408-424.

Beaulieu, M., S. Robertson and E. Rasmussen. 1996. Evaluating interactive systems in TREC. JASIS 47(1): 85-94.

Belkin, N.B., C. Cool, A. Stein and U. Thiel. 1995. Cases, scripts, and information-seeking strategies: on the design of interactive information retrieval systems. Expert Systems with Applications 9(3): 379-395.

Benford, S.D., D. Snowdon, C. Greenhalgh, R. Ingram, I. Knox and C. Brown. 1995. VR-VIBE: a virtual environment for co-operative information retrieval. Eurographics '95, 30th August - 1st September, Maastricht, The Netherlands, 349-360.

Bertin, J. 1983. Semiology of Graphics. W. Berg (Translator). University of Wisconsin Press, Madison.

Blair, D.C. 1988. An extended relational retrieval model. Information Processing & Management 24(3): 349-371.

Blair, D.C. and M.E. Maron. 1985. An evaluation of retrieval effectiveness for a full-text document retrieval system. Communications of the ACM 28(3): 289-299.

Brodlie, K.W. 1992. Visualization Techniques. In Scientific Visualization - Techniques and Applications, K.W. Brodlie, L.A. Carpenter, R.A. Earnshaw, J.R. Gallop, R.J. Hubbold, A.M. Mumford, C.D. Osland and P. Quarendon (editors), Springer-Verlag, chapter 3, pp. 37-86, 1992.

Card, S.K., G.G. Robertson, and W. York. 1996. The WebBook and the WebForager: an information workspace for the World Wide Web, CHI 96, ACM Conference on Human Factors in Software, ACM Press, New York. 111-117.

Card, S.K., G.G. Robertson, and J.D. Mackinlay. 1991. The information visualizer, an information workspace. Proceedings of ACM Human Factors in Computing Systems Conference (CHI '91), 181-188.

Chalmers, M. 1993. Using a Landscape to represent a corpus of documents, Springer-Verlag Proceedings of COSIT '93, Elba, pp. 377-390.

Chalmers, M. 1996. A linear iteration time layout algorithm for visualising high-dimensional data. Proceedings of IEEE Visualization '96, 127-132.

Chen, H., T. Yim, and D. Fye. 1995. Automatic thesaurus generation for an electronic community system. JASIS 46(3): 175-193.

Cleveland, W.S. 1985. The Elements of Graphing Data. Wadsworth Advanced Books and Software, Monterey, CA.

Croft, W.B. and T.J. Parenty. 1985. A comparison of a network structure and a database system used for information retrieval. Information Systems 10(4): 377-390.

Croft, W.B., H.R. Turtle, and D.D. Lewis. 1991. The use of phrases and structured queries in information retrieval. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, 32-45.

Crouch, C.J. 1988. An analysis of approximate versus exact discrimination values. Information Processing & Management 24:5-16.

Crouch, D. and R.R. Korfhage. 1990. The use of visual representations in information retrieval applications. In T. Ichikawa, E. Jungert, & R. R. Korfhage (Eds.), Visual Languages and Applications, New York, Plenum Press, 305-326.

Cugini, J., C. Piatko, and S. Laskowski. 1996. Interactive 3D visualization for document retrieval. NIST publication.

Cutting, D.R., D.R. Karger, J.O. Pedersen and J.W. Tukey. 1992. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Copenhagen), 318-329.

Deerwester, S., S.T. Dumais, G.W. Furnas, and T. K. Landauer. 1990. Indexing by latent semantic analysis. JASIS 41(6): 391-407.

DeFazio, S., A. Daoud, L.A. Smith, and J. Srinivasan. 1995. Integrating IR and RDBMS using cooperative indexing. Proceedings of SIGIR '95, 84-92.

Dubin, D. 1995. Document analysis for visualization. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 199-204.

Furnas, G.W. 1994. High dimensional representations and information retrieval. In New Approaches in Classification and Data Analysis, edited by E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and R. Burtschy. Springer-Verlag, Berlin.

Furnas, G.W., T.K. Landauer, L.M. Gomez, and S.T. Dumais. 1987. The vocabulary problem in human-system communication. Communications of the ACM 30(11): 964-971.

Hahn, U. 1990. Topic parsing: accounting for text macro structures in full-text analysis. Information Processing & Management 26:135-170.

Harman, D. 1992. Ranking algorithms. In Information Retrieval: Data Structures & Algorithms, W.B. Frakes & R. Baeza-Yates, editors. Prentice-Hall, Upper Saddle River, NJ, pp. 363-392.

Hearst, M. and J. Pedersen. 1996. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In Proceedings of the 19th Annual International ACM/SIGIR Conference, Zurich, August.

Hearst, M., J. Pedersen, and D. Karger. 1995. Scatter/gather as a tool for the analysis of retrieval results, Working Notes of the AAAI Fall Symposium on AI Applications in Knowledge Navigation, Cambridge, MA, November.

Hearst, M.A. 1994. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June.

Hearst, M.A. 1994. Using categories to provide context for full-text retrieval results. In Proceedings of RIAO '94, New York.

Hearst, M.A. 1995. TileBars: visualization of term distribution information in full text information access. CHI '95 Proceedings, 213-220.

Hearst, M.A. and C. Plaunt. 1993. Subtopic structuring for full-length document access. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, 59-69.

Hemmje, M., C. Kunkel, and A. Willet. 1994. LyberWorld - a visualization user interface supporting fulltext retrieval. In: Proceedings of ACM SIGIR '94, July 3-6, Dublin, 249-259.

Kennedy, J.B., K.J. Mitchell and P.J. Barclay. 1996. A framework for information visualisation. SIGMOD Record, 25(4): 30-34.

Kim, H. & R.R. Korfhage. 1994. BIRD: Browsing Interface for the Retrieval of Documents. In Proceedings of the 1994 IEEE Symposium on Visual Languages, St. Louis, 176-177.

Korfhage, R. 1991. To see, or not to see - is that the query? In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM SIGIR, Association for Computing Machinery, 134-141.

Korfhage, R.R. 1997. Information Storage and Retrieval. John Wiley & Sons, New York. pp. 349

Koshman, S. 1996. Usability testing of a prototype visualization-based information retrieval system. Dissertation, University of Pittsburgh.

Larkin, J. and H. Simon. 1987. Why a diagram is (sometimes) worth 10,000 words. Cognitive Science 11: 65-99.

Lewis, D.D. and K. Sparck-Jones. 1996. Natural language processing for information retrieval. Communications of the ACM 39(1): 92-101.

Liddy, E.D., W. Paik, and M. McKenna. 1995. Development and implementation of a discourse model for newspaper texts. In Proceedings of the AAAI Symposium on Empirical Methods in Discourse Interpretation and Generation. Stanford, CA.

Lin, X. 1997. Map displays for information retrieval. JASIS 48(1): 40-54.

Lin, X. 1991. A self-organizing semantic map for information retrieval. Proceedings for the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Oct. 13-16; Chicago, IL), 262-269.

Lin, X. 1992. Visualization for the document space. Proceedings Visualization '92, IEEE Computer Society Press, Los Alamitos, CA., 274-281.

Lin, X., D. Soergel, and G. Marchionini. 1991. A self-organizing semantic map for information retrieval. SIGIR '91, 262-269.

Lohse, G., H. Rueter, K. Biolsi, and N. Walker. 1990. Classifying visual knowledge representations: a foundation for visualization research. Visualization '90: Proceedings of the First Conference on Visualization, 131-138.

Losee, R.M. and S.W. Haas. 1995. Sublanguage terms: dictionaries, usage, and automatic classification. JASIS 46(7): 519-519.

Lovins, J. 1968. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11: 22-41.

Lynch, C.A. and M. Stonebraker. 1988. Extended user-defined indexing with applications to textual databases. Proceedings VLDB, 306-317.

Mackinlay, J.D., R. Rao, and S.K. Card. 1995. An organic user interface for searching citation links. CHI '95, 67-73.

Marchionini, G. 1992. Interfaces for end-user information seeking. Journal of the American Society for Information Science 43(2): 156-163.

Marchionini, G. 1995. Information Seeking in Electronic Environments. New York: Cambridge University Press.

McLeod, I.A. and R.G. Crawford. 1983. Document retrieval as a database application. Information Technology: Research and Development 2: 43-60.

Morse, E., and M. Lewis. 1997. Why information visualizations sometimes fail. Proceedings of IEEE International Conference on Systems Man and Cybernetics, Orlando, FL, October 12-15, 1997.

Morse, E., M. Lewis, R.R. Korfhage, and K. Olsen. 1998. Evaluation of text, numeric and graphical presentations for information retrieval interfaces: User preference and task performance measures. Proceedings of IEEE International Conference on Systems Man and Cybernetics, San Diego, CA, October 11-14, 1998.

Newby, G.B. 1992. An investigation of the role of navigation for information retrieval. Proceedings of ASIS '92, 20-25.

Newby, G.B. 1996. Metric multidimensional information space. In: TREC-5 Proceedings. Gaithersburg, MD: National Institute of Standards and Technology.

Nuchprayoon, A. 1996. GUIDO: A usability study of its basic retrieval operations. Doctoral Dissertation, School of Information Sciences, University of Pittsburgh.

Nuchprayoon, A. & R.R. Korfhage. 1994. GUIDO, A Visual Tool for Retrieving Documents. In Proceedings 1994 IEEE Computer Society Workshop on Visual Languages, St. Louis, 64-71.

Olsen, K.A., J.G. Williams, K.M. Sochats, and S.C. Hirtle. 1992. Ideation through visualization: the VIBE system. Multimedia Review 3(3): 48-59.

Olsen, K.A., K.M. Sochats, and J.G. Williams. 1997. Full text information retrieval and information overload. Accepted by the International Information and Library Review.

Olsen, K.A., R.R. Korfhage, M.B. Spring, K.M. Sochats, and J.G. Williams. 1993. Visualization of a document collection: The VIBE system. Information Processing and Management. 29(1): 69-81.

Pejtersen, A.M. 1988. Search strategies and database design for information retrieval in libraries. In L.P. Goodstein, H.B. Andersen & S.E. Olsen (Eds.), Tasks, Errors and Mental Models, Hampshire, England: Taylor & Francis, pp. 171-192.

Porter, M. 1980. An algorithm for suffix stripping. Program 14(3): 130-137.

Raghavan, V.V. and S.K.M. Wong. 1986. A critical analysis of vector space model for information retrieval. JASIS 37(5): 279-287.

Rau, L.F. and P.S. Jacobs. 1991. Creating segmented databases from free text for text retrieval. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery. 337-346.

Rogowitz, B.E. and L.A. Treinish. 1993. An architecture for rule-based visualization. Proceedings of IEEE Visualization '93, San Jose, CA, October 1993, IEEE Computer Society Press, Los Alamitos, CA, 236-243.

Rose, D.E. and R.K. Belew. 1991. A connectionist and symbolic hybrid for improving legal research. International Journal of Man-Machine Studies 35(1): 1-33.

Salton, G. 1986. Another look at automatic text-retrieval systems. Communications of the ACM 29(7): 648-656.

Salton, G. 1989. Automatic Text Processing: the Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, pp. 530.

Salton, G. and C. Buckley. 1991. Automatic text structuring and retrieval: Experiments in automatic encyclopedia searching. Proceedings of SIGIR, 21-31.

Schütze, H., D.A. Huff, and J.O. Pedersen. 1995. A comparison of classifiers and document representations for the routing problem. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press. 229-237.

Shneiderman, B. 1996. The eyes have it: a task by data type taxonomy for information visualizations. Proceedings of IEEE Symposium on Visual Languages, Boulder, CO, September 3-6, 336-343.

Spink, A. 1997. Information science: a third feedback framework. Journal of the American Society for Information Science 48(8): 728-740.

Spoerri, A. 1993. InfoCrystal: A Visual Tool for Information Retrieval. In Proceedings Visualization '93, San Jose, CA, 150-157.

Spoerri, A. 1993. Visual tools for information retrieval. Proceedings of the 1993 IEEE Symposium on Visual Languages. Bergen, Norway. Los Alamitos, CA: IEEE Computer Society Press, 160-168.

Stanfill, C. and D.L. Waltz. 1992. Statistical methods, artificial intelligence, and information retrieval. In Text-based intelligent systems: Current research and practice in information extraction and retrieval, ed. P.S. Jacobs, Lawrence Erlbaum, pp. 215-226.

Treisman, A. 1986. Features and objects in visual processing. Scientific American 254: 114-124.

Wehrend, S. 1992. Taxonomy of Visualization Goals (Appendix). In P.R. Keller and M.M. Keller (Eds.), Visual Cues, pp. 187-199.

Wehrend, S. and C. Lewis. 1990. A problem-oriented classification of visualization techniques. Proceedings IEEE Visualization '90, 139-143.

Willett, P. 1985. An algorithm for the calculation of exact term discrimination values. Information Processing & Management 21(3): 225-232.

Willett, P. 1988. Recent trends in hierarchic document clustering: a critical review. Information Processing & Management 24: 577-597.

Wise, J.A., J.J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur, and V. Crow. 1995. Visualizing the non-visual: spatial analysis and interaction with information from text documents. Proceedings of Information Visualization, October 20-21, 1995. IEEE Computer Society Press, Los Alamitos, CA. 51-58.

York, J. and S. Bohn. 1995. Clustering and dimensionality reduction in SPIRE. Presented at the Automatic Intelligence Processing and Analysis Symposium, Mar 28-30, Tysons Corner, VA.

Zhou, M.X. and S.K. Feiner. 1998. Visual task characterization for automated visual discourse synthesis. CHI '98, 392-399.