Design of 3-D Visualization of Search Results:
Evolution and Evaluation

John Cugini / cuz@nist.gov
Sharon Laskowski / laskowski@nist.gov
Information Technology Laboratory
National Institute of Standards and Technology (NIST)
Gaithersburg, MD 20899

Marc Sebrechts / sebrechts@cua.edu
The Catholic University of America
Washington, DC 20064-0001


Contribution of the National Institute of Standards and Technology. Not subject to copyright. Reference to specific commercial products or brands is for information purposes only; no endorsement or recommendation by the National Institute of Standards and Technology, explicit or implicit, is intended.

Abstract

We discuss the evolution of the NIST Information Retrieval Visualization Engine (NIRVE). This prototype employs modern interactive visualization techniques to provide easier access to a set of documents resulting from a query to a search engine. The motivation and evaluation of several design features, such as keyword to concept mapping, explicit clustering, the use of 3-D vs. 2-D, and the relationship of visualization to logical structure are described. In particular, the results of an extensive usability experiment show how visualization may lead to either increased or decreased cognitive load.

Keywords

Comparison of 3-D and 2-D; Design of Visualization; Evaluation of Visualization; Information Visualization; Usability Experiment

1. Background and Motivation

For the past four years, the Information Technology Laboratory of the National Institute of Standards and Technology (NIST) has supported a small project ([Cugi96], [Cugi97]) to explore the potential value of visualization for information access. In particular, we were interested in exploiting 3-D technology to help users understand and manipulate search results, i.e. the set of documents returned by a search engine in response to some query.

There have been some attempts to provide an overview of the design space for information visualization (see [Chal95], [Card97], and [Zhou98]). These offer a top-down framework within which particular visualizations may be categorized. As opposed to a "unified field theory" of information visualization design, this paper takes a bottom-up approach: we present a case study of iterative design, from which some familiar and some novel lessons emerged. We hope that this detailed critique of various prototypes can serve as a guide to other researchers who wish to conduct meaningful testing and evaluation of new approaches to document visualization.

2. Relationship to Previous Work

By now, the literature on visualization of textual databases is too extensive to review in detail (see [Youn96], [Card96], and [Card99] for survey articles). We will mention only those prototypes that aim to visualize a result set, not an entire database. Although there are obvious similarities, these two goals are quite distinct, as we shall see. Even when we so restrict the field of interest, we find a wide variety of approaches.

2.1 Details of Previous Prototypes

Allen et al. [Alle93] developed a system of hierarchical clustering, displayed as an interactive tree, with logical zooming. There is always a selected document and a selected subtree. The basic organizing principle of the tree is similarity to the selected document. The 2-D visualization is contained in one of four windows and serves as an overview. The other three windows have textual details: they contain the current query, the subtree document lists, and the text of the selected document.

Envision [Nowe96] is a flexible interface to a digital library. Its basic model is the scatterplot graph. It allows users to decide which document attributes (e.g. relevance rank, score, author, date, index terms) will be mapped to which visual attributes (e.g. location, size, shape, color). It was decided to limit Envision to 2-D for the sake of wider availability. Envision allows only one index term per document as an attribute because of the "usability problems" that would arise from multiple terms. The user evaluation consisted of comparing subjects' performance to that of the interface designer, not to a text-based equivalent. A satisfaction survey showed a positive reaction to the interface.

Hearst and Pedersen [Hear96] use a scatter/gather algorithm to do dynamic clustering and refinement of search results. The clustering "consistently outperforms ranked titles" in retrieval precision. The interface, however, is not a true visualization, but a GUI that textually displays the clusters, labelled with their characteristic terms. That is, the emphasis is on the advantages of one structure over another, not on visualization per se.

In [Veer97], Veerasamy and Heikes describe a simple 2-D grid system with search keywords along the y-axis, and document identifiers along the x-axis, using the rank order as returned by the search engine. Each cell of the grid shows the frequency of the corresponding keyword within that document. A careful study showed that the addition of this visual interface to the usual text-based interface allowed users to judge document relevance more quickly and accurately. Visualization was particularly effective in the identification of irrelevant documents as such.

More recently, Swan and Allan [Swan98] performed a controlled study comparing 1) a text-based system [ZPRISE], 2) a GUI-oriented system, and 3) the latter enhanced with 3-D visualization of document clusters. They wanted to improve so-called "aspect-oriented" IR, which emphasizes finding some specified information, not documents per se. In that context, using recall as a measure, there was a small advantage for the 3-D system over text-based, and for the text-based over the plain GUI. However, there was no evidence for the overall effectiveness of the use of 3-D. In addition, the utility varied depending on task and users. Experienced users preferred the text-based system, while novices liked the GUI systems. Some users thought the 3-D approach was "worthless", others thought it natural and intuitive.

In another recent paper, Borner [Born00] describes a system in which Latent Semantic Analysis (LSA) is used to pre-compute inter-document similarity within a known collection. The result set of a query on the collection is then clustered based on this analysis. The clustering and other document details are represented in a rich 3-D environment, using the CAVE interface. A force-directed algorithm is used to lay out the clusters in 3-D space. As of January 2000, no usability testing has been performed.

[Shne00] describes a tool presenting search results in a grid-like system, in which one of the axes represents a hierarchy. Early user tests have yielded encouraging results.

[Mann99] describes an interesting project which allows users to choose dynamically among several of these visualizations.

2.2 Summary of Prototypes

In sum, only [Veer97] and [Swan98] have directly compared the effectiveness of visualization against functionally equivalent traditional interfaces. These experiments have demonstrated modest to significant improvements. Furthermore, apart from [Swan98], the experimental visualizations have been very conservative: trees or 2-D grids. The following table summarizes these efforts, with NIRVE on the last line.

System    Structure    Structure basis     Visualization  Evaluation Baseline
[Alle93]  tree         term similarity     2-D            ranked list (informal)
[Nowe96]  scatterplot  attributes          2-D            designer's performance
[Hear96]  clusters     term similarity     GUI            ranked list
[Veer97]  grid         terms               2-D            text equivalent
[Swan98]  clusters     term similarity     3-D            GUI and text equivalents
[Born00]  clusters     LSA                 3-D            None
[Shne00]  grid         pre-categorized     2-D            None
NIRVE     clusters     concept similarity  3-D            2-D and text equivalents

3. Evolution of the NIRVE Prototype

There are a few design goals and constraints that have stayed constant throughout the development of NIRVE. Since it was conceived as a post-processor for NIST's PRISE [ZPRISE] search engine, NIRVE was based on the information that PRISE accepted and returned. With a few enhancements, PRISE basically accepts a set of terms, or keywords, as a query; it does not take boolean combinations. It returns a set of entries, one per document. Each entry contains: The number of documents returned is controlled by the query. Typically, we dealt with result sets of size 100-500. As a database, we used all the news stories (about 90,000) issued by the Associated Press for 1988, as made available through the Text Retrieval Conference [TREC].

Early on, we developed a normalized keyword profile as the basic metric for each document. This profile is a vector with one component per query keyword. Each component is calculated as the square root of the number of occurrences of the corresponding keyword divided by the document length, scaled to a maximum value of one for each component.
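
The keyword profile can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NIRVE code: the function name and the input layout are hypothetical, the square root is assumed to apply to the ratio of occurrences to document length, and the scaling is assumed to divide each component by its largest value over the result set.

```python
import math

def keyword_profiles(counts, lengths):
    """Return one normalized keyword profile per document.

    counts[d][k] is the number of occurrences of keyword k in document d;
    lengths[d] is the length of document d (e.g. in words).
    Assumption: component = sqrt(count / length), then each component is
    divided by its maximum over all documents so that it peaks at one.
    """
    raw = [[math.sqrt(c / n) for c in doc] for doc, n in zip(counts, lengths)]
    if not raw:
        return []
    n_keywords = len(raw[0])
    maxima = [max(doc[k] for doc in raw) for k in range(n_keywords)]
    return [
        [v / m if m > 0 else 0.0 for v, m in zip(doc, maxima)]
        for doc in raw
    ]
```

With this scaling, every profile is a point whose coordinates all lie between zero and one, which is the property the later distance calculations rely on.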

We were inclined to experiment with highly metaphorical visualizations, rather than something simple and schematic, such as a grid. Our emphasis has always been on presenting the user with an overview of the structure of the result set, rather than concentrating on finding an individual document. Typically, visualization is more helpful in such broad integrative tasks [Wick94] than in narrow searching. In all the models, the user can move the display around in 3-D (rotate and shift), and also select icons for individual operations, e.g. to view the full text of the document. As we shall see, keywords have associated colors. When displaying the full text, the keywords are rendered in the appropriate color.

All the NIRVE prototypes have been implemented as one or more graphical windows managed by OpenGL, and a control menu window managed by Tcl/Tk. These processes communicate via X events [Nye93].

3.1 Spiral Model

3.1.1 Spiral Design

Spiral Metaphor
In our first model, we tried to preserve the sequential structure of the ranked list returned by PRISE, and enhance it with additional information. We arranged document icons along a spiral in 2-D, with the top-ranked document in the middle, and the others spaced out along the spiral proportional to their scores (Figure 1). Thus the user was encouraged to start at the middle and work out towards the periphery - a reasonable metaphor. We believed that the most important information to convey about each document was its keyword content: this is what the user asked for, and also, what PRISE made available. The icon, therefore, was a simple square containing a bar chart, showing the relative frequency of color-coded keywords.
Keyword Weighting
The legend associating colors with keywords was in a separate window. Each keyword had a colored slider which controlled its weighting factor. As the user increased the factor for a given keyword, document icons containing that keyword were elevated above the plane of the spiral. The elevation was proportional to the sum of the products of a document's keyword frequency and the keyword weighting factor. That is, we used the 3rd dimension for an alternate ranking of the documents based on keyword weights rather than on PRISE scoring.

Figure 1: Spiral Model

3.1.2 Spiral Problems

Spurious Clustering
In the very earliest informal evaluations, the question that always came up was why an apparent cluster of document icons was grouped together. The answer, of course, was that they just happened to be placed there by the layout algorithm, which used the PRISE scores. And so, lesson number 1: people will view spatial arrangement metaphorically whether you want them to or not.
Complexity of Icon Elevation
The idea behind elevating document icons was to allow the user to express which keywords were really important and which were ancillary and then use that information to select the relevant documents. When all the keywords but one had a zero weighting factor, this worked moderately well. But assigning significant weights to several keywords resulted in a display that was hard to interpret. Many icons would float above the spiral plane and it was difficult to make use of their positions (for similar remarks, see [Chal95]). In order to make the weighting more selective, we added an "AND" mode (as opposed to the implicit "OR" mode): in order to be elevated, a document had to have a non-zero frequency for all keywords which had non-zero weights. For example, if the user emphasized the keywords "michael" and "jordan", then a document had to have some occurrence of both keywords to be elevated. The moral here is less clear - perhaps two can be suggested. First, make selection mechanisms selective; it doesn't help to highlight 50% of a collection. Second, 3-D gets confusing very quickly; layout in 3-space must be done with great restraint.
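
The elevation rule and its two modes can be illustrated with a short Python sketch. The function name and argument layout are hypothetical; the only parts taken from the text are that elevation is the sum of frequency-weight products and that "AND" mode requires a non-zero frequency for every keyword with a non-zero weight.

```python
def elevation(profile, weights, mode="OR"):
    """Height of a document icon above the spiral plane.

    profile[k] is the document's frequency for keyword k and weights[k]
    is the slider setting for that keyword (zero means "ignore").
    In "AND" mode the icon stays on the plane unless the document
    contains every keyword that has a non-zero weight.
    """
    active = [k for k, w in enumerate(weights) if w > 0]
    if mode == "AND" and any(profile[k] == 0 for k in active):
        return 0.0
    return sum(p * w for p, w in zip(profile, weights))
```

For the "michael" and "jordan" example, a document mentioning only "michael" is raised in "OR" mode but stays on the plane in "AND" mode, which is what makes the second mode more selective.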

3.2 3-D Axes Model

3.2.1 3-D Axes Design

3-D Axes Metaphor
Another early design, developed simultaneously with the spiral model, was the use of 3-D axes. In the first iteration, the user could dynamically select three keywords to be assigned to the X, Y, and Z axes and the icon would be placed in the location corresponding to those three components of its keyword profile (Figure 2). Each document icon still had a bar chart for the full keyword profile. The supposed advantage in this model was that each spatial dimension would have a direct and meaningful semantic interpretation.
Keyword Aggregation
Since we were limited to three spatial dimensions, we extended the model to allow the user to assign sets of keywords to each dimension. The natural tendency was to bundle together keywords which were close in meaning - the first hint of keyword aggregation.

Figure 2: 3D Axes Model

3.2.2 3-D Axes Problems

Volumetric Occlusion
It quickly became apparent that naively scattering icons in 3-space led to occlusion and confusion. Certain features, such as outliers, and documents with a zero value for one or more of the three axes did emerge. But most of the icons were located in the general volume down near the origin and were very hard to distinguish. Moreover, unlike the spiral model, there was no obvious sequential order in which the documents could be accessed. And so, reinforcing what we found with the spiral model: unless the data are already very structured (either naturally or as a result of some analysis and feature extraction), volumetric style 3-D is unlikely to be easily understood.
No Natural Clustering
We had hoped that some natural clustering might emerge from our 3-D scatterplot, but this did not seem to happen in the examples we tried. In retrospect, this was not surprising. This was not, after all, a scatterplot of the entire database, but rather a small subset, chosen precisely because its members matched a query. Since the documents were chosen for their similarity with respect to the query, the query keywords themselves are not likely to provide a means of strongly differentiating among them. By contrast, using all the keywords of the database to form high-dimensional document vectors may well result in a good clustering, as in [Hear96].

3.3 Nearest Neighbor Circle Model

3.3.1 Nearest Neighbor Circle Design

Document Sequence
The two main ideas behind the Nearest Neighbor Circle (NNC) were to radically simplify the visual display to avoid excess occlusion, and to perform some analysis internally rather than relying on visualization per se to generate good clustering. First, we defined a distance metric between documents, such as the Euclidean distance between their keyword profiles (recall the keyword profile is essentially an n-dimensional point, with all values between zero and one). NNC then orders the documents by applying the nearest neighbor algorithm: the successor of a document is the closest document which has not yet been chosen.
Circle Metaphor
The corresponding icons are then arranged in a circle, with the icons sitting upright, somewhat like photographic slides in a circular tray (Figure 3). The spacing is not uniform, however, but is proportional to the distance between adjacent documents. Thus, large visual gaps form implicit cluster boundaries - and indeed, we did find that somewhat sensible groups were generated by this procedure.
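
The ordering and spacing steps might look like the following Python sketch. It is an illustration only: the function names are hypothetical, the sequence is assumed to start at the first document in the result set, ties are broken by document index, and the circle is assumed to close with a gap between the last and first documents.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two keyword profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_order(profiles, start=0):
    """Greedy ordering: each successor is the closest document not yet chosen."""
    if not profiles:
        return []
    remaining = set(range(len(profiles)))
    remaining.discard(start)
    order = [start]
    while remaining:
        last = profiles[order[-1]]
        nxt = min(remaining, key=lambda i: (euclidean(last, profiles[i]), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def circle_angles(profiles, order):
    """Angle (radians) of each icon; spacing is proportional to distance."""
    successors = order[1:] + order[:1]  # wrap around to close the circle
    gaps = [euclidean(profiles[a], profiles[b]) for a, b in zip(order, successors)]
    total = sum(gaps)
    angles, angle = [], 0.0
    for gap in gaps:
        angles.append(angle)
        # Fall back to uniform spacing if all documents are identical.
        angle += 2 * math.pi * (gap / total if total > 0 else 1 / len(gaps))
    return angles
```

Because the greedy ordering never revisits a document, similar documents end up adjacent, and a large jump in the resulting angles marks an implicit cluster boundary.
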
Keyword Weighting
As in the spiral model, NNC supported the ability to elevate icons based on dynamic weighting of keywords. Because the icons were arranged upright in a circle (essentially a 1-D structure) rather than laid out flat on a 2-D spiral, the result of elevating icons was much clearer. Furthermore, the elevation would show distinct patterns: nearby icons tended to all be raised or all be left on the base plane. This was a result, of course, of putting documents with similar profiles near each other.

Figure 3: Nearest Neighbor Circle Model

3.3.2 Nearest Neighbor Circle Problems

Too Many Keywords
We noticed that when the query consisted of just a few (4-5) keywords, the resulting structure of the display usually turned out to be reasonably coherent. By contrast, although the algorithms could handle any number of keywords, the visualization became harder to understand if many keywords were used. Natural clusters of documents tended to fragment into small, non-adjacent sub-clusters. This was especially frustrating since sometimes clusters would fragment based on keywords that were essentially synonyms - e.g. "tornado" vs. "twister".
Implicit Clusters
Although the size of the gaps between documents provided a clue about implicit cluster boundaries, it seemed as if marking these explicitly would help. Also, there was no easy way to tell what a cluster was about other than scanning through its icons and noticing which keywords were dominant.
No Workspace
Finally, the only thing users could really do with the documents was to look at the visualization, play with the keyword weighting, and view full text. There was no way to designate or save a desirable subset of documents.

3.4 Spoke and Wheel Model

3.4.1 Spoke and Wheel Design

The Spoke and Wheel prototype introduced a number of new features; the three most important were keyword-concept mapping, explicit clustering, and user marking and filtering.
Mapping Keywords to Concepts
Keyword-concept mapping allows the user to dynamically aggregate keywords into a presumably smaller set of concepts. For instance, the keywords "tornado", "twister", and "storm" might all be mapped to the single concept "STORM". Although it is common for each keyword to be mapped to exactly one concept, it is not required. This change typically cuts down the number of dimensions significantly and therefore simplifies the resulting visualization. Each document is then characterized by a concept profile, rather than keyword profile, and the bar chart of its icon reflects color-coded concepts, not keywords. Using concepts to describe a document often makes more sense semantically than the full set of keywords. Several keywords may be included in a query to make sure that all relevant documents are returned, but they may not denote any meaningful distinction in the subject matter of interest. Control of this mapping was incorporated into the keyword slider window via a keyword-concept matrix (Figure 4). The user can click each cell to toggle its value: a checkmark indicates that the corresponding keyword and concept are associated.

Figure 4: Keyword/Concept Matrix
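
As a rough illustration of the keyword-concept matrix, the Python sketch below derives a concept profile from a keyword profile. The function name is hypothetical, and the aggregation rule is an assumption: the text does not say how the values of several keywords are combined, so the sketch simply takes the largest component among the keywords mapped to each concept.

```python
def concept_profile(keyword_profile, matrix):
    """Collapse a keyword profile into a concept profile.

    matrix[c][k] is True when keyword k is mapped to concept c (a
    checkmark in the matrix). Assumption: a concept's value is the
    maximum of its keywords' values; a concept with no keywords gets 0.
    """
    return [
        max((v for v, on in zip(keyword_profile, row) if on), default=0.0)
        for row in matrix
    ]

# Keywords: tornado, twister, storm, damage. Concepts: STORM, DAMAGE.
matrix = [
    [True, True, True, False],    # STORM
    [False, False, False, True],  # DAMAGE
]
profile = concept_profile([0.2, 0.0, 0.7, 0.4], matrix)
```

Here four keyword dimensions collapse to two concept dimensions, and a document that uses "storm" but never "twister" is no longer separated from one that does the opposite.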

Explicit Adjustable Clusters
As with NNC, a sequence of documents is calculated (based on their concept profiles), but now clusters are made explicit. We defined a cluster boundary as any gap in the document sequence larger than a given threshold. This had the interesting consequence of enabling dynamic control of cluster granularity. The user can request fewer, bigger clusters, causing the prototype to increase the threshold for gap size - i.e. only larger gaps will count as cluster boundaries. Conversely, lowering the threshold induces smaller, more numerous clusters.
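
The gap rule is simple enough to sketch directly in Python. The function name and input layout are hypothetical; the rule itself (a gap larger than the threshold starts a new cluster) is the one described above.

```python
def split_into_clusters(order, gaps, threshold):
    """Cut a document sequence into clusters at every large gap.

    order is the document sequence; gaps[i] is the distance between
    order[i] and order[i + 1]. A gap larger than the threshold is
    treated as a cluster boundary.
    """
    if not order:
        return []
    clusters = [[order[0]]]
    for doc, gap in zip(order[1:], gaps):
        if gap > threshold:
            clusters.append([doc])   # boundary: start a new cluster
        else:
            clusters[-1].append(doc)
    return clusters
```

Raising the threshold merges clusters and lowering it splits them, so a single slider value is enough to control granularity.
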
Cluster Icons and Spatial Arrangement
As before, document icons stand upright on a base plane, but now there are also larger 3-D icons for explicit clusters. The concept profile of a cluster is defined as the average of the profiles of its documents. The cluster icons are arranged around a circle, facing outward. The associated document icons are arranged outward along a radius aligned with the cluster icon (Figure 5). The angular distance between clusters is proportional to the logical distance between them; likewise the radial distance between documents reflects the distance metric separating their concept profiles.
Textual Equivalent
Because the clusters were now explicit structures, we could generate a web page in which document titles were organized correspondingly. The direct motivation was to allow the user to see several titles at once; however, this also formed the basis of later experiments comparing interfaces with the same logical structure but different visual presentations, namely text, 2-D, and 3-D.
Marking and Filtering
The third principal innovation was decorating every icon with a small colored flag, indicating the user's judgment of its value: red for bad, yellow for undecided, and green for good. The user could mark entities at the cluster or document level. Furthermore, once entities were marked, the user could filter dynamically based on these attributes, e.g. show good and undecided documents, but suppress bad ones. In particular, the suppression of irrelevant clusters served to simplify the entire display.
Concept Weighting
This prototype still retained the ability, inherited from the spiral model, to assign a weight to each concept. However, this weighting no longer caused document icons to be elevated; rather it was used as a scaling factor for each dimension, and this in turn affected the distance metric and clustering among documents. A low weight for a concept means that it does not matter much if two documents differ with respect to that concept. Conversely, a high-weighted concept magnifies the logical distance between such document pairs. We found that this scheme was too subtle for most users, who generally ignored the sliders.
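
One plausible reading of "scaling factor for each dimension" is a weighted Euclidean distance, sketched below in Python. The function name is hypothetical, and the choice of Euclidean distance with the weight applied before squaring is an assumption rather than a detail given in the text.

```python
import math

def weighted_distance(a, b, weights):
    """Distance between two concept profiles after per-concept scaling.

    Assumption: each dimension is multiplied by its concept weight
    before the ordinary Euclidean distance is taken.
    """
    return math.sqrt(sum((w * (x - y)) ** 2 for x, y, w in zip(a, b, weights)))
```

Two documents that differ only in the first concept are at distance 1 with equal weights, but move ten times closer when that concept's weight drops to 0.1, so they are more likely to fall into the same cluster.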

Figure 5: Spoke and Wheel Model

3.4.2 Spoke and Wheel Problems

Distinguishing Documents within a Cluster
Organizing documents according to concepts tended to generate fairly clean homogeneous clusters. But this success now made it more difficult to distinguish among documents within a cluster based solely on their (quite similar) concept profiles. The user could slide the mouse along a row of document icons to cause their titles to appear sequentially, but the spatial arrangement and bar charts of the document icons really conveyed very little information.
Cluster Relationships
The arrangement of clusters around a circle was essentially one-dimensional and this precluded a good visualization of the relationships among clusters. Note that inter-cluster distance was well-defined for all pairs of clusters, not just those adjacent along the circle.
Matrix Interface
While the association matrix between keywords and concepts was logically correct and complete, it tended to intimidate some users. Moreover, to make things easier, we limited concept names to capitalized versions of the keywords. Users therefore had to distinguish between lowercase keywords occupying columns and uppercase concepts occupying rows.
Disjunctive Aggregation Only
Keyword-concept mapping was motivated by the presence of near-synonyms in queries. For these, disjunction was the appropriate model. But in other cases, particularly proper names, there was good reason to want conjunctive aggregation of keywords (e.g. the concept "EPA" = "environmental" and "protection" and "agency"). This was interestingly similar to our earlier refinement of keyword elevation: we started with "OR" mode, and wound up needing an "AND" mode as well.
Confusion over Input Modes
A long-standing problem concerned input mode. We wanted to allow the user both to manipulate the 3-D display (move mode) and to select icons within it (pick mode). Both modes required all three mouse buttons, so we could not, for instance, use button #1 for move and button #2 for pick. We tried to make these modes very apparent, by setting the cursor to indicate which was in effect, and enabling use of the spacebar as a toggle. But in early experiments done by our collaborators at the Catholic University of America [Sebr99], users uniformly reported confusion and frustration over this issue. We conclude that mode-switching has a high cognitive load; this presents a difficult design problem when the user is limited to a single 2-D input device.

3.5 Concept Globe Model

3.5.1 Concept Globe Design

Simplify Cluster Definition
We had found that when adjusting the cluster threshold, users tended to form clusters in which all of the documents had the same set of concepts present, i.e. their cluster profiles were similar in that all had the same set of non-zero components. More generally, it seemed to us from experience that slight variations in cluster profile conveyed virtually no useful information - what was really wanted was information about the presence or absence of a concept within a cluster or document. And so, we decided to drastically simplify the clustering algorithm: a cluster is defined as a set of documents all of which have some occurrence of the same subset of concepts. Thus, if five concepts are being used to distinguish among the documents, there are at most 32 clusters: one with all five concepts, five with four concepts, ten with three concepts, and so on.
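
This simplified clustering amounts to grouping documents by the set of concepts they contain. The Python sketch below illustrates the idea; the function names are hypothetical, and concepts are identified by their index in the profile.

```python
from collections import defaultdict
from math import comb

def cluster_by_concepts(concept_profiles):
    """Group document indices by the set of concepts present (value > 0)."""
    clusters = defaultdict(list)
    for doc_id, profile in enumerate(concept_profiles):
        present = frozenset(c for c, v in enumerate(profile) if v > 0)
        clusters[present].append(doc_id)
    return dict(clusters)

def cluster_counts(n_concepts):
    """Maximum number of clusters having n, n-1, ..., 0 concepts."""
    return [comb(n_concepts, k) for k in range(n_concepts, -1, -1)]
```

For five concepts the counts are the binomial coefficients 1, 5, 10, 10, 5, 1, which sum to 32, the upper bound quoted above.
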
Globe Metaphor
How then to arrange clusters spatially? It seemed reasonable that the number of concepts was an important organizing principle; a cluster with four or five concepts seemed more promising than one with only one or two. We decided to arrange clusters on the surface of a globe (Figure 6). The cluster icon was now a box, whose thickness represented the number of documents it contained, and whose face held the familiar colored bar chart for concept profile. The latitude of an icon is determined by the number of concepts it represents. Conveniently, there were unique locations - the North and South Pole - for the unique clusters with all and no concepts. Also, there was more room in the middle latitudes for the more numerous clusters with an intermediate number of concepts. It turned out in later experiments that subjects readily understood this metaphor.
Showing Cluster Relationships
What about the relationship among clusters? The closest relationship was when two clusters differed by the presence of a single concept. Note that two such adjacent clusters would necessarily be in two adjacent bands of latitude. Therefore, we developed a heuristic procedure to assign longitudinal position so as to try to keep such adjacent pairs close; i.e. longitude had a relational but not an absolute meaning. Of course, we had to avoid overlap within a latitude band. The mere location of the cluster icons was not a strong enough visual cue, and so we connected logically adjacent clusters by an arc, whose color corresponded to the conceptual difference between them. E.g. if cluster A has the concepts "boat", "sink", and "ocean", and cluster B has "boat", "sink", "ocean", and "storm", then they will be connected by an arc color-coded for "storm". These arcs were put in almost as an afterthought, but turned out to be quite successful: the subjects in our evaluation experiments were able to use them to navigate among the clusters.
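
The latitude rule and the arcs between adjacent clusters can be sketched in Python as follows. The function names are hypothetical, and the even spacing of latitude bands between the poles is an assumption; the text only says that latitude is determined by the number of concepts. The heuristic for assigning longitude is not reproduced here.

```python
def latitude(cluster, n_concepts):
    """Latitude in degrees: +90 for all concepts, -90 for none.

    Assumption: bands are spaced evenly between the two poles.
    """
    return -90.0 + 180.0 * len(cluster) / n_concepts

def arcs(clusters):
    """Return (smaller, larger, concept) for clusters differing by one concept.

    Each cluster is a frozenset of concept names; the third element is
    the concept whose color the connecting arc would take.
    """
    ordered = sorted(clusters, key=lambda s: (len(s), sorted(s)))
    result = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            diff = a ^ b  # concepts present in exactly one of the two
            if len(diff) == 1:
                (concept,) = diff
                result.append((a, b, concept))
    return result

a = frozenset({"boat", "sink", "ocean"})
b = frozenset({"boat", "sink", "ocean", "storm"})
c = frozenset({"boat"})
links = arcs([a, b, c])
```

In this example only the pair from the text is linked, by an arc labelled "storm"; the cluster containing just "boat" differs from the others by more than one concept and so gets no arc.
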
First-Class Concepts
We decided that concepts should be first-class objects, not just uppercase versions of the keywords. Users could now freely assign names and colors to concepts, specify whether they were conjunctive or disjunctive, and add and delete them. Along with this, we changed the interface for control of the keyword-concept mapping. Instead of a matrix, we designed an interactive legend in which the colored concepts were shown in a row, each with a column of its keywords beneath it. Users could change the mapping by dragging and dropping keywords among the concept columns. The last column is always reserved for the UNUSED concept, in which unmapped keywords (if any) are stored.

We did not take the final step of allowing concepts to be any boolean function of keywords, although that might be useful in some cases, e.g. PRESIDENT = ((bill or william) and clinton). Our tentative judgment is that the utility is outweighed by the complexity of meaning (most users are not skillful at formulating logical expressions) and of the interface necessary to specify such combinations.
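
The two supported kinds of concept reduce to an "any" versus "all" test over the mapped keywords, as in this Python sketch. The function name and the dictionary of keyword counts are hypothetical, and treating a concept with no keywords as absent is an assumption.

```python
def concept_present(keywords, counts, conjunctive=False):
    """Decide whether a concept occurs in a document.

    keywords are the keywords mapped to the concept; counts maps each
    keyword to its number of occurrences in the document. A disjunctive
    concept needs any keyword; a conjunctive one needs all of them.
    """
    if not keywords:
        return False
    hits = [counts.get(k, 0) > 0 for k in keywords]
    return all(hits) if conjunctive else any(hits)

epa = ("environmental", "protection", "agency")
storm = ("tornado", "twister", "storm")
doc = {"environmental": 2, "protection": 1, "twister": 1}
```

With these inputs the disjunctive STORM concept is present because "twister" occurs, while the conjunctive EPA concept is absent because "agency" does not.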

Representing Documents
For the first time, we decided not to show document icons by default. The concept profiles of documents within a cluster were now, by definition, only variations in quantity among a fixed subset of concepts, so showing the bar charts seemed almost useless. The salient issues then become how to distinguish among a cluster's documents, and how to design meaningful icons. Since we could not get access to the full text or full term vector (as used in [Hear96]) of documents quickly enough to support real-time operation, the only additional information available was the documents' titles and relevance scores (as returned by the search engine).

This raises an important issue: presumably, what the user really wants to know is what a document is about, in the true semantic sense. Concept profiles and term vectors are merely possible indicators of a document's true meaning. We use them because they are susceptible to automatic manipulation, not because they are perfect representations of a document. A document title is normally more informative than these, but it is trickier to use as an object of computation.

We arrange the icons for documents within a cluster on a 2-D document field. A document icon is a simple rectangle containing the title (not a bar chart), along with a little value flag as discussed above. These icons are arranged in the document field such that similar titles (i.e. those containing some matching words - we developed a simple metric for title similarity) have nearby horizontal positions. Vertical position is controlled by the score assigned by the search engine. Thus similar titles appear in the same column, with better scores towards the top of the column.
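
The layout of the document field can be illustrated with the Python sketch below. The title similarity metric is not specified in the text, so the one used here is purely an assumption: two titles are treated as similar when they share at least one word longer than three letters, and columns are formed greedily. All names are hypothetical, and titles are assumed to be unique.

```python
def title_words(title):
    """Lower-cased words of a title, ignoring very short words."""
    return {w for w in title.lower().split() if len(w) > 3}

def layout(docs, min_shared=1):
    """Map each title to a (column, row) cell of the document field.

    docs is a list of (title, score) pairs. Titles sharing at least
    min_shared words with a column join it; row 0 is the top of the
    column and holds the best score.
    """
    columns = []
    for title, score in docs:
        words = title_words(title)
        for col in columns:
            if len(words & col["words"]) >= min_shared:
                col["words"] |= words
                col["docs"].append((score, title))
                break
        else:
            columns.append({"words": set(words), "docs": [(score, title)]})
    positions = {}
    for c, col in enumerate(columns):
        ranked = sorted(col["docs"], key=lambda d: -d[0])
        for r, (_, title) in enumerate(ranked):
            positions[title] = (c, r)
    return positions
```

Any metric that puts similar titles in the same column would serve the same purpose; the essential points are that horizontal position reflects title similarity and vertical position reflects the search engine's score.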

Clusters can be opened or closed, independently. When a cluster is opened, its 2-D document field is projected outward from the cluster icon and the view automatically zooms in on the field. Thus users can decide whether to display just an overview of the entire result set, or show details selectively.

Input Devices and Modes
We finally solved the nagging input mode problem by the simple expedient of using a second input device. A Spaceball [Spac99] (a 6-dimensional input device) is used to move and rotate the entire 3-D display. The mouse is used solely for picking. This elementary change caused a major improvement in user satisfaction.

Figure 6: Globe Model

3.5.2 Concept Globe Evaluation

We had been performing informal evaluations of the various prototypes described above. Once the globe model became stable, however, we prepared to conduct a more formal usability experiment. In particular, we wanted to measure the effects of various visualization modes, namely 3-D, 2-D and text. Moreover, we wanted to carefully isolate the effects of these modes, and not confound them with the effects of functional differences among prototypes. Therefore, we developed a 2-D and text version of the globe model, preserving as much of the functionality as feasible. These prototypes were the object of a detailed usability experiment, as reported in [Sebr99], from which the following is excerpted.

In the 2-D model, the globe was flattened into a map on which all clusters could be displayed simultaneously. Since there is no third dimension to convey cluster box thickness, this information is conveyed as the width of a gray bar located at the bottom of the box. Arcs indicating conceptual similarity are depicted as straight lines, and the field of document titles is simply drawn over the display of cluster icons.

In the text model, an HTML file is displayed in Netscape. Clusters are represented basically as lists of document titles. Each cluster is labeled with a textual colored concept profile. The order of clusters is according to the number of concepts contained, analogous to the north-to-south arrangement on the globe.
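
To make the text model concrete, the following Python sketch generates such an HTML listing. It is a minimal illustration only: the function name, the data shapes, and the use of font color tags are assumptions, and "most concepts first" is assumed to be the ordering that mirrors north-to-south on the globe.

```python
from html import escape


def render_text_model(clusters, colors):
    """Render clusters as HTML lists of document titles.

    clusters: list of (concepts, titles) pairs
    colors:   {concept: color name} (hypothetical color-concept mapping)
    """
    parts = []
    # Most concepts first (assumed to mirror north-to-south on the globe).
    for concepts, titles in sorted(clusters, key=lambda c: -len(c[0])):
        # The concept profile is a row of concept names, each in its color.
        profile = " ".join(
            f'<font color="{colors[c]}">{escape(c)}</font>' for c in concepts
        )
        items = "".join(f"<li>{escape(t)}</li>" for t in titles)
        parts.append(f"<h3>{profile}</h3>\n<ul>{items}</ul>")
    return "\n".join(parts)
```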

How Many Concepts?
All the models worked better when the result set was organized with a "reasonable" number of concepts, typically four or five. Once the number of concepts reached seven or eight, the resulting display became complex and difficult to interpret. How often do users inquire about topics for which five or so concepts are insufficient (the number of keywords may, of course, be greater)? We tentatively suggest that, most of the time, five or six concepts will be enough to characterize a topic, but this remains an open question.

Text vs. 2-D vs. 3-D
Initially, subjects performed better, as measured by task completion and response times, using the text model than the 2-D model, and using 2-D than 3-D. This was especially true for selective tasks, such as finding a document title. In such cases, the need to open a cluster and scan through the document field was probably the big disadvantage. In retrospect, it may have been a more valid comparison to implement the text version as a list of cluster titles only, each of which would have to be explicitly opened in order to see the contained document titles.

Each subject went through six sessions, however, and by the last session, the performance gap largely closed. 3-D showed the greatest improvement, 2-D somewhat less, and performance using the text model actually seemed to grow slightly worse. Moreover, the performance of "expert" users (those with extensive computer experience) was virtually equal for 3-D and text, and somewhat worse for the 2-D model.

We surmise, first, that novices and experts alike were already familiar with text-like operations, such as scrolling, but that novices had some difficulty adapting to the graphical interfaces; and second, that it took some practice before the spatial metaphors became familiar enough to be used without undue delay.

Cluster Grouping
Overall, the subjects understood and generally liked the organizational aspects of NIRVE including clustering of documents and the relational arrangement of clusters. They used the grouping of concepts into clusters to narrow their search for particular documents. If a particular concept was not of interest, the subject knew which set of documents to avoid. The grouping also contributed to the selection of potential documents because it showed concept combinations that might not have otherwise been considered.

The relational structure of the clusters was also used to keep track of preferred clusters. The vertical placement of clusters according to the number of concepts helped users adopt the strategy of linking up or down depending on their need of adding or subtracting concepts. Many 2-D and 3-D participants would start from one pole of the globe and navigate through various links. In a number of cases, they began with the lowest level containing the minimal number of "potential" concepts required to find a document, and then worked their way up the globe or map until they found a matching document.
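
The "linking up or down" strategy can be pictured as a walk over a small graph of concept profiles. The Python sketch below is illustrative only: it assumes that each cluster is identified by its set of concepts and that two clusters are linked when their profiles differ by exactly one concept; both function names are hypothetical.

```python
from itertools import combinations


def cluster_levels(profiles):
    """Group concept profiles (frozensets) by size, most concepts first."""
    levels = {}
    for p in profiles:
        levels.setdefault(len(p), []).append(p)
    return dict(sorted(levels.items(), reverse=True))


def cluster_links(profiles):
    """Link clusters whose profiles differ by exactly one concept (assumed rule).

    A symmetric difference of size one means that one profile is the
    other plus a single added concept, i.e. one step up or down a level.
    """
    return [(a, b) for a, b in combinations(profiles, 2) if len(a ^ b) == 1]
```

Following a link therefore always adds or subtracts a single concept, which corresponds to moving between adjacent levels of the globe or map.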

Color Works
The most frequently used feature of the NIRVE interface was color. Users in all three modes took advantage of color-concept mapping. The text condition benefited the most from this dimension, making this otherwise tedious list more efficient than anticipated. Instead of skimming or quickly reading the list of concepts at the beginning of each cluster, the subjects adopted the strategy of scanning the associated colors. This strategy is efficient because visual scanning of color, an automatic process, takes less time and effort than scanning words.

Visualization vs. Text
Subjects using the 2-D and 3-D models had difficulty using the document field to find titles. Legibility was a problem, and the 2-D layout was complex compared to a familiar one-dimensional scrollable list of titles. Perhaps this is a case of "over-spatialization". Especially for a list of moderate size (e.g. 20-30 titles), there is probably little to be gained by structuring the set and then visualizing the result. In our experiment, we used result sets of only 100 titles; larger result sets (and hence larger clusters) might profit from such visualization.

3-D vs. 2-D
Although the globe was more visually appealing, it presented problems for many users. First was our familiar nemesis, occlusion: roughly half the clusters were not visible at any time because they were on the back side of the globe. Second, subjects tended to get disoriented. They would find a needed cluster, go look at another one, and then have trouble re-locating the original one. We put alphabetic markers along the equator to help give a sense of absolute location, but they did not seem to help. In contrast, the 2-D version showed everything at once, and the only manipulations allowed were panning and zooming - whereas in the 3-D model, the scene could be shifted in any of three directions and rotated around the X or Y axis. In short, the surface of the globe is a 2-D manifold, and, in this application, there was no real advantage to curving it through 3-D space.
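
The occlusion problem follows directly from the geometry. As a rough sketch (not the NIRVE rendering code), the Python fragment below places clusters on a unit sphere by latitude and longitude and tests which ones face a distant viewer. The coordinate convention, the function names, and the simplifying assumption of a viewer far enough away that exactly one hemisphere is visible are all ours.

```python
import math


def sphere_point(lat_deg, lon_deg):
    """Unit-sphere position; longitude 0 on the equator faces +Z."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )


def visible_clusters(clusters, view_dir=(0.0, 0.0, 1.0)):
    """Return names of clusters on the hemisphere facing the viewer.

    clusters: {name: (lat_deg, lon_deg)}
    view_dir: direction from the globe's center toward a distant viewer
    """
    visible = []
    for name, (lat, lon) in clusters.items():
        p = sphere_point(lat, lon)
        # A surface point faces the viewer when its outward normal
        # (equal to its position on a unit sphere) has a positive
        # component along the viewing direction.
        if sum(a * b for a, b in zip(p, view_dir)) > 0:
            visible.append(name)
    return visible
```

For clusters spread evenly around the globe, this test hides about half of them from any single viewpoint, whereas a flattened 2-D map has no such hidden side.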

3.6 2.5-D Design

As a follow-up to the usability experiment just discussed, we developed a hybrid model, attempting to combine the better features of the 2-D and 3-D models. In this prototype, cluster icons are laid out on a 2-D map, but the icons themselves have thickness, and the arcs connecting them loop up into the third dimension (Figure 7). Also, when clusters are opened, the document field is projected outward, as in the globe model. The document field itself remains a prime candidate for re-design.

We suspect that for many applications, comprehensible use of 3-D will require a similar strategy: relatively small 3-D entities embedded in a 2-D manifold, rather than full volumetric-style 3-D.

Figure 7: 2.5D Model

4. Conclusions

There are many design dimensions to be aware of when developing a visualization application. Often, the effect of each of these is not distinguished when evaluating a new approach. This is understandable: it can be quite expensive to test even a few variations along several dimensions. However, a credible claim that visualization improved an application must rest on a fair comparison with otherwise equivalent alternatives.

We suspect that because good visualization depends on good structure, what often happens is that developers are motivated to perform a deeper analysis in order to generate that structure. This re-structuring not only enables the visualization but also may suggest more powerful operations and functionality than originally foreseen. Thus, the improved visual version surpasses its non-visual ancestor at least as much because of this process of re-analysis as of the visualization itself.

We hope to extend and refine NIRVE as a vehicle for exploring some of the many unresolved design issues surrounding the visualization of search results.


Acknowledgments

We thank all the following: Dr. Christine Piatko actively collaborated in the early design and development of NIRVE. Michael Miller and Joanna Vasilakis of the Catholic University of America provided valuable help in evaluating various prototypes and offering design suggestions.

References

[Alle93] R. Allen, P. Obry, M. Littman, "An Interface for Navigating Clustered Document Sets Returned by Queries", Proceedings of SIGOIS, pp.203-208, Milpitas, CA, June 1993.

[Born00] K. Borner, "Visible Threads: A Smart VR Interface to Digital Libraries", Proceedings of IS&T/SPIE's 12th Annual International Symposium: Electronic Imaging 2000: Visual Data Exploration and Analysis (SPIE 2000), San Jose, CA, 23-28 January 2000.

[Card96] S.K. Card, "Visualizing Retrieved Information: A Survey", IEEE Computer Graphics and Applications, v.16(2), pp.63-67, March 1996.

[Card97] Stuart K. Card, Jock D. Mackinlay, "The Structure of the Information Visualization Design Space", Proceedings of IEEE Symposium on Information Visualization, Phoenix, AZ, October 1997.

[Card99] Stuart K. Card, Jock D. Mackinlay, Ben Shneiderman, Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann Publishers Inc. San Francisco, CA, 1999.

[Chal95] M. Chalmers, "Design perspectives in visualising complex information", Proc IFIP 3rd Visual Databases Conference, Lausanne Switzerland, March 1995.

[Cugi96] J. Cugini, C. Piatko, S. Laskowski, "Interactive 3D Visualization for Document Retrieval", Proceedings of the Workshop on New Paradigms in Information Visualization and Manipulation, ACM Conference on Information and Knowledge Management (CIKM '96), November 1996.

[Cugi97] J. Cugini, S. Laskowski, C. Piatko, "Document Clustering in Concept Space: The NIST Information Retrieval Visualization Engine (NIRVE)", CODATA Euro-American Workshop on Visualization of Information and Data, Paris, France, June 1997.

[Hear96] M. A. Hearst and J. O. Pedersen, "Reexamining the cluster hypothesis: Scatter/gather on retrieval results", Proceedings of SIGIR '96, Zurich, Switzerland, Aug 18-22 1996.

[Mann99] T.M. Mann, "Visualization of WWW-Search Results", Proceedings of the International Workshop on Web-Based Information Visualization (WebVis'99), pp. 264-268, (in conjunction with DEXA'99, Tenth International Workshop on Database and Expert Systems Applications, eds A.M. Tjoa, A. Cammelli, R.R. Wagner) Florence Italy, September 1-3 1999, IEEE Computer Society.

[Nowe96] L.T. Nowell, R.K. France, D. Hix, L.S. Heath, and E.A. Fox, "Visualizing Search Results: Some Alternatives to Query-Document Similarity", Proceedings of SIGIR '96, Zurich, Switzerland, Aug 18-22 1996.

[Nye93] Adrian Nye, Xlib Reference Manual, O'Reilly & Associates, 1993.

[Sebr99] M. Sebrechts, J. Vasilakis, M. Miller, J. Cugini, S. Laskowski, "Visualization of Search Results: A Comparative Evaluation of Text, 2D, and 3D Interfaces", 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, California, August 1999.

[Shne00] Ben Shneiderman, David Feldman, Anne Rose, Xavier Ferré Grau, "Visualizing Digital Library Search Results with Categorical and Hierarchical Axes", Proceedings of ACM Digital Libraries 2000, San Antonio, Texas, June 2-7, 2000.

[Spac99] See http://www.spacetec.com/index.htm.

[Swan98] R.C. Swan, J. Allan, "Aspect Windows, 3-D Visualizations, and Indirect Comparisons of Information Retrieval Systems", Proc. 21st Annual SIGIR'98, Melbourne Australia, August 1998.

[TREC] Text Retrieval Conference: http://trec.nist.gov

[Veer97] Aravindan Veerasamy, Russell Heikes, "Effectiveness of a graphical display of retrieval results", Proceedings of SIGIR '97, pp. 85-92, Philadelphia, PA, July 27-31, 1997.

[Wick94] C.D. Wickens, D.H. Merwin, E.L. Lin, "Implications of Graphics Enhancements for the Visualization of Scientific Data: Dimensional Integrality, Stereopsis, Motion, and Mesh", Human Factors, 36(1), pp.44-61, 1994.

[Youn96] Peter Young, "Three Dimensional Information Visualisation", Computer Science Technical Report, No. 12/96, Department of Computer Science, University of Durham, UK, November 1996.

[Zami98] Oren Zamir, "Visualization of Search Results in Document Retrieval Systems: General Examination", Department of Computer Science and Engineering, University of Washington, September 1998.

[Zami??] Oren Zamir and Oren Etzioni, "Grouper: A Dynamic Clustering Interface to Web Search Results", Department of Computer Science and Engineering, University of Washington.

[Zhou98] Michelle Zhou and Steven Feiner, "Visual Task Characterization for Automated Visual Discourse Synthesis", Proc. CHI'98, pp.392-399, LA, Calif, April 18-23 1998.

[ZPRISE] NIST PRISE Search Engine: http://www.itl.nist.gov/div894/894.02/works/papers/zp2/main.html