The session provided an overview of the problems encountered in trying to conduct genuine interactive searching evaluation experiments in the context of TREC. The conflicts in the basic methods relate to the nature of the topics, the relevance judgements, and the definition of the search task itself.
Firstly, lengthy topic specifications were found to favour automatic methods of query construction. Over the five rounds of TREC, the topics have evolved from rich discursive queries to brief questions.
Secondly, user relevance judgements could not only differ from those of the assessors but could also be made at different levels; e.g. a document may be pertinent to the topic but not meet all the specified criteria of the request. Moreover, in the case of relevance feedback, a document could be judged a good source of terms but fall outside the topic criteria.
Thirdly, the task specification for interactive searching differed in each of the rounds, culminating in a separate interactive track in TREC-4. The ad hoc task was to find as many relevant documents as possible without too much rubbish in a limited time. Hence the basic laboratory experimental design was gradually modified to develop specialised methods for interactive searching within the TREC framework.
For TREC-5, attempts to make the interactive task compatible with the main TREC tasks were abandoned. Instead, the task was defined as a browsing task, in which a user would aim to retrieve items covering as many different aspects of a topic as possible within a time limit. The total search output was assessed for "aspectual recall" instead of relevance at individual document level. The outcome of this approach and the experimental design has yet to be assessed.
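The report does not give a formula for "aspectual recall". One plausible reading, offered here only as a minimal sketch, is the fraction of a topic's known aspects covered by at least one document in the searcher's total output. The function name, the aspect labels and the document identifiers below are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, Iterable, Set


def aspectual_recall(saved_docs: Iterable[str],
                     doc_aspects: Dict[str, Set[str]],
                     topic_aspects: Set[str]) -> float:
    """Fraction of the topic's aspects covered by the saved documents.

    Assumption: assessors have listed the aspects of the topic and marked
    which aspects each document covers. This is an illustrative reading of
    the measure, not the official TREC definition.
    """
    if not topic_aspects:
        raise ValueError("topic must have at least one aspect")
    covered: Set[str] = set()
    for doc_id in saved_docs:
        # Documents with no assessed aspects contribute nothing.
        covered |= doc_aspects.get(doc_id, set()) & topic_aspects
    return len(covered) / len(topic_aspects)


# Hypothetical topic with four aspects and four assessed documents.
topic = {"a1", "a2", "a3", "a4"}
assessed = {
    "d1": {"a1", "a2"},
    "d2": {"a2"},
    "d3": {"a4"},
    "d4": set(),
}

# The searcher saved d1, d2 and d4: aspects a1 and a2 are covered.
score = aspectual_recall(["d1", "d2", "d4"], assessed, topic)
```

Under this reading the whole search output is scored at once: saving d2 in addition to d1 adds nothing, because its only aspect is already covered, which is how the measure differs from relevance judged at individual document level.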
Mira participants should now consider what contribution could be made to addressing the interactive evaluation issues.
The group discussed different aspects related to the TREC initiative.
The group decided to concentrate first on a crucial distinction between
where the first concerns the comparison of systems, and the second is useful to designers who intend to develop new systems.
The group was aware that this distinction is sometimes not made clear, and that what is expected from an evaluation effort differs depending on the interpretation. (The distinction is discussed, in the context of educational software evaluation, in http://www.psy.gla.ac.uk/~steve/Eval.HE.html ). Having made this distinction clear, the group considered the positive and negative side effects of an effort such as TREC.
The positive effects include:
The negative effects include:
The conclusions of the group were:
So the group would like to see the MIRA working group address and decide on the possibility of acting as a spin-off for a European TREC. Why not plan something within the two and a half years of life of the MIRA working group?
Following the presentation on TREC, we identified what we thought were the main benefits of participation in TREC, namely:
It was observed that the main TREC activity is set up as a "competition" between the various participants, and each participant, in general, has a different research agenda. Consequently, knowledge about IR tends to be developed in a "bottom-up" fashion as a result of the diversity of TREC experiments. We asked what might be achieved if TREC were set up as a non-competitive activity. It happens that the TREC mini-tracks are intended to foster this kind of collaborative activity, and the interactive track was held up as an example. Nevertheless, the element of competition remains. Could a TREC experiment be designed in which some of the experimental variables were rigorously controlled in order to explore a particular aspect of the IR problem domain? For example, suppose the search engine were identical for a number of sites, and only the interface to the engine were changed. What could we learn about IR interface design from such an experiment?
We then considered what the Mira Working Group could do in respect of TREC. Various ideas emerged from our discussions:
We noted that TREC has at least another three years to run, and most of us felt that we would like to become involved in TREC. In some cases, lack of resources, both human and computer, was a barrier to involvement in TREC. We were reminded that there are a number of ways of being involved: A-type full participation, B-type participation using smaller datasets, and the various tracks. Perhaps we in Mira, together with the CEC, should explore establishing a European-based TREC resource to enable greater participation by European research groups in TREC.
We discussed problems arising from the dynamic nature of information seeking. An information need is dynamic. Assessments of document relevance are dynamic. A searcher's knowledge of an information system and of effective tactics for using it is dynamic.
To better understand these elements of information-seeking, we decided that it is important to consider the whole task. So, for example, finding information in the British Highway Code [the given scenario for this session] is only one, perhaps relatively small, task for which the Code might be used. To explore this, we considered how the Code might be presented so that learners better acquire the knowledge they need for passing their driving test. This change in perspective, from that of searching to that of skill acquisition, opened up a range of possibilities and problems for how IR might apply to the Code.
One major difficulty in experimental evaluation is that changing the interface to a search engine is a gross change: many variables change at once. Thus, if differences are observed between the old and new interfaces, it can be very difficult to know why.
Nevertheless, the group concluded that considering the embedded nature of information-seeking, with all its complexity, seems to be the most fruitful direction for discovering how to improve the effectiveness of information-seeking environments. Thus, making gross comparisons may teach us rather a lot, though perhaps not the exact reasons for any differences that emerge.
The TREC experience has been very fruitful in some respects. For example, by recognising that induced information problems are necessary for repeatability of experiments, a very useful collection of such resources has been built up. However, TREC has two main shortcomings: firstly, it has focused on the retrieval system as the central concept in IR, and, secondly, it has proved to be primarily a competitive, rather than a collaborative, exercise.
It is time now to redress this balance. A parallel research community should be set up (led by MIRA participants?), focusing on interaction issues and user behaviour. It should make use of the TREC data, perhaps using small, limited subsets for different experiments. The long-term, co-operative aim of this community should be to draw general (i.e. cross-system) conclusions about IR as part of the overall process of information seeking. This requires agreement in advance on the goals, motivations and priorities of such work.
The group concentrated on two topics:
One of the problems of TREC is that it is strongly oriented to text and traditional database applications. It does not seem to meet well the challenges of multimedia and open network environments (like the Internet and especially the WWW). Some critical comments were also expressed on TREC's limited contribution to IR system design work. One of the problems is that the test topics are quite atypical. On the positive side, some new models and ideas have been introduced for the evaluation of interactive IR systems.
The group suggests that, for TREC-6, the theoretical and empirical issues of scalability (e.g. sampling and estimation) should be taken into more careful consideration. The focus of evaluations should also be changed. One possibility is to design experiments for fixed search engines, because major advances are not likely to take place on the engine side. An example experiment would focus on interface issues (e.g. visualisation of document spaces, query input technologies). The test collections should also be extended to cover hypertext documents (e.g. from the WWW).
The general approach of TREC, to run identical or closely similar experiments at all partner sites, was seen as a restricting and mechanical model for co-operation. A suggestion was made that TREC-6 should adopt a common goal under which a family of small-scale experiments could be designed. According to this model, each small-scale experiment could be conducted by a cluster of, say, three to six research groups. As a result, a larger set of hypotheses could be tested without losing the possibility of comparing the results from parallel experiments.

