
Filtering and Collaborative Filtering

Notes from the DELOS workshop, Budapest, November 1997

By Jacob Palme, e-mail jpalme@dsv.su.se, at the research group for CMC (Computer Mediated Communication), within the K2Lab laboratory at the DSV university department. Last revision: 13 November 1997.

Summary: Filtering means tools that help people cope with information overload by selecting the most valuable and interesting documents for each user. Social or collaborative filtering is filtering based on the evaluations of documents made by other people. Intelligent filtering is the use of AI tools to derive filtering rules automatically by looking at the documents a person selects. Filters use and/or produce ratings. A rating is an evaluation (rating label) of a document on one or more rating scales. The quality of a filter is measured by comparing the filter's prediction of a particular user's rating of a document with the rating which the user actually gives. The common information retrieval measures precision and recall can also be used. Filtering is usually based on some kind of computation over various attributes of a document, such as keywords, spelling correctness, length, and ratings put on the document by other people in the same peer group as the person for whom filtering is done.
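As a minimal sketch of the quality measures named in the summary, the Python below computes precision and recall for a set of selected documents, and the average gap between predicted and actual ratings. The function names, the document identifiers and the choice of mean absolute error are assumptions made for illustration; the notes do not say which error measure any speaker used.

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected documents.

    selected: documents the filter passed on to the user
    relevant: documents the user actually found valuable
    """
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual ratings.

    Both arguments map document id -> rating; only documents present
    in both mappings are compared.
    """
    common = predicted.keys() & actual.keys()
    if not common:
        raise ValueError("no documents rated in both mappings")
    return sum(abs(predicted[d] - actual[d]) for d in common) / len(common)
```

Precision tells how much of what the filter let through was worth reading; recall tells how much of the worthwhile material the filter let through.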

In this workshop, a number of researchers in the area of filtering and collaborative filtering talked about their research, experience and ideas.


Introduction

This workshop was arranged for an Esprit project (funded by the European Union, EU) on digital libraries, by SZTAKI in Budapest, a research institute of the Hungarian Academy of Sciences. It was also a starting point for two new EU-funded projects, EuroGather and SELECT, in the area of filtering and collaborative filtering.

These are my very personal notes; they do not include everything said at the workshop.


The GroupLens Research Project: Scalable Collaborative Filtering for the Internet

Joe Konstan comes from the University of Minnesota and was an invited lecturer, talking about the GroupLens collaborative filtering system.

Information filtering is not new, he said. Filtering is as old as human society. The change today is that we have too much information, which creates new needs for filtering. Collaborative filtering builds on the old practice of asking friends and consultants to help find the most valuable information.

Basics of GroupLens:

Ratings
measures of interest
Correlations
measures of similar interest between users in a domain
Predictions
predicted interest a user will find in an item

Area of filtering: Usenet news

The system is a newsreader with rating buttons added to its user interface.

Important: Research should be done on a moderated newsgroup, because it is too easy to get good results in a newsgroup full of spam and trash.

The aim is not only to find people who agree with you, but also to find people who consistently disagree with you, and to use their ratings, inverted, to aid your choice.
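The three basics above (ratings, correlations, predictions) can be sketched in a few lines of Python. This is an illustration under assumptions, not the GroupLens code: it uses the Pearson correlation over co-rated items and a mean-offset weighting, and all function names and the data layout are made up here. Because each neighbour's opinion is multiplied by the correlation, a user who consistently disagrees with you contributes an inverted opinion.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two users over co-rated items.

    a, b: dicts mapping item id -> rating. Returns 0.0 when the users
    share fewer than two items or either has no rating variance.
    """
    common = sorted(a.keys() & b.keys())
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def predict(user, others, item):
    """Predict the user's rating of item from other users' ratings.

    Each neighbour's deviation from their own mean is weighted by their
    correlation with the user, so a negative correlation flips the
    neighbour's opinion.
    """
    user_mean = sum(user.values()) / len(user)
    num = den = 0.0
    for other in others:
        if item not in other:
            continue
        w = pearson(user, other)
        other_mean = sum(other.values()) / len(other)
        num += w * (other[item] - other_mean)
        den += abs(w)
    return user_mean if den == 0 else user_mean + num / den
```

With one neighbour who always agrees and rates a new article high, and one who always disagrees and rates it low, both push the prediction in the same direction: above the user's own average.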

What is the value to users of getting better filtering? This is very difficult to find out.

Important for reasonable implementations: Partition users into interest groups, find neighbourhood groups (people with similar views) within interest groups.

Evaluation criteria: Correlation between the prediction made by the system and the rating which the individual made after reading the article. I asked: Is there not a problem with bias if you see the prediction before you make your rating, and a problem that rating will cause you to see only the best documents, so that this skewed selection distorts your rating scales? Answer: No, we studied this and it was not a problem.

Time-spent-reading is a very good surrogate for explicit ratings. There is no correlation at all between length and time-spent-reading! (Over 300 seconds was assumed to mean "you walked away", when estimating time-spent-reading.)

Combining time-spent-reading with explicit ratings gives the best results.
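A minimal sketch of using time-spent-reading as a rating, in Python. Only the 300-second "walked away" cutoff comes from the talk. The linear mapping onto a 1 to 5 scale, the equal-weight blend with an explicit rating, and all names are assumptions for illustration; the notes do not say how GroupLens combined the two signals.

```python
WALKED_AWAY_SECONDS = 300  # cutoff quoted in the talk


def implicit_rating(seconds, scale_max=5):
    """Map reading time to a 1..scale_max rating, or None if unusable.

    Assumption: the rating grows linearly with time up to the cutoff.
    """
    if seconds < 0 or seconds > WALKED_AWAY_SECONDS:
        return None  # invalid, or the reader probably walked away
    return 1 + (scale_max - 1) * seconds / WALKED_AWAY_SECONDS


def combined_rating(seconds, explicit=None, explicit_weight=0.5):
    """Blend an explicit rating with the time-based one when both exist."""
    implicit = implicit_rating(seconds)
    if implicit is None:
        return explicit
    if explicit is None:
        return implicit
    return explicit_weight * explicit + (1 - explicit_weight) * implicit
```

A reading time above the cutoff is discarded rather than treated as strong interest, so such an article is rated only if the user also gave an explicit rating.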

Alternatives to collaborative rating: Spelling accuracy, amount of quoted text, fame of author could be used to predict rating.

Dynamic selection of the best filtering algorithm based on fluctuations and changes in the information base being rated.

Question from Rob Procter: Did you study the correlation between posting behaviour and rating? Answer: No, we did not study that. It would violate the privacy of users.

Commercial experience: Commercial providers and users want a filtering system which picks the N most recommended items. For example, when you go into a bookstore, it shows you the 8 books you should look at. This is not something we had envisaged before commercialisation.
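Picking the N most recommended items from a set of predicted ratings is a small operation. The Python sketch below uses the standard library heapq module; the function name and the tie-breaking rule are assumptions for illustration.

```python
import heapq


def top_n(predictions, n=8):
    """Return the n item ids with the highest predicted rating.

    predictions: dict mapping item id -> predicted rating.
    Ties are broken by item id so the result is deterministic.
    """
    ranked = heapq.nsmallest(n, predictions.items(),
                             key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked]
```

The default of 8 simply mirrors the bookstore example; any N works.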

One-to-one shopping: Give each customer personalised information about which products that customer might like to buy.

Software Prototype for Information Filtering and Rating using Evolutionary Algorithms

Peter Untersmayr, Guenter Ehrentraut, University of Vienna, Austria.

The prototype uses genetic (evolutionary) algorithms. It tries to extract, from highly rated e-mails, the keywords which form a good user profile. It also tries to find words from areas the user is interested in right now, and to remove old keywords from areas in which the user is no longer interested. This, too, is done using genetic algorithms.
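To make the idea concrete, here is a toy genetic algorithm in Python that evolves a keyword profile from rated mails. It is a sketch under assumptions, not the Vienna prototype: the fitness function, the bit-string encoding, the tournament selection and every parameter value are invented for illustration.

```python
import random


def fitness(profile, mails):
    """Score a keyword set: reward hits on liked mails, punish hits on
    disliked ones, and charge a small cost per keyword."""
    score = 0.0
    for words, liked in mails:
        if profile & words:
            score += 1.0 if liked else -1.0
    return score - 0.1 * len(profile)


def evolve(vocabulary, mails, generations=30, pop_size=20, seed=0):
    """Evolve a keyword profile; returns (best_profile, best_fitness)."""
    rng = random.Random(seed)
    vocab = sorted(vocabulary)

    def decode(bits):
        return frozenset(w for w, b in zip(vocab, bits) if b)

    def score(bits):
        return fitness(decode(bits), mails)

    # One bit per vocabulary word: True means the word is in the profile.
    population = [[rng.random() < 0.5 for _ in vocab] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        next_gen = [population[0][:]]  # elitism: keep the best unchanged
        while len(next_gen) < pop_size:
            # tournament selection of two parents
            mum = max(rng.sample(population, 3), key=score)
            dad = max(rng.sample(population, 3), key=score)
            # one-point crossover
            cut = rng.randrange(1, len(vocab)) if len(vocab) > 1 else 0
            child = mum[:cut] + dad[cut:]
            # bit-flip mutation adds or drops single keywords
            child = [(not b) if rng.random() < 0.05 else b for b in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=score)
    return decode(best), score(best)
```

Each mail is a pair of a word set and a flag saying whether the user rated it highly. Running the evolution again on recent mails only would be one way to let keywords from abandoned interests drop out, though the notes do not say how the prototype handled this.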


The Profile Editor: Designing a direct manipulative tool for assembling profiles

Patrick Baudisch, GMD, Germany.

The goal of this project was to develop a simple and easy-to-use user interface for specifying your interest profiles in interaction with the computer. The interface used rectangles to represent filtering rules, with height and width and position in a container. The size of the rectangle showed how many items in total could be selected by this rule. The position of the rectangle in the container indicated whether the user wanted much from this category. The user could also change the shape of the rectangle, high and narrow if all items are equal, and low and wide if there is a large variation between good and bad items.

In a rather complex way, the height, width and position of the rectangle thus represented the filtering cutoff rules. It was intended for use by non-expert users, but sounded much too complex to suit that category of users.

No evaluation and user test had yet been done.

Future work: Apply same methods to web search engines.


Implicit Rating and Filtering

Dave Nichols, Computing Department, Lancaster University, UK.

Implicit rating means gathering information from the user's behaviour: record the user's actions at the computer and infer ratings from these actions. It is related to the research area of "LIS - transaction log analysis". Advantage: We can more easily get substantially more ratings, and the predictions are nearly as good as predictions based on explicit numerical ratings.

When people are asked to rate both explicitly and implicitly, the correlation is high.

Sources of implicit ratings: purchase price, assess/rate, save, delete, refer/cite, reply, mark/bookmark, examine (time), consider (time), glimpse, associate, repeated use.
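A minimal Python sketch of inferring ratings from a log of such actions. The talk listed the action types but not how much each should count, so the weights below are hypothetical, as are the function name, the log format and the clamping to the range -1 to 1.

```python
# Hypothetical weights: the talk listed action types, not their values.
ACTION_WEIGHTS = {
    "purchase": 1.0,
    "save": 0.8,
    "bookmark": 0.7,
    "cite": 0.7,
    "reply": 0.6,
    "examine": 0.3,
    "glimpse": 0.1,
    "delete": -0.5,
}


def infer_ratings(log):
    """Turn a transaction log into per-(user, document) scores.

    log: iterable of (user, document, action) tuples. Unknown actions
    are ignored. Each score is the sum of weights, clamped to [-1, 1].
    """
    scores = {}
    for user, document, action in log:
        weight = ACTION_WEIGHTS.get(action)
        if weight is None:
            continue
        key = (user, document)
        scores[key] = scores.get(key, 0.0) + weight
    return {k: max(-1.0, min(1.0, v)) for k, v in scores.items()}
```

The inferred scores could then be fed to a collaborative filter in place of, or alongside, explicit ratings.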


An architecture for intelligent and collaborative filtering

Jacob Palme, Stockholm University and KTH.

My presentation aimed at defining an architecture for rating and filtering, such that different researchers could develop different modules and have them work together. Examples of modules:

My paper was mostly based on the document at URL:
http://dsv.su.se/jpalme/select/rating-choices.html

The overheads I used can be found at URL:
http://dsv.su.se/jpalme/select/filtering-ohs.pdf


Institutional Rating in Everyday Life

Peter Paul Sint: Socoec, Vienna.

Rating is common in everyday life, also outside of the computer area.

Example: Price, income, creditworthiness, entrance exams to universities, grading in schools and higher education, rating of courses and teachers, personnel evaluation, psychological testing, punishment rates in laws for different crimes, medals, prizes, place in "history".

Rating is used to reduce uncertainty in making decisions (organisational or commercial). Example: Product and service rating, project proposal rating, medical intervention rating, ratings of decision options, political decisions agenda.

Collection of ratings: Publishers, journalists, specialists, opinion leaders, peer groups, juries, panels, opinion polls, stratified samples, automatic collection of data on behaviour.


Application of a Generic Voting Tool for Rating Purposes

András Micsik, SZTAKI, Budapest.

He began by giving a description of the Web4Groups groupware system, and then described the voting subsystem of Web4Groups. He compared rating to voting and said that there are many similarities, but that rating would need different user interfaces, for example a two-stage process: Calculate the rating for each object and then sort the objects by their ratings.
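The two-stage process can be sketched in Python as follows. Taking the mean of the votes as the rating of an object is an assumption; the notes do not say how Web4Groups computes it, and the function name is made up for illustration.

```python
from collections import defaultdict


def rate_and_rank(votes):
    """Stage 1: average the votes per object. Stage 2: sort by average.

    votes: iterable of (object_id, score) pairs.
    Returns a list of (object_id, mean_score), best first.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for object_id, score in votes:
        totals[object_id][0] += score
        totals[object_id][1] += 1
    ratings = {obj: s / n for obj, (s, n) in totals.items()}
    return sorted(ratings.items(), key=lambda kv: (-kv[1], kv[0]))
```

An ordinary vote ends after the first stage with one winner; for rating, the sorted list itself is the result shown to the user.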


Social Filtering and Social Reality

Christopher Lueg, University of Zürich.

He argued for situated information filtering, that is, filtering adjusted to the situation of the user. For example, interest is not a stable property: interest grows while reading a document on a new topic.

Usenet News (with full international newsfeed):
300 000 servers
12 million users
500 000 articles per day
15 000 newsgroups

His system reduces the priority of a certain topic (Subject text) every time you pass it without reading it, and finally topics are completely removed (fade away). This kind of filter is time limited, and infers from user behaviour which topics the user likes or does not like. It does not work well for spam, because every spam message is a new message with a new topic.
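The fading mechanism can be sketched as a small Python class. The starting priority, the decay factor, the removal threshold and the rule that reading an article restores full priority are all assumptions for illustration; the notes give no numbers, and the time limit is not modelled here.

```python
class FadingTopics:
    """Per-subject priorities that decay when a subject is skipped."""

    def __init__(self, start=1.0, decay=0.5, threshold=0.2):
        self.start = start          # priority of a newly seen subject
        self.decay = decay          # multiplier applied on each skip
        self.threshold = threshold  # below this the subject is hidden
        self.priority = {}

    def skipped(self, subject):
        """The user passed the subject without reading it."""
        current = self.priority.get(subject, self.start)
        self.priority[subject] = current * self.decay

    def read(self, subject):
        """The user read an article: restore full priority (assumption)."""
        self.priority[subject] = self.start

    def visible(self, subject):
        """A subject never seen before is always visible."""
        return self.priority.get(subject, self.start) >= self.threshold
```

The last method shows why the approach fails for spam: a message with a subject line never seen before starts at full priority.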

Future work: Establish a network of recommenders who trust each other, and who exchange "kill file" information. This would protect us from spammers. But how can I trust a rater? Do we need a method of rating the raters? And how can we get a critical mass of recommendations? Could a market metaphor be used?

Remember: Participation in a newsgroup or mailing list is not only a method of getting information, it is a social activity, where I get to know people in a community and get friends who can help me.


Knowledge Pump: Community-Centered Collaborative Filtering

Damian Arregui, Manfred Dardanne, Xerox Research Centre Europe.

Step 1: Filter to the right domain (different users can put the same document into different domains).
Step 2: Collaborative filtering inside one domain, done by people interested in that domain.

You do not get any recommendations, unless you have yourself contributed reviews.

The community map is a hierarchical structure of domains, built and updated by the system administrator, based on user requirements.

Recommendation is a correlation-weighted sum of reviews: The higher the correlation of the reviewer, the higher weight on his reviews in making predictions for you.
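A minimal Python sketch of such a recommendation, including the rule that only contributors receive recommendations. It is not the Knowledge Pump code: the data layout and names are assumptions, the sum is normalised into a weighted average so that the result stays on the review scale, and reviewers with zero or negative correlation are simply skipped.

```python
def recommend(user, document, reviews, correlation, contributors):
    """Correlation-weighted average of other people's review scores.

    reviews: dict mapping (reviewer, document) -> score
    correlation: dict mapping (user, reviewer) -> value in [-1, 1]
    contributors: set of users who have written at least one review
    Returns None when the user has not contributed or nothing applies.
    """
    if user not in contributors:
        return None  # no recommendations without own reviews
    num = den = 0.0
    for (reviewer, doc), score in reviews.items():
        if doc != document or reviewer == user:
            continue
        w = correlation.get((user, reviewer), 0.0)
        if w <= 0:
            continue  # assumption: only positively correlated reviewers count
        num += w * score
        den += w
    return num / den if den else None
```

In the two-step scheme described above, the reviews passed in would be those made within a single domain.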

Architecture:

Document data base: The whole of WWW.

Client: Ordinary web browser with Java Applets creating additional GUI.

Server: Apache HTTP server, Java server (our own server with a socket open to the Java Applets in each client), mSQL-server, mSQL database.

Advice: Do not use Javascript.

Storage:
Profiles: nickname, real name, communities, advisors...
Communities: name, description, location in community tree
Documents: name, author...
Review: scores, communities, comments, date
Visits: document, user, date...
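The stored records could be modelled roughly as below. This Python sketch uses only the fields named in the list; the field types, the reviewer and document keys on a review, and the tree_path helper are assumptions, and the fields elided with "..." in the talk are left out.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Profile:
    nickname: str
    real_name: str
    communities: List[str] = field(default_factory=list)
    advisors: List[str] = field(default_factory=list)


@dataclass
class Community:
    name: str
    description: str
    parent: Optional[str] = None  # location in the community tree


@dataclass
class Document:
    name: str
    author: str


@dataclass
class Review:
    document: str   # assumed key: which document was reviewed
    reviewer: str   # assumed key: nickname of the reviewer
    scores: List[int]
    communities: List[str]
    comments: str
    reviewed_on: date


@dataclass
class Visit:
    document: str
    user: str
    visited_on: date


def tree_path(name, communities):
    """Return community names from the root of the tree down to name."""
    path = []
    while name is not None:
        path.append(name)
        name = communities[name].parent
    return list(reversed(path))
```

Storing only the parent of each community is enough to rebuild the whole hierarchy of domains.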


Lightweight Collaborations for Social Filtering on the Web

Rob Procter and Andy McKinlay, Edinburgh University.

Affordance: User interface which makes the potential for action visible..., example: GUI components and buttons.

Social affordance: "...making potential for social action/interaction visible...".

Presence: What other people are doing, who are doing it.

Shareable artifacts: Synchronously, asynchronously, the resources one is using.

Community artifacts: For example, well-defined borders, not very available on the web.

Sequential accountability: Question and answer pairs.

Distributional accountability: Co-occurrence in longer sequences.

Topical coherence: Maintaining sequence, consistent with topic.

Proxies and caches can be used to investigate individual behaviour. Traffic analysis can give data of value.

Privacy, security, presence, community

His lecture started a long and interesting discussion among participants about issues of privacy, secrecy, presence and community. In a society with perfect privacy, no one would know anything about anyone else, and all messages would be written under pseudonyms. But is this the kind of society people want? Is it not rather true that people like to see each other, know about each other and understand each other?


The end of symbolic immortality: a non-monetarian collaborative cooperation model in an Internet based groupware service

Roland Alton-Scheidl and Gernot Tscherteu

This speech presented the Web4Groups system and how ratings could be added to it. In particular, the speaker argued for a "social system" in which your right to give out ratings is determined by how good the ratings are which you get on your own documents.


A Visual Tagging Technique for Annotating Large-Volume Multimedia Databases

Konstantinos Chandrinos and others.

The speaker talked about a way of attaching explanations to graphics as annotations. When you see the graphic, you can choose to see small icons at particular objects in the graphic, and by clicking an icon you get to an explanation of that object. The icon is connected with markup showing which part of the screen image it refers to. This might be useful in scientific analysis of pictures (example: Images of old documents, images from distance viewing). One scientist would be able to see the annotations made by other scientists.

The speaker said that Java 1.1 could be used to implement this as an added feature to web browsers. The data base structure for storing the annotations is based on XML.
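A rough sketch of how an annotation tied to a region of an image could be stored as XML. The talk mentioned Java 1.1; Python is used here only for illustration. The element and attribute names, the rectangular region and all function names are made up, since the notes do not describe the actual markup.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Annotation:
    author: str
    text: str
    x: int       # left edge of the annotated region, in pixels
    y: int       # top edge
    width: int
    height: int

    def contains(self, px, py):
        """True if the point (px, py) lies inside the region."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def to_xml(image_name, annotations):
    """Serialise annotations; element and attribute names are made up."""
    root = ET.Element("image", name=image_name)
    for a in annotations:
        node = ET.SubElement(root, "annotation", author=a.author,
                             x=str(a.x), y=str(a.y),
                             width=str(a.width), height=str(a.height))
        node.text = a.text
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text):
    """Parse the format written by to_xml back into Annotation objects."""
    root = ET.fromstring(xml_text)
    return [Annotation(n.get("author"), n.text or "", int(n.get("x")),
                       int(n.get("y")), int(n.get("width")),
                       int(n.get("height")))
            for n in root.findall("annotation")]
```

Keeping the author with each annotation is what would let one scientist see which annotations were made by which colleague.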


Panel discussion

Q: What are the risks? Keeping people in small specialized user groups, not noting that the world is changing?

A: Maybe other filtering methods, like more and more smaller and more specialized newsgroups, will keep out new and valuable information more than collaborative filtering which will give you the most valuable from wider areas?

After that, the panel discussion seemed to be very much about the conflict between protection of personal privacy and the value of openness. Many social filtering systems use pseudonyms so that a rater cannot be identified from the ratings data base.


Appendix: Schedule for the Workshop

Monday 10 Nov 97

09.00

Strategic Meeting of the DELOS Working Group
(limited for only one member from each DELOS member institution)

11.00

Registration Desk is opened

12.00-14.00

lunch for the workshop attendees

14.00

László Kovács (MTA SZTAKI):
Welcome

14.15

Joe Konstan (University of Minnesota)(invited lecture):
The GroupLens Research Project: Scalable Collaborative Filtering for the Internet

15.15

Umberto Straccia:
Project Overview: EUROgatherer - A Personalized Information Gathering System

break

16.00

Peter Untersmayr, Guenter Ehrentraut:
Software Prototype for Information Filtering and Rating using Evolutionary Algorithms 

16.30

Patrick Baudisch:
The Profile Editor: Designing a direct manipulative tool for assembling profiles
[available in postscript format]

17.00

Joao Ferreira, Jose Luis Borbinha, Jose Delgado:
Using LDAP in a Filtering Service for a Digital Library
[available in postscript format]

17.30

Hui Guo:
SOAP: Live Recommendations through Social Agents
[available in postscript format]

19.00

Welcome Reception (with a dinner)

Tuesday 11 Nov 97

09.00

Dave Nichols:
Usage, Rating & Filtering
[available in postscript format]

09.30

Jacob Palme (Stockholm University and KTH)(invited lecture):
An architecture for intelligent and collaborative filtering

break

11.00

Peter Sint:
Institutional Rating in Everyday Life

11.30

András Micsik:
Application of a Generic Voting Tool for Rating Purposes

12.00-14.00

lunch

14.00

Christopher Lueg:
Social Filtering and Social Reality
[available in postscript format]

14.30

Damian Arregui, Manfred Dardanne (Xerox Research Centre Europe)(invited lecture):
Knowledge Pump: Community-centered Collaborative Filtering

break

16.00

Rob Procter, Andy McKinlay:
Lightweight Collaborations for Social Filtering on the Web
[available in text format]

16.30

Panel:
Damian Arregui, Manfred Dardanne, Joe Konstan, László Kovács (moderator), Jacob Palme

Wednesday 12 Nov 97

09.00

Erzsébet Csuhaj-Varjú:
A Language Theoretical Approach to Filtering and Cooperation

09.30

Reginald Ferber and Costas Tzeras:
The TREVI Project - Personalized Information Filtering, Linking, and Delivery for the News Domain
[available in text format and in HTML at http://www.darmstadt.gmd.de/~ferber/delos/trevi.html]

10.00

Emmanuel Nauer, Jacques Ducloy, Jean-Charles Lamire:
Using of multiple data source for information filtering: first approaches in the MedExplore project
[available in postscript format]

break

11.00

Roland ALTON-SCHEIDL, Gernot TSCHERTEU:
The end of symbolic immortality: a non-monetarian collaborative cooperation model in an Internet based groupware service

11.30

Konstantinos Chandrinos, John Immerkær, Martin Dörr, Panos Trahanias:
A Visual Tagging Technique for Annotating Large-Volume Multimedia Databases -
A tool for adding semantic value to improve information rating
[available in postscript format]

12.00-14.00

lunch
