Kirsten Swearingen
Prof. Rashmi Sinha, IS 271
December 15, 2000
This study tested recommendation systems for books and movies, comparing the usefulness of recommendations provided by automated systems to those provided by human recommenders. It also looked at usability factors in user satisfaction. The results showed that systems requiring more information from users fared better on a range of satisfaction indicators, while interface factors had little effect. Friends had greater overall success in recommending items, with some significant differences between the book and movie domains. The results suggest that a recommendation system should be designed for a particular domain, that users with varying levels of experience in the domain will have different requirements, and that one should not necessarily focus on minimizing the input from users: the quality of the recommendations is of primary importance in attaining high levels of user satisfaction.
For my final Master's Degree Project, I intend to develop a book recommendation system for elementary-school children. I undertook this statistics experiment with the hope of studying existing recommendation systems to identify their best practices, as well as their problems, and apply what I learn to the system I develop. E-commerce sites such as Amazon and Reel Video use automated recommendation systems to help their customers who are shopping without a particular item in mind. Other sites, such as Sleeper and MovieCritic, exist solely for the purpose of providing the recommendation service, with no retail component. These systems vary in their interface designs, the amount of personal information they require from users, the number of ratings needed to generate a list of recommendations, and the number of results they return. Most significantly, these systems employ different collaborative filtering algorithms, which result in recommendation sets of varying quality and usefulness. In this study, I compared the usability and usefulness of 6 book and movie recommender systems in order to test the following hypotheses:
A. Recommendation systems that require the least input from the user while providing useful recommendations (according to the user) will be rated the most satisfying, across a range of measures, regardless of the site's interface design.
B. Human recommenders will consistently outperform automated systems.
I studied the following sites:
| Books | Movies |
| Amazon's Recommendations Wizard | Amazon's Recommendations Wizard |
| Sleeper | MovieCritic |
| RatingZone QuickPicks | Reel Video - Movie Matches |
The sites were selected based upon the differences in interfaces, number of ratings required, and results displays.
Experiment participants were individuals from within and outside of SIMS who had expressed some interest in using a recommender system to receive entertainment suggestions. They were given the choice to explore books or movies. Ten subjects participated in the experiment, 5 for books, and 5 for movies.
For each of the three book/movie recommendation systems (presented in random order), each subject completed the following tasks:
The second part of the experiment involved the human recommenders. Participants supplied e-mail addresses for three friends familiar enough with their taste in books or movies to be able to recommend 3 items. I placed constraints upon the friends' recommendations, asking them to name only books or movies that the participant had not read/seen or discussed reading/seeing. In my analysis I intended to focus primarily on those items not already read or seen. Because the recommendation set would be so much smaller than those provided by the automated systems (only 3 items, as contrasted with 8, 15, and 20), I wanted to limit the possibility that all 3 friend suggestions would fall in the read/seen category.
For each item recommended by the friends, participants reviewed a description of the plot and a cover image (if one was available), then indicated whether they had heard of the item and whether they would be interested in reading or viewing it.
After reviewing the 3 automated and 3 friend recommendation sets, participants answered a 3-question post-test interview.
The independent and dependent variables for this experiment appear in the tables below.
| Independent Variables |
| Content type (books vs. movies) |
| User input: |
| System recommendation output: |
| Page design (layout, color scheme) |
| Level of detail in user instructions |
| User experience with the different systems |

| Dependent Variables |
| Satisfaction with the experience |
| Time spent registering to use the system |
| Time spent looking at recommendations until finding one that seems like a good fit. |
I decided I could not evaluate the algorithms used to produce recommendations because they are invisible to the user and usually held as proprietary secrets. For similar reasons, I did not examine the degree to which data for informing the recommender is collected implicitly, via click-throughs, shopping cart use, searches, and purchases.
I collected the following information through observation, a written questionnaire and a brief post-test interview.
1. Amount of information required from user:
Objective:
- Number of ratings required
- Number of pieces of personal information required
Subjective:
- Feelings about number of ratings required in order to get recs
- Feelings about amount of personal information required
2. Quality of recommendations returned:
Objective:
- Number of results returned
- Number of items already read/viewed
  - Liked
  - Not liked
- Number of items heard of
  - Interested in
  - Not interested in
- Number of items not heard of
Subjective:
- Feelings about number of results returned (not enough/just right/too much/no opinion)
- Feelings about amount of detail provided with results (same as above)
- Evaluation of usefulness (not useful, useful, very useful, no opinion)
- Would subject use system again? (yes/no/not sure)
- Would subject recommend to friend/family? (yes/no/not sure)
3. Evaluation of interface:
Subjective only:
- Ease of use (very difficult/difficult/neither easy nor difficult/easy/very easy)
- Impact of the page design (positive/negative/no impact)
  - Instructions for use
  - Page layout
  - Navigation
  - Graphics
  - Color
4. Time required
Objective only:
- Time to complete registration/initial ratings
- Time to locate at least one "useful" recommendation
Confounding Variables
I controlled for the following variables:
| Name of Variable | Possible Effect | Method of Control |
| 1. Historical data on clicks and purchases | Might provide supplemental information to inform the system's recommendations | Create bogus user name and password for each test subject. |
| 2. Level of familiarity with particular recommendation system | May affect user's expectations and satisfaction levels | Pre-screening questions re experience with recommender systems. |
| 3. Experience with other recommendation systems during course of experiment | Depending upon which system is used first, a user may have higher expectations | Randomize order in which recommendation systems are tested. Ask questions immediately after each test. |
| 4. Server speed | May affect user satisfaction with system if extremely slow displaying results. | Standardize workstation setup: use machines in SIMS Lab, all with same connection speed. |
| 5. Different user needs and experience with domain | Different users will have different standards for what they consider to be a "useful" recommendation. Therefore the time required may vary widely | Operationalize the definition of "useful" to mean items that the user a) has not read/seen and b) would be interested in reading/seeing |
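The third control in the table above, randomizing the order in which systems are tested, can be sketched as follows. This is only an illustration: the report does not say how the order was actually randomized, and the `presentation_order` helper and its `seed` parameter are hypothetical. The system names are taken from the site table.

```python
import random

# System names from the study's site table; the helper below is hypothetical.
BOOK_SYSTEMS = ["Amazon", "Sleeper", "RatingZone"]
MOVIE_SYSTEMS = ["Amazon", "MovieCritic", "Reel Video"]


def presentation_order(systems, seed=None):
    """Return the systems in a random order, leaving the input list untouched.

    A seed makes the order reproducible, which helps when the order shown
    to each subject has to be recorded alongside their answers.
    """
    rng = random.Random(seed)
    return rng.sample(systems, k=len(systems))


if __name__ == "__main__":
    # One independent shuffle per subject (subject numbers are placeholders).
    for subject in range(1, 6):
        print(subject, presentation_order(BOOK_SYSTEMS, seed=subject))
```

Drawing a fresh order for each subject spreads any "first system sets expectations" effect across the systems rather than attaching it to one of them.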
I conducted a pilot test with one subject. From this test, I drew several conclusions:
1. Each subject should evaluate systems within a single domain, either books or movies. The pilot subject indicated that her standards for books and movies were disparate enough to skew her evaluation negatively towards the movie recommenders; she expected more from them and so was more disappointed by the results.
2. The form for collecting the count of useful recommendations needed to be refined to be more usable. I developed the following 2 by 2 grid for capturing the data:
|  | Heard Of | Not Heard Of |
| Interested In |  |  |
| Not Interested In |  |  |
3. The overall system evaluation questions, which were open-ended during the pilot, needed to be carefully structured to be usable in the analysis. Though the ordinal answer scale I developed made it intuitively easy for the participants to respond, my failure to use an interval or ratio scale unfortunately made certain types of analysis impossible. Had I run a second pilot test or attempted to analyze the data from the first test, I might have realized this flaw in time to modify the experiment design.
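The 2 by 2 grid and the operational definition of "useful" (not already read/seen, and interested) can be expressed as a small tally. This is a minimal sketch, not the form actually used in the study: the `Judgment` record and its field names are hypothetical, and the choice of denominator for the percentage (all items returned) is an assumption, since the report does not state it.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One participant's judgment of one recommended item (hypothetical fields)."""
    already_read_or_seen: bool
    heard_of: bool
    interested: bool


def tally_grid(judgments):
    """Count items NOT already read/seen into the heard-of x interested grid."""
    grid = {
        (row, col): 0
        for row in ("heard of", "not heard of")
        for col in ("interested", "not interested")
    }
    for j in judgments:
        if j.already_read_or_seen:
            continue  # read/seen items are tracked separately (liked / not liked)
        row = "heard of" if j.heard_of else "not heard of"
        col = "interested" if j.interested else "not interested"
        grid[(row, col)] += 1
    return grid


def percent_useful(judgments):
    """Useful = not already read/seen AND interested.

    Assumption: the percentage is taken over all items returned.
    """
    if not judgments:
        return 0.0
    useful = sum(1 for j in judgments if not j.already_read_or_seen and j.interested)
    return 100.0 * useful / len(judgments)
```

With this definition, an item the participant has already read and liked does not count as useful, which is why the read/seen items are reported separately in the results.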
The following section presents a series of bar, pie and error charts comparing different aspects of the various systems. (The small number of subjects made it impractical to attempt more sophisticated analyses such as linear regression or ANOVA.) The first chart illustrates the overall performance of the 5 systems and the "friends system."

Graph 1. Average Percent Useful Recommendations
Friend recommendations performed the best overall, on average, followed by Reel Video, Sleeper and Amazon. However, the standard error bars indicate an extremely wide range of success rates at Reel.

Graph 2. Error Bars for Overall System Performance
Furthermore, as shown above in Graph 2, the only significant difference is between the friends and the automated systems; no two automated systems differ significantly from each other. See Graph 3 below for the system/friend comparison graph.

Graph 3. System vs. Friends: Average Percent of Useful Recommendations - Books and Movies
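The averages and standard error bars discussed above can be computed per system from the participants' percent-useful figures. The sketch below assumes the usual definition (sample standard deviation divided by the square root of the number of participants); the report does not say which software or formula produced its charts, and the numbers in the example are placeholders invented for illustration, not study data.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for a list of numbers.

    Uses the sample standard deviation (n - 1 in the denominator), so at
    least two values are required.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


if __name__ == "__main__":
    # Placeholder percentages for five participants; NOT the study's data.
    placeholder = [10.0, 20.0, 30.0, 40.0, 50.0]
    mean, se = mean_and_standard_error(placeholder)
    print(f"mean = {mean:.1f}, standard error = {se:.2f}")
```

With only five participants per domain, the standard error stays large unless the participants agree closely, which is consistent with the wide bars reported for Reel.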
In attempting to explain the large differential between the systems and the friends, I considered whether there might be some individuals for whom it is simply more difficult to recommend items. For book recommendations, this was not the case. Friends were able to provide useful recommendations significantly more often than the automated systems. (See Graph 4 below.)

Graph 4. Books: Within-Subjects Comparison
Graph 5. Movies: Within-Subjects Comparison
Movies presented a more mixed set of results. There were 2 individuals who had low percentages of useful recommendations across the board, from both systems and friends. Two other individuals had more success getting useful recommendations from the systems than from their friends. This supports a theory that books and movies are two very different domains for providing recommendations. The graph below shows the overall differences between the book and movie recommendations.

Graph 6. System vs. Friends: Average Percent of Useful Recommendations - Books, Movies
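The within-subjects comparisons in Graphs 4 and 5 amount to looking, participant by participant, at the difference between the friends' and the systems' percent of useful recommendations. A minimal sketch follows, assuming each participant has one averaged figure for systems and one for friends; the function names and the participant labels are hypothetical, and the example values are invented placeholders rather than study results.

```python
def paired_differences(system_pct, friend_pct):
    """Friend minus system percent useful, keyed by participant."""
    if system_pct.keys() != friend_pct.keys():
        raise ValueError("both inputs must cover the same participants")
    return {p: friend_pct[p] - system_pct[p] for p in system_pct}


def summarize(diffs):
    """Count how many participants did better with friends, systems, or neither."""
    return {
        "friends_better": sum(1 for d in diffs.values() if d > 0),
        "systems_better": sum(1 for d in diffs.values() if d < 0),
        "ties": sum(1 for d in diffs.values() if d == 0),
    }


if __name__ == "__main__":
    # Placeholder values for three participants; NOT the study's data.
    systems = {"P1": 20, "P2": 30, "P3": 10}
    friends = {"P1": 67, "P2": 33, "P3": 0}
    diffs = paired_differences(systems, friends)
    print(diffs)
    print(summarize(diffs))
```

Counting the direction of each difference, rather than only averaging, keeps visible the cases described above where a participant did better with the systems than with friends.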
The graphs that follow provide a different perspective on the above comparisons, looking at each individual's useful recommendations from systems alone, excluding the friend information.


Graph 7. Individual Useful Recs, by System
RatingZone provided the fewest useful recommendations: 2 participants received no useful recommendations at all. I believe this was because no plot summaries were provided, so subjects could not say whether they were interested in reading a book they had not already heard of. (See Graph 8, below, for more information.)

Graph 8. Was there enough description provided with recommended item to make a decision?

Graph 9. Books: Comparison of Percent Already Read, Read & Liked, and Useful Recommendations

Graph 10. Movies: Comparison of Percent Already Seen, Seen & Liked, and Useful Recommendations
I was curious to see whether there was a correlation between the items participants had already read or seen and enjoyed and the system's ability to provide useful recommendations. As it turned out, in the book domain, there was a correlation. Of the books Amazon and Sleeper listed that participants had already read, 100% fell into the "liked" category; these two systems had an average useful recommendation rate of about 30%. In contrast, at RatingZone, where fewer of the previously read books were liked by the users, significantly fewer items were rated useful. Again, this might be largely due to the fact that there was not enough supporting information accompanying the RatingZone recommendation set.
Graph 11. Books: Heard of vs Not Heard of
Graph 12. Movies: Heard of vs Not Heard of
In the book domain, only Sleeper provided more items that participants had not heard of and were interested in reading. The other systems and the friends tended to recommend books that participants had already heard of.
In the movie domain, Amazon and Reel performed better in the "not heard of" category while MovieCritic and Friends were better at suggesting movies the individuals had already heard of and were interested in trying.
In addition to usefulness, a second area I considered was time required per system; this was the most purely objective measure used in the study. I suspected that total time would correlate in some way with user satisfaction. In particular, I expected that systems that required more ratings or more time to find one useful recommendation would rate lower on a variety of satisfaction measures. However, this turned out not to be the case. MovieCritic, Amazon and Sleeper were all named as favorites in the post-test interviews while Reel and RatingZone, which required the least time, were not named.

Graph 13. Average Total Time, by System

Graph 14. Average Total Time, by Participant
For the most part, there was not a great deal of variation across the participants in terms of time required to register and get useful recs. The greatest variation occurred in the time required for individuals to generate the initial set of recommendations.
A third area of interest to me was the set of interface aspects that might impact satisfaction. Once again, the book and movie systems performed differently, with the book recommenders receiving on the whole positive or neutral evaluations, while the movie systems were either neutral or negative.

Graph 15. Impact of Interface Factors (1 = Positive, 0 = No Impact, -1 = Negative)

Graph 16. Ease of Use
Almost all systems were rated easy or very easy to use. It is interesting to see the range of "ease of use" ratings for Amazon. The system requires a user to enter their favorite author, artist, movie, and activity, then generates a list of 16 items. The user then checks the boxes by items most representative of what he or she would like to have recommended. Several participants found this second "refinement" step confusing. Additionally, in the words of one participant, "it was difficult to think of a 'favorite' movie."
I considered the possibility that ease of use and usefulness might be affected by the type of input the system required in order to generate its list of recommendations. Some systems asked the user to rate a pre-selected list of items, while others asked the user to enter a "favorite" or an item for which they wanted a "creative match." Systems using individually-generated initial rating sets did perform slightly better than the automatically-generated ones, but the difference was not significant.

Graph 17. Percent of Useful recs vs. Source of Initial List of Items to Rate
In addition to gathering information on the number of useful recommendations, I also asked participants to indicate, after testing each system, how useful they felt it was: very useful, useful, not useful, or no opinion. MovieCritic was the only system not rated "not useful" by any participants. Amazon, Sleeper, and Reel had responses in all categories except "no opinion," while RatingZone was evaluated only as "useful" or "not useful."

Graph 18. How Useful was System Overall? (Subjective Evaluation)
The final area of the study asked participants whether they would use the system in the future and whether they would recommend the system to friends or family members. MovieCritic and Sleeper fared the best in both categories. Amazon and Reel turned out to be fairly comparable, with their results balancing out to near 0, meaning "no opinion." The only clear loser was RatingZone: most would not use the system again, though a few said they would recommend it to a friend.

Graph 19. Comparison of Would Use System Again to Would Recommend System to Others?
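The "near 0" reading above comes from coding categorical answers as numbers and averaging them. A minimal sketch of that coding follows, assuming yes/not sure/no map to 1/0/-1 in the same way Graph 15 codes positive/no impact/negative; the exact coding used for Graph 19 is not stated in the report, and the `mean_score` and `distribution` helpers are hypothetical.

```python
from collections import Counter

# Assumed coding, mirroring the 1 / 0 / -1 scale given for Graph 15.
SCORES = {"yes": 1, "not sure": 0, "no": -1}


def mean_score(responses):
    """Average the coded responses; raises KeyError on an unknown answer."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(SCORES[r] for r in responses) / len(responses)


def distribution(responses):
    """Raw counts per answer, which the average alone cannot show."""
    return Counter(responses)


if __name__ == "__main__":
    # Placeholder answers; NOT the study's data.
    split = ["yes", "no", "yes", "no"]
    undecided = ["not sure", "not sure", "not sure", "not sure"]
    print(mean_score(split), dict(distribution(split)))
    print(mean_score(undecided), dict(distribution(undecided)))
```

Both placeholder groups average to 0, although one is evenly split and the other is uniformly undecided. This is one reason averages of ordinal codes are best read alongside the underlying counts, in line with the scale limitation noted after the pilot test.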
In a post-test interview, I asked each participant for their opinion as to which system (of the 3 automated and the 3 friends) provided the overall best set of recommendations. The results here were somewhat surprising:

Graph 20. Best Overall
Although numerically the friends provided the most useful recommendations, 50% of the participants felt that an automated system was superior. This might be partly explained by some of the participants' comments:
The system "suggested a number of things I hadn't heard of, interesting matches. Some guesses were even more out there than my friend's recommendations. More mainstream books are the least useful part of it. 90% of my friend's recs, I'll want to read but I already knew he wanted to read these. It's better to be stretched, stimulated w/new ideas."
The study results were limited in some important ways, as described below.
1. Name Recognition. All participants used Amazon and all had some familiarity with the system before the experiment. It is possible that Amazon's recommendations received more "interested in" valuations because of the brand name recognition. Similarly, the fact that a recommendation came from a friend naturally caused some participants to be favorably disposed from the outset.
Future study recommendation: Conduct blind testing of users, to isolate the recommended items from the interface so that the source of the recommendation will not be a factor in their level of interest.
Design suggestion: Consider incorporating a "reading circle" idea into the recommendation system. This would allow individuals to see others' "already read and liked" lists and their "to read" lists.
2. Friends are Apples, Systems are Oranges. The friends' recommendations might have been skewed positively due to the constraint I placed upon the recommenders, instructing them not to recommend items they knew the participant had already read or seen. My intention, as stated previously, was to compensate for the small size of the friend recommendation set: only 3 items, as contrasted with 8, 15, or 20 from the other systems. Unfortunately, this biased the evaluation in favor of the friend recommendations, as an automated system cannot easily learn what a person has or has not already read or seen. Interestingly, however, some friends did not or could not follow the "not previously read/seen" restriction: about 10% of the friends' recommendations were of books/movies already seen or read.
Another possible distortion of the friend recommendations arises from the fact that plot summaries of recommended items were lifted from Amazon. I did this in order to spare the friend participants from having to craft descriptions of the items, but this may have affected the outcome by including a system-generated piece of information in the friend scenario. Also, it was difficult to locate neutral descriptions of the items; most Amazon summaries had some sort of evaluative language. Those that were judgment-free tended to be too brief and lacking in detail to be useful.
Future study recommendation: Ask friends to describe the item. Also, do not explicitly limit the friends to items not already read/seen, to make the friend system more comparable to the automated ones.
Design suggestion: Consider incorporating quick and easy ways for a user to identify items they have already read even if the items are not used in the initial evaluation set or returned as recommendations.
3. Different Types of Readers and Moviegoers. Many of the comments made during the course of this experiment pointed to the importance of accounting for this component in my system design. For instance, one participant commented "MovieCritic asked too many questions-decisions about films are made pretty quickly-this system requires an investment of time and thought, best for a life-long movie watcher."
Future study recommendation: Gather more detailed information about reading and viewing patterns.
Design suggestion: Consider having two flavors of recommender interface: a "quick picks" option and a more time- and rating-intensive option for the serious readers and moviegoers.
4. Test Subjects Too Similar. Participants were almost all adults with 3-5 years of web experience; 90% were between 25 and 34 years of age, and the majority were SIMS students.
Future study recommendation: Target a larger, more diverse sample of participants so as to make the results generalizable. Use a pre-screening questionnaire to ensure a balance.
Appendices are available as a separate Word document (100 KB).
For more information, contact Kirsten Swearingen, kirstens@sims.berkeley.edu