Marti Hearst, Paul Laskowski, Luis Silva
CHI 2016
This Python script demonstrates our methods for evaluating visualizations using both the Question and Heuristic procedures, and for comparing the two metrics. While the techniques shown here are identical to those used in the paper, the data is simulated in order to protect the privacy of the participants.
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import scipy as sp
pylab.rcParams['figure.figsize'] = 16, 12
from pandas import *
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sys
print(sys.version)
This script was prepared using Python 3.4.3.
For the purposes of this demonstration, we simulate a dataset using a simple probabilistic model. We assume that there are three types of graphs - low, medium, and high quality. For each type, we specify a probability distribution over possible heuristic ratings, which are assumed to be independent of each other. Similarly, for each graph type, we specify a probability that each question is answered correctly and we assume that all questions are independent.
np.random.seed(202)
train = pd.DataFrame(columns=("Student_id", "rating", "correct",
                              "graph_id", "quality", "question_id",
                              "Question_type", "condition"), dtype=float)
# We assume that there are three qualities of graphs - low, medium, and high quality.
# For each quality, we define a probability distribution over the possible heuristic ratings
# between 1 and 5. We assume the distribution is the same for all heuristics
# and for all students. The five values below are the probabilities of assigning
# the scores of 1, 2, 3, 4, and 5, respectively, for truly low, medium, and high
# quality visualizations.
heuristic_dist = {"low": (.2, .3, .3, .1, .1),
                  "medium": (.1, .2, .4, .2, .1),
                  "high": (.1, .1, .3, .3, .2)}
# Similarly, for each type, we define a probability distribution that specifies whether
# a question is answered correctly or not. The first position is the probability of answering
# incorrectly and the second position is the likelihood of answering correctly.
# Low quality visualizations should make it harder to answer questions correctly than
# high quality visualizations.
question_dist = {"low": (.7, .3),
                 "medium": (.5, .5),
                 "high": (.3, .7)}
# Here we use these probability distributions to generate scores at random,
# using numpy's np.random.choice() function.
for graph_id in range(18):
    quality = ["low", "medium", "high"][graph_id % 3]
    # Add heuristic ratings for 4 students
    for Student_id in range(4):
        for question_id in range(3):
            # compute a heuristic score between 1 and 5
            rate = np.random.choice(5, p=heuristic_dist[quality]) + 1
            train.loc[len(train)] = [Student_id, rate, np.nan, graph_id, quality, question_id, "heuristic", "heuristic"]
    # Add question answers for 4 students
    for Student_id in range(4, 8):
        for question_id in range(3):
            # compute whether a question is answered correctly.
            # Note that we do not include an answer column here. In practice, we would use
            # such a column to record the user's answer, then compute the value of the
            # correct column by comparing the user's answer to the correct answer.
            correct = np.random.choice(2, p=question_dist[quality])
            train.loc[len(train)] = [Student_id, np.nan, correct, graph_id, quality, question_id, "multiple_choice", "question"]
train.head(15)
Record the number of graphs.
numgraphs = len(train.graph_id.unique())
numgraphs
We create a dataframe to hold the final heuristic score and the final question-based score for each graph. These are found by averaging the individual scores.
graph_sum = train.groupby(["graph_id"]).mean()[["rating", "correct"]]
graph_sum
Visualize the relationship with a scatterplot.
for xy in zip(graph_sum.rating, graph_sum.correct):
    print(xy)
ax = graph_sum.plot(kind='scatter', x='rating', y='correct')
for g, x, y in zip(graph_sum.index, graph_sum.rating, graph_sum.correct):
    ax.annotate(str(int(g)), xy=(x + .01, y))
plt.ylabel('Question-Based Score')
plt.xlabel('Heuristic-Based Score')
plt.title("Relationship between Heuristic Ratings and Question-Based Scores")
Check the correlation between the measures. Given the metric nature of our variables, we use Pearson's correlation, which is the default for pandas.Series.corr().
graph_sum.rating.corr(graph_sum.correct)
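If we also want a significance test to go with this correlation, scipy provides one. This is our own addition rather than part of the original analysis, and with only 18 graphs the p-value should be treated as a rough guide.
from scipy.stats import pearsonr
# Pearson's r together with a two-sided p-value for the null hypothesis of zero correlation.
pearsonr(graph_sum.rating, graph_sum.correct)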
To check for outliers in a systematic way, we employ Studentized residuals.
lm = smf.ols('correct ~ rating', data=graph_sum).fit()
lm.outlier_test()
In this simulated dataset, graph 8 is the only one with a statistically significant studentized residual (unadjusted). If this were a real dataset, we would examine this visualization in more detail, seeking to understand why the two metrics are in disagreement.
train[(train.graph_id == 8)].groupby(["condition","question_id"]).mean()
As an alternative to Studentized residuals, we can also identify outliers by fitting a Gaussian envelope to the dataset.
from sklearn.covariance import EllipticEnvelope
clf = EllipticEnvelope()
clf.fit(graph_sum[["rating","correct"]])
clf.mahalanobis(graph_sum[["rating","correct"]])
Under this measure, graph 8 is still noticeable, but both graphs 3 and 6 are more extreme outliers.
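As a rough way to put these distances on a common scale (our own addition, not part of the original analysis), note that under a bivariate Gaussian model squared Mahalanobis distances approximately follow a chi-square distribution with 2 degrees of freedom, so they can be converted to approximate tail probabilities. The calibration is only approximate, since EllipticEnvelope uses a robust covariance estimate.
from scipy.stats import chi2
# Squared Mahalanobis distances from the fitted envelope, converted to
# approximate upper-tail probabilities under a chi-square(2) reference.
d2 = clf.mahalanobis(graph_sum[["rating", "correct"]])
pd.Series(chi2.sf(d2, df=2), index=graph_sum.index).sort_values()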
The nature of our study means that we have two different measures for each visualization, but no authoritative "ground truth" to compare them to. Because of this, it is not possible to say which procedure is more accurate. We can, however, analyze which of them is more reliable. In other words, which procedure generates a ranking that is more consistent from one set of participants to another?
In part, reliability depends on $\sigma_m$, the standard deviation of the mean scores for each visualization. The more spread out the final scores are, the more confident we can be that we ranked them correctly. Even if a procedure has a high $\sigma_m$, however, it will not be very reliable if there is a lot of noise in scoring each design. The higher the average standard error for each final score, $\sigma_v$, the more the score will vary from one run of the study to another and the less confidence we can have in it. We therefore consider the ratio $\sigma_m / \sigma_v$ - how spread out the final scores are relative to the noise in each one. To estimate this ratio consistently, we use the statistic $g=\sqrt{I/J}\,f$, where $I$ is the total number of scores, $J$ is the number of designs, and $f$ is Cohen's $f$, the ratio of $\sigma_m$ to the within-design standard deviation $\sigma$. $f^2$ is sometimes called the signal-to-noise ratio, and we estimate it as the treatment sum of squares over the error sum of squares.
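Before walking through this computation separately for the two procedures, here is a compact sketch of the whole calculation as a reusable helper. The function name reliability_g and its signature are our own illustration, not part of the original script; applied to the aggregated score tables built below, it should reproduce the step-by-step results.
def reliability_g(df, group_col, score_col):
    # One row of df per (design, participant) score; group_col identifies the design.
    I = len(df)                                 # total number of scores
    J = df[group_col].nunique()                 # number of designs
    grand_mean = df[score_col].mean()
    group_means = df.groupby(group_col)[score_col].transform("mean")
    ss_treatment = ((group_means - grand_mean) ** 2).sum()
    ss_error = ((df[score_col] - group_means) ** 2).sum()
    cohen_f = (ss_treatment / ss_error) ** 0.5  # Cohen's f
    return (I / J) ** 0.5 * cohen_f             # g = sqrt(I/J) * f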
Get a dataframe with just the questions to work on.
trainmc = train[train["Question_type"] == "multiple_choice"]
Since the answers given by a single student could be correlated with each other, we do not want to base our estimates of variance on each individual question. Instead, we first aggregate the three questions answered by each student for each graph, taking the average number of correct responses.
trainmcagg = trainmc.groupby(["graph_id","Student_id"]).mean()["correct"].reset_index()
trainmcagg.head(10)
Check how many total scores we have. A score is the fraction of questions one student gets right for one graph.
numqscores = len(trainmcagg)
numqscores
Finally, we get the average proportion of correct responses for each graph_id.
mcorrect = trainmcagg.groupby(["graph_id"]).mean()["correct"]
mcorrect
Insert these values back into the trainmcagg dataframe.
trainmcagg["mcorrect"] = trainmcagg.graph_id.map(mcorrect)
Each row now shows the fraction of questions one student answered correctly for a graph, as well as the mean proportion correct for that graph.
trainmcagg.head(10)
The grand mean is the average score over all student-graph combinations.
grandmean = trainmcagg.correct.mean()
grandmean
Also check the standard deviation.
trainmcagg.correct.std()
Compute the Mean Square Error
MSEqe = ((trainmcagg.correct - trainmcagg.mcorrect)**2).sum()/(numqscores - numgraphs)
MSEqe
Compute the Mean Square Treatment.
MSTreatmentqe = ((trainmcagg.mcorrect - grandmean)**2).sum()/(numgraphs-1)
MSTreatmentqe
Compute the Mean Square Total, either from the variance or directly from the formula.
MSTotalqe = ((trainmcagg.correct - grandmean)**2).sum()/(numqscores-1)
print(MSTotalqe, trainmcagg.correct.var())
As a check to see if everything is done correctly, we confirm that SST = SSTreatment + SSE
SSTotalqe = MSTotalqe * (numqscores -1)
SSTreatmentqe = MSTreatmentqe * (numgraphs - 1)
SSErrorqe = MSEqe * (numqscores - numgraphs)
print("SSTotal = ", SSTotalqe)
print("SSTreatment + SSError =", SSTreatmentqe + SSErrorqe)
We employ Cohen's $f$ as an estimator for the ratio $\sigma_m / \sigma$. We compute Cohen's $f$ as follows:
Cohen_f_qe = (SSTreatmentqe / SSErrorqe)**.5
Cohen_f_qe
Using this result, we compute reliability.
reliability_qe = (numqscores / numgraphs)**.5 * Cohen_f_qe
reliability_qe
As an alternate method, we can first estimate the standard error in the final score of each graph, $\sigma_v$. We define $\sigma_v^2$ as a weighted average of the variance of each graph's final score: $\sigma_v^2 = \frac{1}{I}\sum_j I_j \sigma_{v_j}^2$. Since we're averaging $I_j$ ratings together to get the final score for graph $j$, we have $\sigma_{v_j}^2 = \sigma^2 / I_j$. Then $\sigma_v^2 = \frac{1}{I}\sum_j \sigma^2 = \frac{J}{I} \sigma^2$. We estimate $\sigma^2$ by MSE.
error_v_qe = ( numgraphs / numqscores)**.5 * MSEqe ** .5
error_v_qe
Next, we separately estimate $\sigma_m$, the standard deviation of the true mean scores for each graph.
sigma_m_qe = (SSTreatmentqe / numqscores)**.5
sigma_m_qe
We can finally find the reliability by dividing the two:
reliability_qe = sigma_m_qe / error_v_qe
reliability_qe
Unlike the derivation based on Cohen's $f$, this method includes a correction for degrees of freedom. This is an attractive feature, though both estimators remain biased (but consistent).
To get a sense of how big a reliability statistic of 1.37 is, we turn to a familiar example. Suppose we were to use the question procedure to give each visualization a grade of A, B, C, and so forth. Suppose further that one letter grade corresponds to one standard deviation, a common grading guideline. This implies that we scale the scores so that $\sigma_m = 10$ on a 100-point grading scale. We can then compute $\sigma_v$ to be:
scaled_sigma_v = 10 / reliability_qe
scaled_sigma_v
We now want the probability that a given score is more than 10 points away from its true mean value. We approximate the sampling distribution of the score as a normal curve and find the area under the curve more than 10 points above the mean, and less than 10 points under the mean:
import scipy.stats  # makes sp.stats available (sp is scipy, imported above)
2 * sp.stats.norm(0, scaled_sigma_v).cdf(-10)
In this case, for our simulated data, the chance of being a full letter grade off is about 17%.
The analysis for the heuristic procedure is similar to that for the questions.
Get a dataframe with just the heuristics to work on.
trainhe = train[train["Question_type"] == "heuristic"]
trainhe.head(15)
As before, we aggregate by student, taking the mean of the three ratings.
trainheagg = trainhe.groupby(["graph_id","Student_id"]).mean()["rating"].reset_index()
trainheagg.head()
Get the number of rating scores, which could potentially differ from the number of question scores above.
numratings = len(trainheagg)
numratings
Get the mean for each graph.
mrating = trainheagg.groupby(["graph_id"]).mean()["rating"]
mrating
Insert these values back into the dataframe.
trainheagg["mrating"] = trainheagg.graph_id.map(mrating)
Confirm that each row of trainheagg now has the rating given by one student, as well as the mean rating for the given graph.
trainheagg.head(10)
Get the grand mean - the average over all ratings.
grandmeanrating = trainheagg.rating.mean()
grandmeanrating
Also check the standard deviation of the ratings
trainheagg.rating.std()
Compute the Mean Square Error
MSEhe = ((trainheagg.rating - trainheagg.mrating)**2).sum()/(numratings - numgraphs)
MSEhe
Compute the Mean Square Treatment
MSTreatmenthe = ((trainheagg.mrating - grandmeanrating)**2).sum()/(numgraphs-1)
MSTreatmenthe
Compute the Mean Square Total, either from the variance or directly from the formula.
MSTotalhe = ((trainheagg.rating - grandmeanrating)**2).sum()/(numratings-1)
print(MSTotalhe, trainheagg.rating.var())
As a check, confirm that SST = SSTreatment + SSE.
SSTotalhe = MSTotalhe * (numratings -1)
SSTreatmenthe = MSTreatmenthe * (numgraphs - 1)
SSErrorhe = MSEhe * (numratings - numgraphs)
print("SSTotal = ", SSTotalhe)
print("SSTreatment + SSError =", SSTreatmenthe + SSErrorhe)
Compute Cohen's $f$.
Cohen_f_he = (SSTreatmenthe / SSErrorhe)**.5
Cohen_f_he
Use Cohen's $f$ to compute reliability
reliability_he = (numratings / numgraphs)**.5 * Cohen_f_he
reliability_he
As an alternate method, we first estimate the standard error in the final score of each graph, $\sigma_v$.
error_v_he = ( numgraphs / numratings)**.5 * MSEhe ** .5
error_v_he
And the standard deviation of the true mean scores for each graph.
sigma_m_he = (SSTreatmenthe / numratings)**.5
sigma_m_he
We can finally find the reliability by dividing the two
reliability_he = sigma_m_he / error_v_he
reliability_he
As before, the result differs slightly from the Cohen's $f$ version because this estimator includes a correction for degrees of freedom.
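To put this on the same letter-grade scale used for the question procedure above, we can repeat that calculation for the heuristic reliability. This reuses the scaling already introduced in this script and adds no new assumptions; comparing the resulting probability with the one obtained for the question procedure gives a concrete sense of which metric ranks the graphs more consistently.
# Scale so that one standard deviation of the true means equals one letter grade
# (10 points on a 100-point scale), then find the chance that a graph's heuristic
# score lands more than a full letter grade from its true mean.
scaled_sigma_v_he = 10 / reliability_he
2 * sp.stats.norm(0, scaled_sigma_v_he).cdf(-10)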