Example Code for Visualization Assessment

Supplement to Evaluating Information Visualization via the Interplay of Heuristic Evaluation and Question-Based Scoring

Marti Hearst, Paul Laskowski, Luis Silva

CHI 2016

This Python script demonstrates our methods for evaluating visualizations using both the Question and Heuristic procedures, and for comparing the two metrics. While the techniques shown here are identical to those used in the paper, the data is simulated in order to protect the privacy of the participants.

Basic environment setup

In [16]:
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import scipy as sp
pylab.rcParams['figure.figsize'] = 16, 12
from pandas import *
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'   
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sys
print(sys.version)
3.3.5 |Anaconda 2.3.0 (x86_64)| (default, Sep  2 2014, 13:57:31) 
[GCC 4.2.1 (Apple Inc. build 5577)]

This script was prepared using the Python 3 version shown above (Anaconda distribution).

Simulation of Data Set

For the purposes of this demonstration, we simulate a dataset using a simple probabilistic model. We assume that there are three types of graphs - low, medium, and high quality. For each type, we specify a probability distribution over possible heuristic ratings, which are assumed to be independent of each other. Similarly, for each graph type, we specify a probability that each question is answered correctly and we assume that all questions are independent.

In [17]:
np.random.seed(202)

train = pd.DataFrame(columns = ("Student_id", "rating", "correct", 
                                "graph_id", "quality", "question_id", 
                                "Question_type", "condition"), dtype=float )

# We assume that there are three qualities of graphs - low, medium, and high.
# For each quality, we define a probability distribution over all possible heuristic ratings
# between 1 and 5.  We assume the distribution is the same for all heuristics
# and for all students.  The five values below are the probabilities of assigning
# the scores of 1, 2, 3, 4, and 5, respectively, for truly low, medium, and high
# quality visualizations.

heuristic_dist = {"low": (.2, .3, .3, .1, .1),
                  "medium": (.1, .2, .4, .2, .1),
                  "high": (.1,.1,.3,.3,.2)}

# Similarly, for each type, we define a probability distribution that specifies whether
# a question is answered correctly or not.  The first position is the probability of answering 
# incorrectly and the second position is the likelihood of answering correctly.  
# Low quality visualizations should make it harder to answer questions correctly than
# high quality visualizations.

question_dist = {"low": (.7, .3),
                 "medium": (.5, .5),
                 "high": (.3, .7)}

# Here we use these probability distributions to generate scores at random,
# according to the distributions, using NumPy's random.choice() function.

for graph_id in range(18):
    quality = ["low","medium","high"][graph_id % 3]
    # Add heuristic ratings for 4 students
    for Student_id in range(4):
        for question_id in range(3):
            # compute a heuristic score; choice() draws a value from 0-4, so add 1 for a 1-5 rating scale
            rate = np.random.choice(5, p=heuristic_dist[quality]) + 1
            train.loc[len(train)] = [Student_id, rate, np.nan, graph_id, quality, question_id, "heuristic", "heuristic"]
      
    # Add question answers for 4 students
    for Student_id in range(4, 8):
        for question_id in range(3):
            # compute whether a question is answered correctly.
            # Note that we record only correctness here.  In practice, we would also record
            # the user's actual answer, then compute the value of the correct column by
            # comparing that answer to the right one.
            correct = np.random.choice(2, p=question_dist[quality]) 
            train.loc[len(train)] = [Student_id, np.nan, correct, graph_id, quality, question_id, "multiple_choice", "question"]

                
train.head(15)
Out[17]:
Student_id rating correct graph_id quality question_id Question_type condition
0 0 2 NaN 0 low 0 heuristic heuristic
1 0 2 NaN 0 low 1 heuristic heuristic
2 0 5 NaN 0 low 2 heuristic heuristic
3 1 5 NaN 0 low 0 heuristic heuristic
4 1 5 NaN 0 low 1 heuristic heuristic
5 1 3 NaN 0 low 2 heuristic heuristic
6 2 3 NaN 0 low 0 heuristic heuristic
7 2 1 NaN 0 low 1 heuristic heuristic
8 2 4 NaN 0 low 2 heuristic heuristic
9 3 2 NaN 0 low 0 heuristic heuristic
10 3 5 NaN 0 low 1 heuristic heuristic
11 3 3 NaN 0 low 2 heuristic heuristic
12 4 NaN 0 0 low 0 multiple_choice question
13 4 NaN 0 0 low 1 multiple_choice question
14 4 NaN 0 0 low 2 multiple_choice question

Record the number of graphs.

In [18]:
numgraphs = len(train.graph_id.unique())
numgraphs
Out[18]:
18

Examining the Graphs using Both Metrics

We create a dataframe to hold the final heuristic score, and the final question-based score for each graph. These are found by averaging the individual scores.

In [19]:
graph_sum = train.groupby(["graph_id"]).mean()[["rating", "correct"]]
graph_sum
Out[19]:
rating correct
graph_id
0 3.333333 0.333333
1 2.750000 0.500000
2 3.333333 0.833333
3 2.416667 0.333333
4 3.250000 0.333333
5 3.000000 0.583333
6 2.500000 0.250000
7 3.416667 0.500000
8 3.666667 0.333333
9 2.666667 0.250000
10 3.250000 0.583333
11 3.416667 0.833333
12 3.083333 0.500000
13 2.750000 0.333333
14 3.333333 0.666667
15 2.916667 0.166667
16 3.166667 0.583333
17 3.416667 0.666667

Visualize the relationship with a scatterplot.

In [20]:
for xy in zip(graph_sum.rating, graph_sum.correct):
    print(xy)
(3.3333333333333335, 0.33333333333333331)
(2.75, 0.5)
(3.3333333333333335, 0.83333333333333337)
(2.4166666666666665, 0.33333333333333331)
(3.25, 0.33333333333333331)
(3.0, 0.58333333333333337)
(2.5, 0.25)
(3.4166666666666665, 0.5)
(3.6666666666666665, 0.33333333333333331)
(2.6666666666666665, 0.25)
(3.25, 0.58333333333333337)
(3.4166666666666665, 0.83333333333333337)
(3.0833333333333335, 0.5)
(2.75, 0.33333333333333331)
(3.3333333333333335, 0.66666666666666663)
(2.9166666666666665, 0.16666666666666666)
(3.1666666666666665, 0.58333333333333337)
(3.4166666666666665, 0.66666666666666663)
In [21]:
ax = graph_sum.plot(kind='scatter', x='rating', y='correct')
for g, x, y in zip(graph_sum.index, graph_sum.rating, graph_sum.correct):
    ax.annotate(str(int(g)), xy=(x+ .01, y))
    
plt.ylabel('Question-Based Score')
plt.xlabel('Heuristic-Based Score')
plt.title("Relationship between Heuristic Ratings and Question-Based Scores")
Out[21]:
<matplotlib.text.Text at 0x10af6b0d0>

Check the correlation between the measures. Given the metric nature of our variables, we use Pearson's correlation, which is the default for pandas.Series.corr().

In [22]:
graph_sum.rating.corr(graph_sum.correct)
Out[22]:
0.51622971491266323
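
For reference, the same correlation can also be computed with scipy.stats.pearsonr, which additionally reports a two-sided p-value. This is a minimal sketch (not part of the original analysis), assuming graph_sum from above is in scope:

from scipy import stats
# r matches the pandas value above; p is the two-sided p-value for the null hypothesis r = 0
r, p = stats.pearsonr(graph_sum.rating, graph_sum.correct)
print(r, p)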

To check for outliers in a systematic way, we employ Studentized residuals.

In [23]:
lm = smf.ols('correct ~ rating', data=graph_sum).fit()
lm.outlier_test()
Out[23]:
student_resid unadj_p bonf(p)
graph_id
0 -1.300633 0.213012 1.000000
1 0.728805 0.477347 1.000000
2 1.843065 0.085169 1.000000
3 0.332084 0.744421 1.000000
4 -1.130135 0.276167 1.000000
5 0.777012 0.449233 1.000000
6 -0.355535 0.727139 1.000000
7 -0.413467 0.685115 1.000000
8 -2.221725 0.042110 0.757985
9 -0.632618 0.536509 1.000000
10 0.353062 0.728954 1.000000
11 1.683497 0.112969 1.000000
12 0.147514 0.884691 1.000000
13 -0.265091 0.794548 1.000000
14 0.710145 0.488509 1.000000
15 -1.619050 0.126266 1.000000
16 0.490923 0.630585 1.000000
17 0.572563 0.575423 1.000000

In this simulated dataset, graph 8 is the only one with a statistically significant studentized residual (unadjusted). If this were a real dataset, we would examine this visualization in more detail, seeking to understand why the two metrics disagree.

In [24]:
train[(train.graph_id == 8)].groupby(["condition","question_id"]).mean()
Out[24]:
Student_id rating correct graph_id
condition question_id
heuristic 0 1.5 3.50 NaN 8
1 1.5 4.25 NaN 8
2 1.5 3.25 NaN 8
question 0 5.5 NaN 0.25 8
1 5.5 NaN 0.50 8
2 5.5 NaN 0.25 8

As an alternative to Studentized residuals, we can also identify outliers by fitting a Gaussian envelope to the dataset.

In [25]:
from sklearn.covariance import EllipticEnvelope
clf= EllipticEnvelope()
clf.fit(graph_sum[["rating","correct"]])
clf.mahalanobis(graph_sum[["rating","correct"]])
Out[25]:
graph_id
0     1.710622
1     3.502854
2     2.912075
3     9.760099
4     1.235750
5     1.079405
6     8.160107
7     0.835265
8     6.023398
9     5.212265
10    0.133552
11    3.013277
12    0.238431
13    3.454416
14    0.724548
15    3.898149
16    0.207508
17    1.030750
dtype: float64

Under this measure, graph 8 is still noticeable, but graphs 3 and 6 are more extreme outliers.
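
If a formal cutoff is wanted, one option (not used in the paper) is to compare these squared Mahalanobis distances against a quantile of a chi-squared distribution with 2 degrees of freedom, since the envelope is fit in two dimensions. A minimal sketch, assuming clf and graph_sum from above are in scope:

from scipy import stats
# squared Mahalanobis distances from the fitted envelope
d2 = np.asarray(clf.mahalanobis(graph_sum[["rating", "correct"]]))
# flag graphs beyond the 95th percentile of a chi-squared(2) distribution (about 5.99)
cutoff = stats.chi2.ppf(0.95, df=2)
print(graph_sum.index[d2 > cutoff])  # on this simulated data, flags graphs 3, 6, and 8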

Reliability of the Question Procedure

The nature of our study means that we have two different measures for each visualization, but no authoritative "ground truth" to compare them to. Because of this, it is not possible to say which procedure is more accurate. We can, however, analyze which of them is more reliable. In other words, which procedure generates a ranking that is more consistent from one set of participants to another?

In part, reliability depends on $\sigma_m$, the standard deviation of the mean scores for each visualization. The more spread out the final scores are, the more confident we can be that we ranked them correctly. Even if a procedure has a high $\sigma_m$, however, it will not be very reliable if there is a lot of noise in scoring each design. The higher the average standard error for each final score, $\sigma_v$, the more the score will vary from one run of the study to another and the less confidence we can have in it. We therefore consider the ratio $\sigma_m / \sigma_v$: how spread out the final scores are relative to the noise in each one. To estimate this parameter consistently, we use the statistic $g=\sqrt{I/J}\,f$, where $I$ is the total number of scores, $J$ is the number of designs, and $f$ is Cohen's $f$. $f^2$ is sometimes called the signal-to-noise ratio, which we compute as the treatment sum of squares over the error sum of squares.
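
To make this computation concrete before walking through it cell by cell, the whole statistic can be collected into one small function. This is a hypothetical helper written for illustration (the name reliability_g and its arguments are ours, not from the paper); the cells below perform the same steps explicitly.

def reliability_g(scores, groups):
    # Sketch of g = sqrt(I/J) * f, where f^2 = SSTreatment / SSError.
    # `scores` holds the per-participant aggregated scores and `groups` holds
    # the corresponding design (graph) labels.
    d = pd.DataFrame({"score": scores, "group": groups})
    I = len(d)                                    # total number of scores
    J = d["group"].nunique()                      # number of designs
    grand_mean = d["score"].mean()
    group_means = d.groupby("group")["score"].transform("mean")
    ss_treatment = ((group_means - grand_mean) ** 2).sum()
    ss_error = ((d["score"] - group_means) ** 2).sum()
    cohen_f = (ss_treatment / ss_error) ** 0.5    # Cohen's f
    return (I / J) ** 0.5 * cohen_f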

Get a dataframe with just the questions to work on.

In [26]:
trainmc = train[train["Question_type"] == "multiple_choice"]

Since the answers given by one student could be correlated with each other, we do not want to base our estimates of variance on each individual question. Instead, we first aggregate the three questions answered by each student for each graph, taking the average number of correct responses.

In [27]:
trainmcagg = trainmc.groupby(["graph_id","Student_id"]).mean()["correct"].reset_index()
trainmcagg.head(10)
Out[27]:
graph_id Student_id correct
0 0 4 0.000000
1 0 5 0.666667
2 0 6 0.333333
3 0 7 0.333333
4 1 4 0.333333
5 1 5 0.333333
6 1 6 0.333333
7 1 7 1.000000
8 2 4 0.333333
9 2 5 1.000000

Check how many total scores we have. A score is the fraction of questions one student gets right for one graph.

In [28]:
numqscores = len(trainmcagg)
numqscores
Out[28]:
72

Finally, we get the average proportion of correct responses for each graph_id.

In [29]:
mcorrect = trainmcagg.groupby(["graph_id"]).mean()["correct"]
mcorrect
Out[29]:
graph_id
0     0.333333
1     0.500000
2     0.833333
3     0.333333
4     0.333333
5     0.583333
6     0.250000
7     0.500000
8     0.333333
9     0.250000
10    0.583333
11    0.833333
12    0.500000
13    0.333333
14    0.666667
15    0.166667
16    0.583333
17    0.666667
Name: correct, dtype: float64

Insert these values back into the trainmcagg dataframe.

In [30]:
trainmcagg["mcorrect"] = trainmcagg.graph_id.map(mcorrect)

Each row now shows the fraction of questions one student answered correctly for a graph, as well as the mean proportion correct for that graph.

In [31]:
trainmcagg.head(10)
Out[31]:
graph_id Student_id correct mcorrect
0 0 4 0.000000 0.333333
1 0 5 0.666667 0.333333
2 0 6 0.333333 0.333333
3 0 7 0.333333 0.333333
4 1 4 0.333333 0.500000
5 1 5 0.333333 0.500000
6 1 6 0.333333 0.500000
7 1 7 1.000000 0.500000
8 2 4 0.333333 0.833333
9 2 5 1.000000 0.833333

The grand mean is the percentage correct out of all questions asked for all graphs.

In [32]:
grandmean = trainmcagg.correct.mean()
grandmean
Out[32]:
0.4768518518518518

Also check the standard deviation.

In [33]:
trainmcagg.correct.std()
Out[33]:
0.3106174822985846

Compute the Mean Square Error

In [34]:
MSEqe = ((trainmcagg.correct - trainmcagg.mcorrect)**2).sum()/(numqscores - numgraphs)
MSEqe
Out[34]:
0.077674897119341571

Compute the Mean Square Treatment.

In [35]:
MSTreatmentqe = ((trainmcagg.mcorrect - grandmean)**2).sum()/(numgraphs-1)
MSTreatmentqe
Out[35]:
0.15622730573710961

Compute the Mean Square Total, which can be obtained either from the variance or directly from the formula.

In [36]:
MSTotalqe = ((trainmcagg.correct - grandmean)**2).sum()/(numqscores-1)
print(MSTotalqe, trainmcagg.correct.var())
0.0964832203095 0.0964832203095

As a check to see if everything is done correctly, we confirm that SST = SSTreatment + SSE

In [37]:
SSTotalqe = MSTotalqe * (numqscores -1)
SSTreatmentqe = MSTreatmentqe * (numgraphs - 1)
SSErrorqe = MSEqe * (numqscores - numgraphs)
print("SSTotal = ", SSTotalqe)
print("SSTreatment + SSError =", SSTreatmentqe + SSErrorqe)
SSTotal =  6.85030864198
SSTreatment + SSError = 6.85030864198

We employ Cohen's $f$ as an estimator for the ratio $\sigma_m / \sigma$, where $\sigma$ is the standard deviation of the noise in individual scores. We compute Cohen's $f$ as follows:

In [38]:
Cohen_f_qe = (SSTreatmentqe / SSErrorqe)**.5
Cohen_f_qe
Out[38]:
0.79572995815084713

Using this result, we compute reliability.

In [39]:
reliability_qe = (numqscores / numgraphs)**.5 * Cohen_f_qe
reliability_qe
Out[39]:
1.5914599163016943

As an alternate method, we can first estimate the standard error in the final score of each graph, $\sigma_v$. We define $\sigma_v^2$ as a weighted average of the variance of each graph's final score: $\sigma_v^2 = \frac{1}{I}\sum_j I_j \sigma_{v_j}^2$. Since we're averaging $I_j$ scores together to get the final score for graph $j$, we have $\sigma_{v_j}^2 = \sigma^2 / I_j$. Then $\sigma_v^2 = \frac{1}{I}\sum_j \sigma^2 = \frac{J}{I} \sigma^2$. We estimate $\sigma^2$ by the MSE.

In [40]:
error_v_qe = ( numgraphs / numqscores)**.5 * MSEqe ** .5
error_v_qe
Out[40]:
0.13935108280826308

Next, we separately estimate $\sigma_m$, the standard deviation of the true mean scores for each graph.

In [41]:
sigma_m_qe = (SSTreatmentqe / numqscores)**.5
sigma_m_qe
Out[41]:
0.19205989363603276

We can finally find the reliability by dividing the two:

In [42]:
reliability_qe =  sigma_m_qe / error_v_qe
reliability_qe
Out[42]:
1.3782447166219236

Unlike the derivation based on Cohen's $f$, this method includes a correction for degrees of freedom. This is an attractive feature, though both estimators are biased but consistent.

Grading Example

To get a sense of how big a reliability statistic of about 1.38 is, we employ a familiar example. Suppose we were to use the question procedure to give each visualization a grade of A, B, C, and so forth. Suppose further that one letter grade corresponds to one standard deviation, a common grading guideline. This implies that we scale the scores so that $\sigma_m = 10$ on a 100-point grading scale. We can then compute $\sigma_v$ to be:

In [43]:
scaled_sigma_v = 10 / reliability_qe
scaled_sigma_v
Out[43]:
7.2556055389858409

We now want the probability that a given score is more than 10 points away from its true mean value. We approximate the sampling distribution of the score as a normal curve and sum the area under the curve more than 10 points above the mean with the area more than 10 points below it:

In [44]:
import scipy
2 * sp.stats.norm(0, scaled_sigma_v).cdf(-10)
Out[44]:
0.1681277453769493

In this case, for our simulated data, the chance of being a full letter grade off is about 17%.
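
As a side note (our own observation, not a result from the paper), this probability depends only on the reliability estimate: the scaled standard error is 10 divided by the reliability, so the chance of missing by a full letter grade reduces to twice the standard normal tail probability at minus the reliability. A one-line check, assuming reliability_qe from above is in scope:

from scipy import stats
2 * stats.norm.cdf(-reliability_qe)  # same value as above, roughly 0.168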

Heuristics

The analysis for the heuristic procedure parallels the analysis for the question procedure.

Get a dataframe with just the heuristics to work on.

In [45]:
trainhe = train[train["Question_type"] == "heuristic"]
In [46]:
trainhe.head(15)
Out[46]:
Student_id rating correct graph_id quality question_id Question_type condition
0 0 2 NaN 0 low 0 heuristic heuristic
1 0 2 NaN 0 low 1 heuristic heuristic
2 0 5 NaN 0 low 2 heuristic heuristic
3 1 5 NaN 0 low 0 heuristic heuristic
4 1 5 NaN 0 low 1 heuristic heuristic
5 1 3 NaN 0 low 2 heuristic heuristic
6 2 3 NaN 0 low 0 heuristic heuristic
7 2 1 NaN 0 low 1 heuristic heuristic
8 2 4 NaN 0 low 2 heuristic heuristic
9 3 2 NaN 0 low 0 heuristic heuristic
10 3 5 NaN 0 low 1 heuristic heuristic
11 3 3 NaN 0 low 2 heuristic heuristic
24 0 3 NaN 1 medium 0 heuristic heuristic
25 0 2 NaN 1 medium 1 heuristic heuristic
26 0 4 NaN 1 medium 2 heuristic heuristic

As before, we aggregate by student, taking the mean of the three ratings.

In [47]:
trainheagg = trainhe.groupby(["graph_id","Student_id"]).mean()["rating"].reset_index()
trainheagg.head()
Out[47]:
graph_id Student_id rating
0 0 0 3.000000
1 0 1 4.333333
2 0 2 2.666667
3 0 3 3.333333
4 1 0 3.000000

Get the number of ratings, which could in principle differ from the number of question scores computed above.

In [48]:
numratings = len(trainheagg)
numratings
Out[48]:
72

Get the mean for each graph.

In [49]:
mrating = trainheagg.groupby(["graph_id"]).mean()["rating"]
mrating
Out[49]:
graph_id
0     3.333333
1     2.750000
2     3.333333
3     2.416667
4     3.250000
5     3.000000
6     2.500000
7     3.416667
8     3.666667
9     2.666667
10    3.250000
11    3.416667
12    3.083333
13    2.750000
14    3.333333
15    2.916667
16    3.166667
17    3.416667
Name: rating, dtype: float64

Insert these values back into the dataframe.

In [50]:
trainheagg["mrating"] = trainheagg.graph_id.map(mrating)

Confirm that each row of trainheagg now has the rating given by one student, as well as the mean rating for the given graph.

In [51]:
trainheagg.head(10)
Out[51]:
graph_id Student_id rating mrating
0 0 0 3.000000 3.333333
1 0 1 4.333333 3.333333
2 0 2 2.666667 3.333333
3 0 3 3.333333 3.333333
4 1 0 3.000000 2.750000
5 1 1 4.000000 2.750000
6 1 2 2.333333 2.750000
7 1 3 1.666667 2.750000
8 2 0 4.000000 3.333333
9 2 1 3.666667 3.333333

Get the grand mean - the average over all ratings.

In [52]:
grandmeanrating = trainheagg.rating.mean()
grandmeanrating
Out[52]:
3.092592592592593

Also check the standard deviation of the ratings

In [53]:
trainheagg.rating.std()
Out[53]:
0.68569606917866932

Compute the Mean Square Error

In [54]:
MSEhe = ((trainheagg.rating - trainheagg.mrating)**2).sum()/(numratings - numgraphs)
MSEhe
Out[54]:
0.45987654320987659

Compute the Mean Square Treatment

In [55]:
MSTreatmenthe = ((trainheagg.mrating - grandmeanrating)**2).sum()/(numgraphs-1)
MSTreatmenthe
Out[55]:
0.50290486564996373

Compute the Mean Square Total, which can be obtained either from the variance or directly from the formula.

In [56]:
MSTotalhe = ((trainheagg.rating - grandmeanrating)**2).sum()/(numratings-1)
print(MSTotalhe, trainheagg.rating.var())
0.470179099287 0.470179099287

Test to see if we did that right by confirming that SST = SSTreatment + SSE

In [57]:
SSTotalhe = MSTotalhe * (numratings -1)
SSTreatmenthe = MSTreatmenthe * (numgraphs - 1)
SSErrorhe = MSEhe * (numratings - numgraphs)
print("SSTotal = ", SSTotalhe)
print("SSTreatment + SSError =", SSTreatmenthe + SSErrorhe)
SSTotal =  33.3827160494
SSTreatment + SSError = 33.3827160494

Compute Cohen's $f$.

In [58]:
Cohen_f_he = (SSTreatmenthe / SSErrorhe)**.5
Cohen_f_he
Out[58]:
0.58674563904777843

Use Cohen's $f$ to compute reliability

In [59]:
reliability_he = (numratings / numgraphs)**.5 * Cohen_f_he
reliability_he
Out[59]:
1.1734912780955569

As an alternate method, we first estimate the standard error in the final rating of each graph, $\sigma_v$.

In [60]:
error_v_he = ( numgraphs / numratings)**.5 * MSEhe ** .5
error_v_he
Out[60]:
0.33907098932593621

And the standard deviation of the true mean scores for each graph.

In [61]:
sigma_m_he = (SSTreatmenthe / numratings)**.5
sigma_m_he
Out[61]:
0.34458877899867402

We can finally find the reliability by dividing the two:

In [62]:
reliability_he =  sigma_m_he / error_v_he
reliability_he
Out[62]:
1.0162732579502216

As before, the result differs from the Cohen's $f$ version because it includes a correction for degrees of freedom.
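
As a brief wrap-up (not part of the original analysis), the two procedures can be compared side by side on this simulated dataset. The sketch below assumes the variables computed above, and the hypothetical reliability_g helper sketched earlier, are still in scope:

# Degrees-of-freedom-corrected estimates computed above
print("Question procedure reliability:  %.3f" % reliability_qe)   # roughly 1.378
print("Heuristic procedure reliability: %.3f" % reliability_he)   # roughly 1.016

# The Cohen's-f-based versions, reproduced with the illustrative helper
print(reliability_g(trainmcagg.correct, trainmcagg.graph_id))     # roughly 1.591
print(reliability_g(trainheagg.rating, trainheagg.graph_id))      # roughly 1.173

By either estimator, the question procedure is the more reliable of the two on this simulated data.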