Modern Information Retrieval
Appendix: Porter's algorithm


Contents

The rules in the Porter algorithm are separated into five distinct phases numbered from 1 to 5. They are applied to the words in the text starting from phase 1 and moving on to phase 5. Further, they are applied sequentially one after the other as commands in a program. Thus, in the immediately following, we specify the Porter algorithm in a pseudo programming language whose commands take the form of rules for suffix substitution (as above). This pseudo language adopts the following (semi-formal) conventions:

-   a consonant variable is represented by the symbol C which is used to refer to any letter other than a,e,i,o,u and other than the letter y preceded by a consonant;
-   a vowel variable is represented by the symbol V which is used to refer to any letter which is not a consonant;
-   a generic letter (consonant or vowel) is represented by the symbol L;
-   the symbol 1#1 is used to refer to an empty string (i.e., one with no letters);
-   combinations of C, V, and L are used to define patterns;
-   the symbol * is used to refer to zero or more repetitions of a given pattern;
-   the symbol + is used to refer to one or more repetitions of a given pattern;
-   matched parenthesis are used to subordinate a sequence of variables to the operators * and +;
-   a generic pattern is a combination of symbols, matched parenthesis, and the operators * and +;
-   the substitution rules are treated as commands which are separated by a semicolon punctuation mark;
-   the substitution rules are applied to the suffixes in the current word;
-   a conditional if statement is expressed as ``if (pattern) rule'' and the rule is executed only if the pattern in the condition matches the current word;
-   a line which starts with a % is treated as a comment;
-   curled brackets are used to form compound commands;
-   a ``select rule with longest suffix'' statement selects a single rule for execution among all the rules in a compound command. The rule selected is the one with the largest matching suffix.
Thus, the expression (C)* refers to a sequence of zero or more consonants while the expression ((V)*(C)*)* refers to a sequence of zero or more vowels followed by zero or more consonants which can appear zero or more times. It is important to distinguish the above from the sequence (V*C) which states that a sequence must be present and that this sequence necessarily starts with a vowel, followed by a subsequence of zero or more letters, and finished by a consonant. Finally, the command
if (*V*L) then ed 2#21#1
states that the substitution of the suffix ed by nil (i.e., the removal of the suffix ed) only occurs if the current word contains a vowel and at least one additional letter.

The Porter algorithm is applied to each word in the text (simple formulation) and is given by the following procedure.

% Phase 1: Plurals and past participles.
select rule with longest suffix {
sses 2#2 ss;
ies 2#2 i;
ss 2#2 ss;
s 2#2 1#1; }
select rule with longest suffix {
if ( (C)*((V)+(C)+)+(V)*eed) then eed 2#2 ee;
if (*V*ed or *V*ing) then {
select rule with longest suffix {
ed 2#2 1#1;
ing 2#2 1#1; }
select rule with longest suffix {
at 2#2 ate;
bl 2#2 ble;
iz 2#2 ize;
if ((* C1C2) and ( C1 = C2) and ( 3#3 {l,s,z})) then C1C2 2#2 C1;
if (( (C)*((V)+(C)+)C1V1C2) and ( 4#4 {w,x,y})) then C1V1C2 2#2 C1V1C2e; }
}
}
if (*V*y) then y 2#2 i;
if ( (C)*((V)+(C)+)+(V)*) then
select rule with longest suffix {
ational 2#2 ate;
tional 2#2 tion;
enci 2#2 ence;
anci 2#2 ance;
izer 2#2 ize;
abli 2#2 able;
alli 2#2 al;
entli 2#2 ent;
eli 2#2 e;
ousli 2#2 ous;
ization 2#2 ize;
ation 2#2 ate;
ator 2#2 ate;
alism 2#2 al;
iveness 2#2 ive;
fulness 2#2 ful;
ousness 2#2 ous;
aliti 2#2 al;
iviti 2#2 ive;
biliti 2#2 ble; }
if ( (C)*((V)+(C)+)+(V)*) then
select rule with longest suffix {
icate 2#2 ic;
ative 2#2 1#1;
alize 2#2 al;
iciti 2#2 ic;
ical 2#2 ic;
ful 2#2 1#1;
ness 2#2 1#1; }
if ( (C)*((V)+(C)+)((V)+(C)+)+(V)*) then
select rule with longest suffix {
al 2#2 1#1;
ance 2#2 1#1;
ence 2#2 1#1;
er 2#2 1#1;
ic 2#2 1#1;
able 2#2 1#1;
ible 2#2 1#1;
ant 2#2 1#1;
ement 2#2 1#1;
ment 2#2 1#1;
ent 2#2 1#1;
ou 2#2 1#1;
ism 2#2 1#1;
ate 2#2 1#1;
iti 2#2 1#1;
ous 2#2 1#1;
ive 2#2 1#1;
ize 2#2 1#1;
if (*s or *t) then ion 2#2 1#1; }
select rule with longest suffix {
if ( (C)*((V)+(C)+)((V)+(C)+)+(V)*) then e 2#2 1#1;
if (( (C)*((V)+(C)+)(V)*) and not (( *C1V1C2) and ( 4#4 {w,x,y}))) then e 2#2 nil; }
if ( (C)*((V)+(C)+)((V)+(C)+)+V*ll) then ll 2#2 l;