Introduction:
In 1898, when Otto von Bismarck was an old man, a journalist asked him what he took to be the decisive factor in modern history. He answered: "The fact that the North Americans speak English." It was a prescient remark, in two ways. The linguistic and cultural ties among the English-speaking nations were to play a decisive role in shaping the political history of the century. And in turn, the political and cultural ascendancy of the English-speaking nations was to establish English as the most successful world language since classical times, the preferred medium for international business and trade, science and technology, tourism, and cultural life.
To most observers, the rise of the Internet seems to provide just one more road along which English can march on its ineluctable course of conquest. Certainly everybody seems to be sure that the Internet will be an English lake. On the part of anglophones, this certainty is accompanied by a certain amount of self-satisfaction. The Sunday New York Times ran a story a year or so ago with the headline "World, Wide, Web: Three English Words." One computer writer described the Internet as "a great force for the Anglification of the planet" and the editor of a magazine called The Futurist predicts that thanks to new media English will become the native language of a majority of the world by some time in the next century. And indeed, one linguist has suggested in all earnestness that the UN should simply declare English the official world language, but rename it "Globalese" so as not to imply that it belongs to any one speech community anymore. (I have the feeling that Bismarck's misgivings wouldn't have been entirely allayed by this maneuver.)
Not surprisingly, non-English-speakers have tended to react to this prospect with a certain apprehension. The director of a Russian Internet provider recently described the Web as "the ultimate act of intellectual colonialism." And President Chirac was even more apocalyptic, describing English domination of the Internet as a "major risk for humanity," with its threat of linguistic and cultural uniformity. It's true that few non-English-speaking nations have followed France in trying to explicitly mandate the use of the national language on Web sites and the like, but the concern about the spread of English on the Internet is very general.
Is any of this justified -- the neoimperialist swagger on one side, the cries of alarm on the other? It's true that right now the overwhelming portion of Net communication is carried out in English. There are a lot of difficulties in coming up with accurate figures on language use on the Net. But a figure of around 85 percent is probably close, depending on how and when the measurements are made.1 This figure could seem alarming to nonanglophones, but in fact it doesn't mean much by itself. For one thing, it doesn't take into account the current disproportions in where Web users are located. The Internet was basically a North American development, and the majority of its users are still drawn from the United States and the rest of the English-speaking world. Table 1 shows the distribution of top-level Internet hosts by linguistic community from a survey made by Network Wizards in January of 1998.2
Table 1: Proportions of Top-Level Internet Hosts in Major Linguistic Communities
| Linguistic Community | Percentage of top-level servers |
| English | 78.4 |
| German | 4.0 |
| Japanese | 3.9 |
| Finnish | 1.5 |
| Dutch | 1.4 |
| French | 1.4 |
| Swedish | 1.1 |
| Norwegian | 1.0 |
| Spanish | 1.0 |
| Italian | 0.8 |
| Chinese | 0.9 |
| Danish | 0.5 |
| Portuguese | 0.5 |
| Korean | 0.4 |
| Russian | 0.4 |
| Polish | 0.3 |
At present, then, the distribution of Internet users is chiefly a reflection of the patterns of early adoption of the technology. For example the Scandinavian nations have been very aggressive in getting on the Internet, which is why Finland currently has more sites than either the French- or Spanish-speaking worlds. But it's reasonable to assume that the penetration of the Web will eventually reach a level that's more-or-less proportionate to population for most of the developed nations of the world. This may take a few years, since at this point the technology is still spreading more rapidly in the US than in other nations. In 1997, for example, the number of top-level Internet hosts in the US increased by about 93 percent, against 84 percent in Korea, 63 percent in Italy, 59 percent in Japan, 38 percent in Germany, and 36 percent in France.3
The continued high growth in the U.S. is the result of what economists call a positive externality, like the fax effect: the more people there are on-line, the more incentive there is for everyone else to get on-line. The effect is repeated, moreover, as the Internet spreads from one sector or community to the next. In the US, for example, we have already reached the point where we expect every real estate agency to have a Web site and every lawyer to have an email address. To date, most other developed nations haven't reached this point, but this is clearly only a matter of time.4 Internet penetration is already at well over ten percent of households in Germany and the Scandinavian nations, and will reach that level by 1999 in Italy and the Netherlands and soon after in most developed European and Asian nations (the figure for the U.S. is 24 percent).5 Within five or ten years, we can expect the national penetration of the Web to be roughly proportional to population for the developed world, and at that point linguistic disparities will seem less dramatic than they do now.
Even when Internet penetration is roughly equalized, it is true, we will expect that the proportion of English will continue to be greater than the proportion of native Anglophone users, simply because many people in non-English-speaking nations will find it convenient or expedient to post Web pages in English or to use English when they want to reach an international audience. The survey that Schuetze and I performed found that English-language pages currently account for roughly a third of the content of Web servers in non-English-speaking nations, though there is a lot of variation from one country to the next and the proportion of English in most will almost certainly diminish over time.
In any event, it is a mistake to attach any great sociopolitical significance to the bald proportions of one or another language on the Internet. We should bear in mind that the incentive to use English doesn't necessarily create a corresponding disincentive to use the local language. In this way the Internet is different from other media of wide diffusion. There is a limited number of movie screens in France, and each of these can show only one film at a time, so that Steven Spielberg and Eric Rohmer are necessarily in competition for channels of distribution. But with the Internet there is an essentially unlimited abundance of communicative resources, which means that diffusion of information is not a zero-sum game.
Indeed, the economics of distribution make multilingual publication on the Web much more feasible than it is in print. The editors of the proceedings of an international medical conference conducted in English can easily allow authors to provide French versions of their contributions on the same site. A company in Nancy that does 5 percent of its sales outside of France may not feel it is worthwhile to print catalogues in English or other languages, but it may make sense to make available some of its Web pages in English. And while this adds to both the absolute amount of English on the Web and its relative proportion as opposed to French, it doesn't diminish the amount of French or the availability of French content to francophones, whether in France or abroad.
But the problem with attaching significance to the bald proportions of English on the Internet isn't just that they're inaccurate indicators of the availability of non-English content, but that they invoke assumptions about cultural influence that have been carried over from debates about language use in earlier media. These echoes are ubiquitous in the discussions of Internet language use. A recent (and very sensible) list of proposals for augmenting the use of French on the Net, for example, speaks of "la défense et l'illustration du français sur les réseaux," alluding to du Bellay's 16th-century appeal to make the French language illustrious; and complaints about the dominance of English are very often couched in terms of concerns about the "presence" of this or that language on the Net. For most people, clearly, this is a question of national or cultural pride that leans heavily on a print conception of linguistic influence -- the idea that the mark of a great language is a global "presence," a wide diffusion that comes of universal renown. But this picture is singularly inappropriate to the placeless world of the Net, where the only sort of "presence" that is relevant is simple accessibility. On the Internet every language has an international presence. From the machine in my office in Palo Alto I can call up the French-language pages of the French Ministry of Culture, the Welsh-language pages of the National Library of Wales, or the Hawaiian-language site at the University of Hawaii.
Granted, the numerical prevalence of English pages greatly heightens the phenomenal impression of English dominance. If you do a search on Alta Vista for the words "Roland Barthes," for example, you will find that out of the first forty nonduplicate hits, 32 are in English and only two are in French (the others are in German, Italian, Spanish, Finnish, and Swedish). Even if you correct these figures for differences in the degree of Internet penetration, the ratio of English to French documents on Barthes would be almost three to one. For all I know that proportion is consistent with the rates of print publication about Barthes, but the hit list will be a little disconcerting to a Frenchman who is used to browsing the reassuringly francophone shelves of bookstores, libraries, and other institutions of the literary old order.
To some extent, it's true, this impressionistic effect may be ameliorated by increasing use of browser language preferences and by language-specific search engines and Web indexes. If you restrict the browser to French documents, for example, you still turn up 318 matches in all for Roland Barthes, which is quite a large number. But over the long run I think people will simply get used to seeing large numbers of English documents on the Web, and won't attach to this the same importance that they might when they find that sixty percent of the films showing at Parisian cinemas are American productions, which would entail a correspondingly reduced distribution of French films.
The survey
Setting aside the rhetoric of "presence" and influence, then, the important question is not whether English will be statistically or impressionistically dominant on the Web, but rather whether French or Hungarian users will have adequate access to services and information in their own languages. To try to answer this question, my colleague Hinrich Schuetze and I did a survey of 2.5 million Web pages drawn from a Web crawl performed by the Internet Archive in early 1997. This represents about five percent of the entire set of pages retrieved by the crawl, a far greater proportion than has been used in other estimates of language use. We classified the pages using an automatic language identifier developed by Schuetze that can identify alphabetic languages with about 95 percent accuracy (it works slightly less well on nonalphabetic languages like Chinese or Japanese).6 The identifier assigned percentages to the languages used in each top-level domain (i.e., com, de, and so forth). In presenting the proportions of English content, we used a measure that corrected for differences in the size of servers.7
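The idea of correcting for server size can be illustrated with a small sketch. The measure actually used in the survey is not spelled out in this section, so what follows is only one plausible reading, assumed for illustration: compute the share of English pages on each server, then average those shares within a top-level domain, so that one very large server does not swamp many small ones. The function name, the record layout, and the sample data are all hypothetical.

```python
from collections import defaultdict


def english_share_by_domain(pages):
    """pages: iterable of (domain, server, language) tuples, one per page.

    Returns {domain: (raw_share, server_weighted_share)}.
    raw_share counts every page equally; server_weighted_share averages
    the per-server English shares, so each server counts once.
    This weighting is an assumption, not the survey's documented measure.
    """
    # (domain, server) -> [english_pages, total_pages]
    per_server = defaultdict(lambda: [0, 0])
    for domain, server, language in pages:
        counts = per_server[(domain, server)]
        counts[1] += 1
        if language == "en":
            counts[0] += 1

    english_pages = defaultdict(int)
    total_pages = defaultdict(int)
    server_shares = defaultdict(list)
    for (domain, _server), (english, total) in per_server.items():
        english_pages[domain] += english
        total_pages[domain] += total
        server_shares[domain].append(english / total)

    return {
        domain: (
            english_pages[domain] / total_pages[domain],
            sum(shares) / len(shares),
        )
        for domain, shares in server_shares.items()
    }


# Hypothetical sample: one large English-only server and two small German ones.
sample = (
    [("de", "big.example.de", "en")] * 8
    + [("de", "small1.example.de", "de")] * 1
    + [("de", "small2.example.de", "de")] * 1
)
raw, weighted = english_share_by_domain(sample)["de"]
```

On this invented sample the raw page count puts English at 80 percent of the domain, while the per-server average puts it at about a third, which shows how much the choice of measure can matter when a few servers are very large.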
It's important to bear in mind that the amount of content in a particular language can give only an approximate indication of how widely the language is used.8 Nonetheless, we take these figures as suggesting the broad patterns of language use. The results of the survey are too extensive to cover in detail here, but I'll give a few representative patterns (more figures are included in the appendix). For the present purposes, I'll look chiefly at monolingual nations, since multilingual communities raise special problems.
Let's first consider a set of nations that have relatively low Internet penetration, which moreover speak languages that are little used outside of the national boundaries. (I include U. S. figures for comparison.) In all of these nations the proportion of English content is quite high:
Table 2: Developing language communities
| Domain | % of English | Top-level Internet Hosts (July 97) | inhabitants/server |
| Bulgaria (bg) | 86 | 5515 | 1613 |
| China (cn) | 82 | 25,594 | 39,391 |
| Egypt (eg) | 95 | 1894 | 25,608 |
| Georgia (ge) | 81 | 298 | 32,685 |
| Greece (gr) | 81 | 19,711 | 132 |
| Latvia (lv) | 75 | 5184 | 8629 |
| Romania (ro) | 84 | 5998 | 41,619 |
| Thailand (th) | 95 | 12,794 | 406 |
| Turkey (tr) | 62 | 22,963 | 981 |
| US (com, net, edu, etc.) | 99.7 | 11,829,141 | 21 |
These are very different nations, of course, but they have certain things in common. First, the technology is not widely available, so the possibilities of internal communication are limited. (To take the extreme case, if there were only one email user in a nation, it would obviously make no sense for him to use the local language.) Moreover, the vast majority of the Internet users in nations like China, Egypt and Bulgaria are drawn from English-speaking elites, who use the Internet essentially as a medium for scientific and technical communication. (Note also that a number of these nations have to deal with the added technical difficulties posed by non-Roman character sets if they want to publish material in the local language.) In any event, these figures represent only the practice of a small group of early adopters, and are not particularly informative about the direction of use once Internet penetration increases to the levels we see in highly developed nations.
But there is variation as well among the developed non-English-speaking nations where the Internet has already taken root. In general, we find the highest proportion of English among the smaller northern European nations whose languages are restricted to national use:
Table 3: Small, developed language communities; minor languages
| Domain | % of English | Top-level Internet Hosts | Inhabitants per host |
| Denmark (dk) | | | |
| Finland (fi) | | | |
| Netherlands (nl) | | | |
| Norway (no) | | | |
| Sweden (se) | | | |
The high use of English here seems natural enough: these are nations whose national languages are not widely used outside the national boundaries, and whose trade and cultural life is largely lived in a multinational setting; in addition, all of these countries have high levels of English proficiency.
The proportion of English use is lower, by contrast, in nations that speak major European languages:
Table 4: Larger language communities
| Domain | % of English | Top-level Internet Hosts | Inhabitants per host |
| Austria (at) | 42 | 87,408 | 86 |
| Germany (de) | 25 | 875,631 | 88 |
| Spain (es) | 24 | 121,823 | 319 |
| France (fr) | 26 | 292,096 | 186 |
| Italy (it) | 33 | 211,966 | 265 |
| Portugal (pt) | 26 | 18,147 | 547 |
Internet penetration is currently lower in these nations, but they still represent large linguistic communities with a correspondingly higher emphasis on internal communication. Moreover, proficiency in English isn't presumed in nations like France and Italy the way it is in Sweden or the Netherlands, so a site posted in English runs the risk of being inaccessible to a number of local users, particularly as the technology spreads to nontechnical and nonacademic users.
Finally, we note that there is a very low percentage of English in the Latin-American countries, even though the technology is still not widely in place there:
Table 5: Latin-American nations
| Domain | % of English | Top-level Internet Hosts | Inhabitants per host |
| Argentina (ar) | | 18,985 | 1472 |
| Brazil (br) | | 68,685 | 1733 |
| Chile (cl) | | 19,168 | 591 |
| Colombia (co) | | 6905 | 3842 |
| Mexico (mx) | | 35,238 | 1913 |
| Peru (pe) | | 6510 | 2616 |
| Uruguay (uy) | | 1024 | 2723 |
| Venezuela (ve) | | 4679 | 3102 |
This again makes sense: both Brazil and hispanophone Latin America are large monolingual communities whose cultural and commercial ties with the rest of the world are relatively etiolated compared, say, to those of Sweden or France. (These are also nations in which people tend to speak English less well than in Western Europe.) And while the proportion of world Internet hosts that are in Brazil or the hispanophone world is relatively small, each still represents a sizable population in its totality.
For any one of these nations, of course, there are a number of particular factors that determine how and when the local language is used, and there are interesting cases of variation -- for example, it isn't clear why the proportion of English is so much higher in Austria than in Germany or in Colombia than in Venezuela. For the present purposes, though, it's enough to observe that these results show that the received wisdom about the Internet is false. English is not going to drive out the use of other languages, and in fact is already in a minority position in all the non-English-speaking nations in which the technology has gained a substantial foothold. With greater diffusion, moreover, the proportion of non-English content in non-English-speaking domains is certain to increase, as the technology is adopted by small businesses and individual users. A real-estate agency or architectural studio in Hannover has neither the incentive nor probably the resources to put up its content in English, the way a company like Lufthansa does.9
The process is self-reinforcing, moreover: the more members of a linguistic community there are on the Web, the more incentive they have to use their own language -- and the more incentive advertisers and content providers have to provide local-language services. (A 1998 study by Jupiter Communications reports that at some major U.S. sites between 30 and 50 percent of the hits come from foreign users, who would surely find it more convenient and quicker to access information from local servers.)10
There are already numerous examples of this trend. The Web sites of the Council of the European Union and of the Louvre, for example, were originally posted exclusively in English, but now offer multilingual versions. The Web index service Yahoo! has put up localized versions in French, Spanish, German, Danish, Norwegian, Swedish, Italian, Chinese, Korean, and Japanese. And there is a booming market in translation and "localization" services, as corporations and advertisers press to make their messages available in other languages.
At the same time, it's asking a lot to suppose that the market alone will ensure that speakers of other languages have available the full range of information and services available to English-speaking users. The state has a role to play here as well. A few nations, like France, have taken the step of mandating the use of the local language in Web sites and the like, but there are also less coercive steps that governments can take, for example by subsidizing translation and the digitization of cultural patrimonies.11 And both governments and international bodies can continue to devote resources to the development of technology required to display non-ASCII characters and character sets.12 And developed nations may also choose to support the installation of Internet connections in less-developed regions of their linguistic communities, such as francophone or lusophone Africa.
None of this means, of course, that most people from non-English-speaking communities will be doing all their Internet browsing and communication in their own languages. The Web is an international marketplace, after all, and a French or German user who has some knowledge of other languages would be foolish to confine herself to sites in her own language when shopping for software or CD's. On-line dictionaries and translation aids, moreover, can make the use of foreign-language information much easier than it is in print, particularly for users who have some basic knowledge of the language already.13 In this sense the Web does provide an added incentive for people to learn other languages, particularly but not exclusively English -- I should note that foreign-language sites are being used very productively by students in American language courses.
Effects of the Internet on linguistic organization
This opening up of the linguistic market has effects that go beyond the expanded incentives to learn English, though. Together with other features of the Internet, it promises to work changes in the organization of language communities and in the role they play in the construction of national communities. The modern conception of language and national identity, after all, was largely a creation of the print communications system of the eighteenth and nineteenth centuries, which emerged with the maturation of print capitalism and the spread of bourgeois literacy. It rested first on the emergence of standardized vernaculars that were more widely used than the administrative vernaculars of the Renaissance state, which made possible the diffusion of uniform representations throughout the national community, so that, as Benedict Anderson has put it, people
gradually became aware of the hundreds of thousands, even millions, of people in their language-field, and at the same time that only those hundreds of thousands, or millions, so belonged. These fellow readers, to whom they were connected through print, formed, in their secular, visible invisibility, the embryo of the nationally-imagined community.14
This is the sense that Samuel Johnson was getting at when he said in 1777 that Britain had become a "nation of readers." He meant not merely that more people were reading (though that was true enough), or that they were reading more texts (reading "extensively," as Roger Chartier has put it, as opposed to the "intensive" reading of earlier periods), but also that the experience of participating in the print discourse had become constitutive of the sense of national identity.
The discourse that mediated the rise of modern national consciousness was naturally shaped by the material limitations of print (and later, by analogous properties of the broadcast media). Given the large capital accumulations that print required, production was necessarily concentrated, increasingly so over the course of the nineteenth and twentieth centuries, despite the continuing democratization of the reading public. Distribution was highly centralized, especially in metropolitan nations like France and England -- circulation was effectively limited to national boundaries, and remote regions and colonies were marginalized (a complaint, recall, that was particularly rankling to the American revolutionaries). And in order to achieve the relatively large circulation that the economics of print required, the common discourse was restricted to matters of general interest -- to "public affairs" in the broadest sense of the term (that is, so as to include literature, commerce, and faits divers). This topical circumscription in turn determined the nature of the standardized print language itself: by the nineteenth century it had become an instrument designed essentially for the requirements of formal exposition -- what Heinz Kloss refers to as Sachprosa -- that was well removed from the language of everyday life.
Electronic communication contrasts with print and broadcast in most of these regards. First, the low costs of production and distribution mean that the ability to speak is more widely distributed -- the point that people are getting at when they observe that on the Internet, "anyone can reach a potential audience of millions," and the like. This can be a little misleading, to be sure. Posting a Web site that is actually accessible to hundreds of thousands of users requires a large capital investment in both technology and publicity, and the recent scramble to acquire Web "portals" is an indication of how concentrated the distribution of information actually is.15 But the Internet is still leakier than print, and tends to resist monopolistic concentration. The decentralization of distribution, moreover, entails that communication is much more efficient, both in the sense that messages can more easily reach their intended audiences, and in the sense that access is independent of geographical distance and institutional and commercial connections -- the property that creates, among other things, the more open linguistic marketplace that I mentioned earlier.
Then too, there are notable differences in content between the two media. It may be that "print discourse" is itself an abstraction over a wide range of forms and media, but "the Internet" is even more so. Some Internet content is the digital equivalent of print forms -- news, literature, scientific papers, advertising, and the like. Much of it, though, has no print equivalent -- think of discussion groups, email, or personal Web pages, forms that either take the place of communication that was formerly oral or represent essentially new types of communication. As a consequence, the language of the Net contains a more varied repertoire than the language of print -- it includes not just the equivalents of print vernaculars, but the varieties used in email and discussion lists, whose deceptive informality masks a highly stylized register.
What effects will all this have on the organization of language communities? The variety of forms and functions of the Net makes it very difficult to answer this in any simple way. Electronic communication does have certain inherent biases, to take Harold Innis's term, but they don't militate in the large for any particular form of social or political organization. At best the technology can help to amplify and facilitate sociopolitical changes already in motion. But in the end, the wired world will be sociolinguistically quite different from the present one.
Take, for example, some of the effects of the elimination of geographical constraints on the accessibility of information. One area where this has already had a striking effect is in the distribution of news and other kinds of public information. In the world of print or broadcast, it's only the English-language media that can achieve anything like general worldwide distribution. You can sometimes find a French television news program on cable in big cities in the U.S. or a three-day-old copy of Le Figaro at an international news dealer, but they aren't available in every hotel room and at every street corner the way the Herald Tribune and CNN are in France. And for smaller or less influential languages like Greek or Hindi, the circulation of information pretty much stops at national borders.
With the Web, by contrast, this kind of distribution is very easy. My French and German colleagues at the Xerox Palo Alto Research Center routinely read the Web versions of daily papers like Le Figaro and Die Welt. And you can find electronic versions of newspapers from Malaysia, Indonesia, Colombia, Turkey, Qatar, and about 70 or 80 other nations. (Yahoo! lists more than 1400 sites for newspapers outside of the United States, a majority of which are publishing at least a part of their content on the Web.) And as with news, so with many other forms of communication traditionally consigned to print: magazines, government information, educational materials, scientific journals, and finally, as the digitized collections of major national libraries begin to come on-line, the aggregate literatures of the developed nations. To these moreover we should add the numerous international discussion groups conducted in languages large and small, which can constitute international communities of reception in which news and the like can be interpreted. (A recent search for sites or discussion groups that were wholly or partially in languages other than English found around 100 languages, among them Arabic, Armenian, Basque, Breton, Cambodian, Catalan, Czech, Esperanto, Gaelic, Greek, Hebrew, Hindi, Hmong, Hungarian, Indonesian, Macedonian, Malay, Rumanian, Slovenian, Swahili, Urdu, Welsh, Yiddish, and Yoruba.)
These efficiencies of distribution work to the particular advantage of dispersed language communities -- not just linguistic diasporas like the Germans in California or Yiddish speakers everywhere, but ultimately, postcolonial populations that have up to now existed in the linguistic penumbra of the metropolis. People in the francophone Caribbean or the Maghreb, for example, can have more immediate access to a much greater range of French-language content; institutions of higher education can have access to textbooks, periodicals, and eventually, to the digitized contents of national library collections. And similarly for the Hungarian speakers in Slovakia, the Chinese of Southeast Asia, the francophones of western Canada, or the Russians in many parts of Eastern Europe.
On the other hand, the Net can reduce the dependency on institutions that have traditionally exercised a hegemonic influence over the periphery. Francophones outside the metropolis need not depend on Le Figaro or Le Monde; they can also get on-line versions of Nice-Matin, Lyon Capitale, or Les Dernières Nouvelles d'Alsace. People in lusophone Africa have the option of going to sites in Brazil, which has many more Web sites than Portugal does. And in the Caribbean, presumably, the great number of U.S. sites will attract more users than sites in Britain, the historical metropolis of the region. At the point when the Internet becomes an important medium of communication in these nations, then, linguistic ties with the metropolis are likely to become more selective and facultative than with print. But we should bear in mind that in many of these nations, Internet access will be restricted for a long time to small elites, and the technology won't have the cultural importance that the mass media do: its effects will be limited to policy makers and higher education.
What of the major language communities in the developed world? Nations like the U.S., France, Germany, and so forth are already cohesive communities whose interests are well served by the mass media, and however far-reaching the effects of on-line news and information, the shift to this form of publication is unlikely to work any important changes in the sense of national identity, particularly since these functions will continue to be highly centralized, dominated by a small number of on-line publications and Web portals. There has been a good deal of talk, of course, about the Internet as an internationalist force that transcends national boundaries and creates "global communities" and the like. But while international discussion lists have an important role to play in sectors like the academic and scientific worlds, they are not likely to be much of a factor in disrupting the basic patterns of national identity. Francophone Belgians are not going to feel less Belgian simply in virtue of participating in discussion groups with francophones in France or Canada.
There are several ways, though, in which the Internet may have sociolinguistic effects even in communities like these. One is in altering the perception of the connection between language and national community. The place to look here, I think, is less the digital equivalents of print genres than the various quasi-public forms that have emerged on the Net: Usenet discussion groups, special-interest distribution lists, and the like, which have had huge participation in nations like the U. S., where 25 percent of households already have Internet access. For the first time in history, the written language is being used as a medium for active, daily, public communication among millions of people -- "public" at least in the sense that the participants have never met, and are connected entirely through their participation in these groups. The "nation of readers," that is, is becoming a nation of writers.
Enthusiasts like to predict that this discourse will lead to a fundamental reorganization of public life. This is probably unrealistic: the Internet is too disorganized, too fragmented and too selective in its participation to replace traditional political institutions. But the Net has become an important secondary forum that the press must pay attention to, and it shapes the way a lot of people understand the public discussions of civic life. In particular it introduces a new form of language into these discussions, one less like the print discourse than the oral exchanges of private life, but filtered through newly emerged conventions of electronic communication. This informality can be deceptive: the language of email discussions is no less rule-governed than the language of print, but is less explicit and relies more on the contextual background -- it is the projection of private language into a quasi-public sphere. This is one reason why forms like email can be difficult for foreigners to master, even when they are capable of writing perfect formal prose. More important, the medium can discourage the participation even of native speakers who aren't privy to the interactive norms of middle-class speech. In this regard it is no different, of course, from ordinary conversation, but ordinary conversation doesn't present itself as a public forum.
So while the Internet certainly opens up the public discussion in certain ways, it can also restrict and circumscribe it, by moving it away from a neutral public language that transcends social differences. In this sense the medium could play into the strong recent tendencies in several Western nations to redefine nationality in more narrowly culture-based terms, rather than in terms of shared institutions and political ideals. (The importance of Internet discussion has already been cited, for example, by proponents of the movement to make English the official language of the United States.) I don't want to make too much of this -- these tendencies were in play well before the Internet was introduced, and the Net is not going to be more than a compounding influence. But we should bear this in mind when we hear people talk in confident ways about the capacity of the Internet to broaden the political process. As Harold Innis said about earlier technological revolutions in communication, "improvements in communication [can] make for increased difficulties in understanding."16
I should mention one other way in which the discussions on the Internet can play a role in redefining communities, by refiguring secondary communities at the national level, particularly in scientific and professional sectors. In non-English-speaking nations, most scientific publication is now carried out either largely or entirely in English, with the local language reserved either for administrative publications or for oral discussion in classrooms, laboratories and the like. In a sense, then, we can't speak of a French- or Italian-language scientific public, since scientific discourse in these languages is chiefly conducted within the private and institutional spheres, mediated by personal and professional connections.
On the Internet, though, a great deal of this oral discourse has begun to bubble up into public view. The scientific distribution lists of the Net are full of the kinds of discussions of practice that were excluded from print journals in the nineteenth century: pedagogical and technical tips, gossip, institutional politics, anecdotal observations about curiosities that lie outside the realm of current theory. You might be reminded of the scientific periodicals of the pre-nineteenth-century period, like the Philosophical Transactions of the Royal Society or the Journal des Sçavans, with the mix of concerns that we would now distinguish as private and public, and the mix of participants that we would now distinguish as authorized and unauthorized -- professors, graduate students, interested amateurs. And this development necessarily changes the conception of the national scientific community, both by sharpening the awareness of common interests and by providing a forum for the formation of opinion that is more independent of institutional structures and hierarchies.
In the end, this may be what is most interesting about the Internet -- not that it makes the world smaller, like previous communications technologies, but that it helps to keep it big and diverse.
Notes
1 There are several ways of classifying pages according to languages. One simple procedure is to run searches on certain terms (ideally, these should be either names like "Internet" whose spelling doesn't change from one language to the next, or terms whose synonyms are spelled differently in each language, like "welcome"). Using this procedure, Crystal (1997) comes up with a figure for English of around 80 percent in a 1996 search. (See Crystal, David. English as a Global Language. Cambridge: Cambridge University Press, 1997.) There are several problems here, however. For one thing, standard search engines do not index pages from all languages with equal coverage. For another, the topics of discussion on the Web vary from one domain to the next, particularly since the mix of sectors is different -- in the U.S., for example, there are many more pages belonging to small businesses and individuals. And certain search terms (for example "tritium") will evoke higher proportions of English than others.
A second method relies on user reports, such as those collected in the GVU user surveys at Georgia Tech (see http://www.cc.gatech.edu/gvu/user_surveys/survey-1998-04/). Since participation in these surveys is voluntary and appeals are posted in English, however, this method naturally favors anglophone and particularly American sites.
A more accurate procedure was used in a survey performed in 1997 by the Babel project, a joint initiative of Alis Technologies and the Internet Society (see http://babel.alis.com:8080/palmares.en.html). The project surveyed about 3200 randomly chosen home pages that contained more than 500 characters using an automatic language classifier. The potential sources of difficulty here include the accuracy of the language classifier (which utilized only trigrams), the randomness of the selection process, the representativeness of home pages, and the small size of the sample, particularly as regards smaller domains. For the better-represented languages, however, the figures accorded quite closely with the results of our own survey, reported below, which involved about 2.5 million pages. In particular, their figure of 84 percent English content is close to our figure of 85.9 percent.
| Language | Number of pages | Percentage of pages |
| English | 2722 | 84.0 |
| German | 147 | 4.5 |
| Japanese | 101 | 3.1 |
| French | 59 | 1.8 |
| Spanish | 38 | 1.2 |
| Swedish | 35 | 1.1 |
| Italian | 31 | 1.0 |
| Portuguese | 21 | 0.7 |
| Dutch | 20 | 0.6 |
| Norwegian | 19 | 0.6 |
| Finnish | 14 | 0.4 |
| Czech | 11 | 0.3 |
| Danish | 9 | 0.3 |
| Russian | 8 | 0.3 |
| Malay | 4 | 0.1 |
2 See http://www.nw.com/zone/WWW/report.html. These figures include only the 55 largest domains, which, however, account for over 99.5 percent of top-level hosts. I also include only linguistic communities that represent more than 0.3 percent of the total number of top-level hosts. The mapping between domains and linguistic communities was made as follows:
3 The only nation in which the rate of increase has been higher than in the U.S. is Taiwan, where the number of hosts increased by 510 percent in 1997.
4 In France in particular the adoption of the Internet may be a bit slower because so many of its functionalities are already available to users of the Minitel. A random selection of 40 sites in France from 1997 revealed the following distribution:
| Type of organization | Sites | Percentage |
| Internet-related businesses | 11 | 27.5 |
| University sites | 8 | 20 |
| Research institutes | 7 | 17.5 |
| Touristic or community guides | 4 | 10 |
| Media companies | 3 | 7.5 |
| Government agencies | 3 | 7.5 |
| On-line magazines | 2 | 5 |
| Computer sales | 1 | 2.5 |
| CD-ROM catalogue | 1 | 2.5 |
5 See http://www.jup.com/research/reports/europe/.
6 The classifier works by identifying words and trigrams (three-character sequences) that are characteristic of the language in question. It was first trained to identify English on the gov (U.S. government) domain, on the assumption that virtually all of the text documents in that domain were in English. It was then trained to identify other languages by analyzing the language of a domain in which the language was native (e.g., German in de) while ignoring cues that were characteristic of English. The process was iterated for domains in which more than two languages were used (e.g., Belgium or Finland).
In all domains there was a residue of pages that could not be assigned to one or another language, usually because they did not contain sufficient alphabetic content (e.g., index pages, pages of numerical tables). For most large domains, this residue ranged between five and ten percent of pages.
Where the classifier identified some pages as belonging to a language related to the national language of the country, we assigned these misidentifications to the national language. For example, the classifier assigned 1.9 percent of the pages in the dk (Denmark) domain to Norwegian, but we assumed that these were misclassifications of Danish-language pages.
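The general idea of trigram-based language identification can be illustrated with a minimal sketch. This is not the classifier used in the survey: it uses trigrams only (no word cues), omits the step of ignoring cues characteristic of English, and is trained on two short made-up sample sentences rather than on whole domains. The names `SAMPLES`, `trigrams`, `train`, `classify`, and the `min_trigrams` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical training text; a real classifier would train on far larger samples.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the "
          "children went to the house with their mother and father",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "gingen die kinder mit ihrer mutter und ihrem vater in das haus",
}

def trigrams(text):
    """Return the character trigrams of text, lowercased, whitespace collapsed."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]

def train(samples):
    """Build one trigram-count profile per language from labeled sample text."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}

def classify(text, profiles, min_trigrams=10):
    """Return the language whose profile gives the text's trigrams the most
    relative frequency, or None when the text is too short to judge
    (analogous to the unassigned residue of pages)."""
    grams = trigrams(text)
    if len(grams) < min_trigrams:
        return None
    scores = {}
    for lang, profile in profiles.items():
        total = sum(profile.values())
        # Sum the relative frequency of each trigram in this language's profile.
        scores[lang] = sum(profile[g] / total for g in grams)
    return max(scores, key=scores.get)

profiles = train(SAMPLES)
print(classify("the mother and the father went to the house", profiles))
print(classify("die kinder und der hund gingen in das haus", profiles))
```

With training data this small the sketch is only reliable on text that closely resembles the samples; closely related languages, such as Danish and Norwegian, would need much larger profiles and would still be confused some of the time.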
7 The procedure was as follows: we assigned each server a single vote, independent of its size, and distributed that vote among languages according to the proportion of pages on the server in each. The procedure has the advantage of correcting for the effect of a few large servers that may not be representative of the wider pattern of language use in a domain -- for example if they belong to an international organization or a multinational corporation.
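The one-vote-per-server weighting can be sketched as follows. The function name, the input layout, and the page counts are hypothetical and chosen only to show how a single large server stops dominating the result; they are not figures from the survey.

```python
from collections import defaultdict

def language_shares_by_server(page_counts):
    """page_counts maps server -> {language: number of pages}.
    Each server casts one vote, split among languages in proportion to
    its own pages. Returns {language: share of all votes cast}."""
    votes = defaultdict(float)
    voting_servers = 0
    for counts in page_counts.values():
        total = sum(counts.values())
        if total == 0:
            continue  # a server with no classified pages casts no vote
        voting_servers += 1
        for lang, n in counts.items():
            votes[lang] += n / total
    if voting_servers == 0:
        return {}
    return {lang: v / voting_servers for lang, v in votes.items()}

# Hypothetical domain: one large multinational server, two small local ones.
example = {
    "big-corp": {"en": 9000, "fr": 1000},
    "small-a": {"fr": 10},
    "small-b": {"fr": 10},
}
shares = language_shares_by_server(example)
print(shares)
```

Counted by pages, French would account for only about 10 percent of this made-up domain (1020 of 10020 pages); counted by server votes, it accounts for 70 percent, since two of the three servers are entirely in French.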
8 For one thing, the figures measure only the number of Web pages in a language, not its use in email or Internet discussion groups and the like. For another, we haven't measured how frequently certain Web pages are consulted. This difference is likely to be particularly important when we are comparing nations like the U.S. or Finland with nations in which the technology is only beginning to take hold. In the former there are large numbers of sites for individuals and small firms that get relatively few hits, while in the latter most of the sites belong to government departments, universities, and large companies, which get relatively more hits each.
9 The determination of who is using English and why would involve an extensive hand-search of sites, which we did not undertake. We did however examine 40 French sites chosen at random, and noted the following patterns. Only three sites contained all or almost all English content. These were the French Nuclear Energy agency, a site for a TV channel, and a site for a medical institute. Another seven sites contained about half English content. These included an Internet developer, a multimedia communications agency, a Metz city guide, and several university sites. The remaining sites consisted of all or mostly all French content.
10 See http://www.jup.com.
11 Anatoly Voronov, director of the Russian Internet provider Glasnet, observes that at present "it is far easier for a Russian language speaker with a computer to download the works of Dostoyevsky translated into English to read than it is for him to get the original in his own language." Quoted in Michael Specter, "The World-Wide Web: Three English Words," The New York Times, April 6, 1996.
12 This has been a major focus of the MLIS program of the European Union, as well; see http://www2.echo.lu/mlis/mlishome.html.
13 Machine translation systems, for example, may be woefully inadequate when it comes to composing letters or understanding the fine points of a text, but they are generally sufficient to give a user a sense of what a document is saying, particularly if she has a smattering of the language already. And bilingual dictionary plug-ins can provide glosses that take into account the context in which a term is used. (If a reader runs across the English sentence "We sent out for pizza," for example, the dictionary will be in a position to know that "send out" in the context means "order" rather than "transmit.")
14 Anderson, Benedict, Imagined Communities. London: Verso, 1983, p. 47.
15 According to a study by Alexa Inc., 50 percent of all Web clicks go to just slightly more than 1500 sites, or less than one tenth of one percent of the total, and the top two percent of Web sites account for 95 percent of the total number of clicks.
16 Innis, Harold, The Bias of Communication. Toronto: University of Toronto Press, 1951, p. 25.