Computer-Mediated Discourse Analysis:

An Approach to Researching Online Behavior

 

Susan C. Herring

School of Library and Information Science

Indiana University

herring@indiana.edu

 

 

To appear in Barab, S. A., Kling, R., & Gray, J. H. (Eds.). (in press). Designing for Virtual Communities in the Service of Learning. New York: Cambridge University Press.

 


Introduction

Over the past fifteen years, the Internet has triggered a boom in research on human behavior. As growing numbers of people interact on a regular basis in chat rooms, Web forums, listservs, email, instant messaging environments and the like, social scientists, marketers and educators look to their behavior in an effort to understand the nature of computer-mediated communication and how it can be optimized in specific contexts of use. This effort is facilitated by the fact that people engage in socially meaningful activities online in a way that typically leaves a textual trace, making the interactions more accessible to scrutiny and reflection than is the case in ephemeral spoken communication, and enabling researchers to employ empirical, micro-level methods to shed light on macro-level phenomena. Despite this potential, much research on online behavior is anecdotal and speculative, rather than empirically grounded. Moreover, Internet research often suffers from a premature impulse to label online phenomena in broad terms, e.g., all groups of people interacting online are “communities”;[1] the language of the Internet is a single style or “genre”.[2] Notions such as “community” and “genre” are familiar and evocative, yet notoriously slippery, and unhelpful (or worse) if applied indiscriminately. An important challenge facing Internet researchers is thus how to identify and describe online phenomena in culturally meaningful terms, while at the same time grounding their distinctions in empirically observable behavior.

Online interaction overwhelmingly takes place by means of discourse. That is, participants interact by means of verbal language, usually typed on a keyboard and read as text on a computer screen. It is possible to lose sight of this fundamental fact at times, given the complex behaviors people engage in on the Internet, from forming interpersonal relationships (Baker, 1998) to implementing systems of group governance (Dibbell, 1993; Kolko & Reid, 1998). Yet these behaviors are constituted through and by means of discourse: language is doing, in the truest performative sense, on the Internet, where physical bodies (and their actions) are technically lacking (Kolko, 1995). Of course, many online relationships also have an offline component, and as computer-mediated communication becomes increasingly multimodal, semiotic systems in addition to text are becoming available for conveying meaning and “doing things” online (cf. Austin, 1962). Nonetheless, textual communication remains an important online activity, one that seems destined to continue for the foreseeable future. It follows that scholars of computer-mediated behavior need methods for analyzing discourse, alongside traditional social science methods such as experiments, interviews, surveys, and ethnographic observation.

            This chapter describes an approach to researching online interactive behavior known as Computer-Mediated Discourse Analysis (CMDA). CMDA applies methods adapted from language-focused disciplines such as linguistics, communication, and rhetoric to the analysis of computer-mediated communication (Herring, 2001). It may be supplemented by surveys, interviews, ethnographic observation, or other methods; it may involve qualitative or quantitative analysis; but what defines CMDA at its core is the analysis of logs of verbal interaction (characters, words, utterances, messages, exchanges, threads, archives, etc.). In the broadest sense, any analysis of online behavior that is grounded in empirical, textual observations is computer-mediated discourse analysis.[3]

            The specific approach to computer-mediated discourse analysis described here is informed by a linguistic perspective. That is, it views online behavior through the lens of language, and its interpretations are grounded in observations about language and language use. This perspective is reflected in the application of methodological paradigms that originated in the study of spoken and written language, e.g., conversation analysis, interactional sociolinguistics, pragmatics, text analysis, and critical discourse analysis. It also shapes the kinds of questions that are likely to get asked. Linguists are interested in language structure, meaning, and use, how these vary according to context, how they are learned, and how they change over time. CMDA can be used to study micro-level linguistic phenomena such as online word-formation processes (Cherny, 1999), lexical choice (Ko, 1996; Yates, 1996), sentence structure (Herring, 1998), and language switching among bilingual speakers (Georgakopoulou, in press; Paolillo, 1996). At the same time, a language-focused approach can be used to address macro-level phenomena such as coherence (Herring, 1999a; Panyametheekul, 2001), community (Cherny, 1999), gender equity (Herring, 1993, 1996a, 1999b) and identity (Burkhalter, 1999), as expressed through discourse. Indeed, the potential—and power—of CMDA is that it enables questions of broad social and psychological significance, including notions that would otherwise be intractable to empirical analysis, to be investigated with fine-grained empirical rigor. The present chapter is intended as a practical contribution toward helping researchers realize this potential.

            Because of its practical focus, this chapter will be most useful to readers who already have some study of computer-mediated communication in mind and who have given some thought to how they might approach their investigation. Readers who have made preliminary observations about a behavior (or behaviors) of interest in a specific online environment, and who have collected (or have access to) a relevant corpus of data, will be even better positioned to appreciate the methodological concerns addressed here. At the same time, the chapter is not intended as a step-by-step “how to” guide, but rather as an overview of how a CMDA researcher might conceptualize, design and interpret a research project that involves identifying and counting discourse phenomena in a corpus of computer-mediated text.[4] For details regarding the implementation of specific analytic methods, readers are referred to the research studies cited in the references.

            I begin by providing some historical background on CMDA and the kinds of research that have been carried out in the linguistic CMDA tradition, broadly construed. I then present a detailed overview of one version of the CMDA approach based on the “coding and counting” paradigm of classical content analysis, identifying a set of conceptual skills necessary for carrying out a successful analysis. These skills are illustrated with reference to the problem of analyzing “virtual community” in two professional development sites on the Internet. In concluding, the limits of the coding and counting paradigm, and the CMDA approach as a whole, are identified and future directions are charted.

 

Background

The term “computer-mediated discourse analysis” was first coined in 1995 (see Herring, 2001), although research meeting the definitional criteria for CMDA has been carried out since the mid-1980s (in the linguistic sense: e.g., Murray, 1985, 1988; Severinson Eklundh, 1986), and arguably, as early as the 1970s (in the general sense: Hiltz & Turoff, 1978). Starting in the mid-1990s, and corresponding to the upsurge in computer-mediated communication (CMC) research that followed closely on the heels of the popularization of the Internet (Herring, 2002), an increasing number of researchers began focusing on online discourse as a way to understand the effects of the new medium. However, different researchers approached computer-mediated discourse with different questions, methods, and understandings, often working in isolation from one another—and in the case of researchers outside the United States, unaware that other researchers shared their interests. The present chapter attempts to systematize some of the goals, understandings, and procedures implicitly shared by this emerging cadre of researchers.

            As background to the remainder of the chapter, it is useful to think of CMDA as applying to four domains or levels of language, ranging prototypically from smallest to largest linguistic unit of analysis: 1) structure, 2) meaning, 3) interaction, and 4) social behavior. Structural phenomena include the use of special typography or orthography, novel word formations, and sentence structure. At the meaning level are included the meanings of words, utterances (e.g., speech acts) and larger functional units (e.g., 'macrosegments', Herring, 1996b; cf. Longacre, 1992). The interactional level includes turn-taking, topic development, and other means of negotiating interactive exchanges. The social level includes linguistic expressions of play, conflict, power, and group membership over multiple exchanges. In addition, participation patterns (as measured by frequency and length of messages posted and responses received) in threads or other extended discourse samples constitute a fifth domain of CMDA analysis.
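
To make the participation measures mentioned above concrete, the following is a minimal sketch, not a procedure taken from any of the studies cited here. It assumes a hypothetical log format in which each message is a record with the keys `id`, `author`, `text`, and `reply_to`; these names, and the use of whitespace-delimited words as the unit of length, are illustrative assumptions only.

```python
from collections import defaultdict


def participation_summary(messages):
    """Summarize posting frequency, mean message length, and responses received.

    `messages` is a list of dicts with the hypothetical keys 'id', 'author',
    'text', and 'reply_to' (the id of the message answered, or None).
    """
    # Map each message id to its author so replies can be credited.
    author_of = {m["id"]: m["author"] for m in messages}
    stats = defaultdict(lambda: {"posts": 0, "words": 0, "responses_received": 0})

    for m in messages:
        s = stats[m["author"]]
        s["posts"] += 1
        # Length is measured in whitespace-delimited words (an assumption).
        s["words"] += len(m["text"].split())
        parent = m.get("reply_to")
        if parent in author_of:
            # Credit the author of the message being answered.
            stats[author_of[parent]]["responses_received"] += 1

    return {
        author: {
            "posts": s["posts"],
            "mean_words": s["words"] / s["posts"],
            "responses_received": s["responses_received"],
        }
        for author, s in stats.items()
    }


if __name__ == "__main__":
    sample = [
        {"id": 1, "author": "A", "text": "hello there all", "reply_to": None},
        {"id": 2, "author": "B", "text": "hi", "reply_to": 1},
        {"id": 3, "author": "A", "text": "thanks", "reply_to": 2},
    ]
    for author, row in participation_summary(sample).items():
        print(author, row)
```

In a real study, what counts as a "response" would itself need to be operationalized (for example, through quoting, addressivity, or thread structure), since many systems do not record reply links explicitly. Note also that this sketch counts a participant's replies to her own messages as responses received.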

            The kinds of understandings obtainable through a language-focused approach can be illustrated by summarizing briefly a few studies that focus on phenomena from each domain. Non-standard spelling and typography have been analyzed structurally in Internet Relay Chat as an example of creative play (Danet et al., 1997), on the French Minitel system as an illustration of the tension between efficiency and expressivity (Livia, in press), and in a social MUD as evidence of participants’ “insider” status (Cherny, 1999). Studies that consider what online participants mean by what they say—for example, by classifying their utterances as speech acts—have discovered differences between educational and recreational uses of IRC, as well as differences associated with teacher/leader vs. other roles (Herring & Nix, 1997). Studies of interactional phenomena have identified system-imposed constraints on turn-taking (Herring, 1999a; Panyametheekul, 2001) and topic coherence (Herring & Nix, 1997; Lambiase, in press). One stream of socially-focused CMDA, research on group identity, has identified discourse styles associated with participant age (Ravert, 2001), gender (Hall, 1996; Herring, 1993, 1996a, b, in press a), ethnicity (Paolillo, in press) and race (Burkhalter, 1999; Jacobs-Huey, in press), even in supposedly anonymous text-only CMC. Finally, participation patterns have been observed to vary according to the synchronicity of the medium (Condon & Cech, 2001, in press), and to reveal social influence and dominance in online groups (Herring, in press b; Herring et al., 1992; Hert, 1997; Rafaeli & Sudweeks, 1997). This brief survey is intended to provide a sense of the range and diversity of topics that have been researched thus far using CMDA. More detailed surveys of the findings of previous CMDA research can be found in Herring (2001, 2002).

             

The CMDA Approach

CMDA is best considered an approach, rather than a “theory” or a single “method”. Although the linguistic variant described here is based on a loose set of theoretical premises (those of linguistic discourse analysis, plus a rejection of a priori technological determinism; see below), it is not a theory in that CMDA (as an abstract entity) makes no predictions about the nature of computer-mediated discourse. The findings of CMDA studies neither support nor falsify the premises of the approach, beyond confirming that it is useful or indicating that it is in need of further refinement. Rather, the CMDA approach allows diverse theories about discourse and computer-mediated communication to be entertained and tested. Moreover, although its overall methodological orientation can be characterized (see below), it is not a single method but rather a set of methods from which the researcher selects those best suited to her data and research questions. In short, CMDA as an approach to researching online behavior provides a methodological toolkit and a set of theoretical lenses through which to make observations and interpret the results of empirical analysis.

            The theoretical assumptions underlying CMDA are those of linguistic discourse analysis, broadly construed. First, it is assumed that discourse exhibits recurrent patterns. Patterns in discourse may be produced consciously or unconsciously (Goffman, 1959); in the latter case, a speaker is not necessarily aware of what she is doing, and thus direct observation may produce more reliable generalizations than a self-report of her behavior. A basic goal of discourse analysis is to identify patterns in discourse that are demonstrably present, but that may not be immediately obvious to the casual observer or to the discourse participants themselves. Second, it is assumed that discourse involves speaker choices. These choices are not conditioned by purely linguistic considerations, but rather reflect cognitive (Chafe, 1994) and social (Sacks, 1984) factors. It follows from this assumption that discourse analysis can provide insight into non-linguistic, as well as linguistic, phenomena. To these two assumptions about discourse, CMDA adds a third assumption about online communication: computer-mediated discourse may be, but is not inevitably, shaped by the technological features of computer-mediated communication systems. It is a matter for empirical investigation in what ways, to what extent, and under what circumstances CMC technologies shape the communication that takes place through them (Herring, u.c.).

            The basic methodological orientation of CMDA is language-focused content analysis. This may be purely qualitative—observations of discourse phenomena in a sample of text may be made, illustrated, and discussed—or quantitative—phenomena may be coded and counted, and summaries of their relative frequencies produced. (It should be noted that quantitative CMDA comprises a qualitative component, e.g., in deciding what counts as an instance of a phenomenon to be coded and counted, especially when the phenomena of interest are semantic rather than syntactic (structural) in nature; see Bauer, 2000, and “analytical methods”, below). An example of the quantitative approach is Simeon Yates’ (1996) comparison of a corpus of asynchronous computer conferences with spoken and written English corpora with respect to range of vocabulary, modality, and personal pronoun use. An example of the qualitative approach is Lori Kendall’s (2002) ethnographic, participant-observer study of gendered behavior in a social MUD. An earlier ethnography of a social MUD carried out by Lynn Cherny (1999) applies both approaches, but to different phenomena: qualitative description of novel word creations (Ch.3) and quantitative analysis of turn-taking patterns (Ch.4). Alternatively, Herring (1996b) combines the two approaches: the same patterns of email message structure are identified by both qualitative and quantitative means.[5]
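
The quantitative step described above, in which coded phenomena are counted and their relative frequencies summarized, can be sketched as follows. This is an illustration under stated assumptions rather than a method from the studies cited: the category labels in the example (`inform`, `inquire`, `joke`) are hypothetical, and the qualitative work of assigning a code to each unit is assumed to have been done already.

```python
from collections import Counter


def relative_frequencies(codes):
    """Return each coding category's share of all coded units.

    `codes` is a sequence with one category label per coded unit
    (e.g., one label per utterance). An empty sequence yields {}.
    """
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical codes assigned to four utterances.
    coded_units = ["inform", "inquire", "inform", "joke"]
    print(relative_frequencies(coded_units))
```

Reporting proportions rather than raw counts makes samples of different sizes comparable, which matters when two environments are contrasted.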

            As with other forms of content analysis, the CMDA researcher must meet certain basic requirements in order to conduct a successful (i.e., valid, coherent, convincing) analysis. She must pose a research question that is in principle answerable. She must select methods that address the research question, and apply them to a sufficient and appropriate corpus of data. If a “coding and counting” approach is taken, she must operationalize the phenomena to be coded, create coding categories, and establish their reliability, e.g., by getting multiple raters to agree on how they should be applied to a sample of the data. If statistical methods of analysis are to be used, appropriate statistical tests must be identified and applied. Finally, the findings must be interpreted responsibly and in relation to the original research question. These requirements have been discussed extensively in the literature on the conduct of empirical research (see, e.g., Alford, 1998 for research in sociology; Bauer, 2000 for content analysis methods in communication); a basic familiarity with them is assumed here. Of interest in the present chapter is how to apply this general research schema to the particular constellation of issues and challenges associated with the study of computer-mediated behavior.
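
The reliability requirement can be illustrated with a short sketch. The chapter does not name a particular agreement statistic; the example below uses Cohen's kappa, a widely used chance-corrected measure for two raters, purely as an assumption for illustration. The function name and the sample codes are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters coding the same units.

    `rater_a` and `rater_b` are equal-length sequences of category labels,
    one label per coded unit, in the same order.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must code the same non-empty set of units.")

    n = len(rater_a)
    # Observed agreement: proportion of units given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement by chance, from each rater's marginal distribution.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical codes from two raters for four units.
    a = ["x", "x", "y", "y"]
    b = ["x", "y", "y", "y"]
    print(cohens_kappa(a, b))
```

Simple percent agreement would overstate reliability when one category dominates the sample, which is the motivation for correcting for chance here. What level of agreement counts as acceptable is a judgment the researcher must justify for her own data.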

As an illustration of the CMDA approach, the following sections consider a currently popular research theme—that of “virtual community”—and how CMDA can be applied to determine empirically whether a group of people interacting online constitutes a community. In keeping with the focus of this volume on learning, the two online environments chosen for illustration have professional development as their reason for existence and both are associated with educational contexts: secondary science and mathematics education in the first case, and tertiary linguistics education and research in the second. To address the volume’s focus on system design, the environments were selected to contrast in their technological affordances (one is a multimodal Web site, the other a text-based listserv); furthermore, one was intentionally designed with the goal of creating community, whereas the other was not. A comparison of these two environments can shed light on how the technological and social properties of CMC systems relate to the phenomenon of virtual community.

 

Analyzing “Virtual Community”

Since it was first articulated in print (Rheingold, 1993), the concept of “virtual community” has become increasingly fashionable in Internet research (e.g., Baym, 1995a; Cherny, 1999; Werry & Mowbray, 2001), although it has also been criticized (Fernback & Thompson, 1995; Jones, 1995a; see also Kling & Courtright, this volume). The criticisms include a pragmatic concern that the term has been overextended to the point of becoming meaningless—for some writers, it seems that any online group automatically becomes a “community”—and a philosophical skepticism that virtual community can exist at all, given the fluid membership, reduced social accountability, and lack of shared geographical space that characterize most groups on the Internet (e.g., McLaughlin et al., 1995). For the purposes of the present discussion, we assume that virtual community is possible, but that not all online groups constitute virtual communities. The task of the researcher then becomes to determine the properties of virtual communities, and to assess the extent to which they are (or are not) realized by specific online groups.

 

Two Learning Environments

Two online professional development environments will serve as examples to ground our discussion of how CMDA can be applied to investigate virtual community. Professional development environments are online learning environments in which people participate voluntarily and intermittently—i.e., for the purpose of acquiring information and skills to advance professionally—rather than in formal courses with students, instructors, and syllabi, as is the case for distance education. In successful cases, participation in such environments is continuous and self-sustaining, unlike course-based CMC which is task-focused and temporally bounded. An example of a genre of professional development environment that dates back to the early days of computer networking is listserv discussion groups for professionals in academic disciplines (e.g., Hert, 1997; Korenman & Wyatt, 1996). A more recent example is the growing genre of professional development Web sites that combine discussion forums with access to documents and other online resources (e.g., Renninger, this volume).

            The environments selected as illustrations for this chapter represent these two types. The first, the Linguist List, was founded in November 1990 by a husband-and-wife team of academic linguists as a means for disseminating information and engaging in public discussion about issues of interest to professional (and aspiring professional) linguists; it has been in continuous existence ever since. Originally a text-only, by-subscription list that made archived messages available only to subscribers, in 1994 it established a Web site and posted the discussion archives there, making them publicly accessible.[6] For further description and analysis of the Linguist List, see Herring (1992, 1996b). The second environment, the Inquiry Learning Forum (ILF), was opened to registered members in March 2000. It was designed with National Science Foundation support by a team of faculty and graduate students in the School of Education at Indiana University, with the explicit goal of fostering online community among secondary math and science in-service and pre-service teachers interested in the inquiry learning approach (National Research Council, 2000). Members must go to the ILF Web site to post messages and access the other resources there (which include videos of teachers using inquiry methods in their classrooms); past messages remain on the site alongside current messages. For further description and analysis of the ILF, see Barab, MaKinster, & Scheckler (this volume) and Herring, Martinson & Scheckler (2002).

            These environments are plausible candidates for virtual community status in several respects. First, both bring together people who arguably already constitute real-world professional communities: academic linguists and secondary math and science educators. Second, their online participation is centered around a shared professional focus, as in Wenger’s (1998) “communities of practice.” Third, the Linguist List is active and long-lived, which some might take as prima facie evidence that it has achieved online community status. In contrast, the ILF has struggled to establish and maintain an active level of participation, but might be considered to have a prima facie claim to community status on the grounds that it was explicitly designed to support community (Barab, MaKinster, Moore, Cunningham, & The ILF Design Team, in press). For these reasons, it is germane to ask: To what extent does participation in these two environments in fact constitute “community” (as opposed to being simply “people interacting online”)?

            The following sections describe how a researcher making use of CMDA might go about addressing this question. Five conceptual skills involved in the research process are highlighted and discussed, first, with reference to CMDA in general, second, with reference to virtual communities, and last, with reference to the two professional development sites. The order of presentation of the five skills is roughly sequential (i.e., a researcher generally starts with the first, and progresses to the last), although the research process—in CMDA, no less than in other scientific disciplines—is frequently iterative, involving many feedback loops (Harwood et al., 2001). However, it is important to stress that what follows is not intended as an analysis in and of itself; to answer the question of what constitutes online community definitively would take us well beyond the scope of the present chapter.

 

Research Questions

To carry out an investigation by means of CMDA, it is first necessary to have a research question, a problem to which the analyst desires to find a solution. Typically, the research question is based on prior observation—the researcher may have noticed some online behavior or behaviors and may have formed a preliminary hypothesis concerning them. Articulating a research question is a first step towards testing the hypothesis.

            A good CMDA research question has four characteristics:

            1)  it is empirically answerable from the available data;

            2)  it is non-trivial;

            3)  it is motivated by a hypothesis; and

            4)  it is open-ended.

Each of these characteristics is discussed below.

A CMDA research question should ideally ask about empirically observable phenomena, or phenomena that can be operationalized empirically, as opposed to purely subjective or evaluative ones. A question about the nature and frequency of joking in an online forum, for example, can be addressed empirically more readily than a question about whether the participants are having fun. Further, the question should be answerable from the data selected for analysis. For example, if only computer-mediated data are to be examined, the question should not ask whether CMC is better or worse than face-to-face communication along some dimension of comparison, since the CMC data cannot tell us anything directly about face-to-face communication. Equally important in CMDA, the question should be answerable on the basis of textual evidence. Text is direct evidence of behavior, but it can only be indirect evidence of what people know, feel, or think. If it is important that the researcher try to understand participants’ internal conscious or unconscious states, CMDA should be supplemented with other methods of analysis such as interviews or psychological experiments.

A good research question should be non-trivial; that is, the answer should be of some ostensible interest to at least a portion of the larger research community, and not already known in advance. Additionally, the research question should not be worded so as to presuppose an answer; that is, the answer should not appear to be a foregone conclusion.

At the same time, a research question motivated by a hypothesis—even if it is no more than an informal hunch—is more interesting and more interpretable than one that is not. Note that it is not necessary to posit a hypothesis that the researcher expects will be confirmed by the results of the analysis, although the hypothesis should be prima facie plausible. In some cases, a researcher may advance a popular hypothesis that she suspects is incorrect, in order to disprove it. For example, she might postulate that participant gender is invisible in CMC (a commonly held view in the early 1990s, based on the paucity of social status cues in text-only CMC), suspecting that such is not the case in her data.[7] The empirical results, if negative, are all the more illuminating for running counter to the prevailing wisdom.

Ideally, whether the researcher’s hypothesis is supported or not, the results of the study should contribute new knowledge. Phrasing the question as an open-ended question (what, why, when, where, who, how) leaves the door open to unexpected findings to a greater extent than closed (yes/no) questions, generally speaking. One caveat is that unexpected answers to yes/no questions can be informative, as noted above, when the hypothesis underlying the question is favored by popular opinion or common sense, but receives no empirical support. Similarly, positive support for an unobvious hypothesis can also cause us to understand the world in new ways. However, support for obvious hypotheses does not advance knowledge, nor does lack of support for unobvious hypotheses. In contrast, a systematic study will always reveal something new in response to a well-crafted “what”, “why”, or “how” question.

What kinds of questions about virtual community can be researched from a CMDA perspective? Although all are legitimate foci of intellectual curiosity, the researcher is setting herself up for difficulty if she asks questions such as: i) “Does virtual community exist?” ii) “Is virtual community a good thing?” iii) “Does membership in virtual communities satisfy needs previously satisfied only in face-to-face communities?” or iv) “Do people interact regularly in groups online?” Note, first of all, that these are closed questions, to which the answer can only be “yes” or “no”. In addition, the first is effectively biased towards an affirmative answer, in that exhaustive evidence would be required in order to answer it negatively. The second question both presupposes the existence of virtual community (a problem if virtual community hasn’t already been empirically demonstrated) and asks a subjective, evaluative question about it; “goodness” is difficult to measure empirically. The third question involves a comparison; it can only be answered if empirical evidence (gathered by comparable means) is available from both “virtual communities” (presupposed to exist) and face-to-face communities. Finally, the fourth question, although neutrally worded and answerable, is trivial: the answer is obvious to anyone who has spent any time on the Internet.

The following, in contrast, are examples of open-ended questions that can usefully be addressed using CMDA: a) “What are the discourse characteristics of a virtual community?” b) “What causes an online group to become a community?” c) “What causes a virtual community to die?” d) “How do virtual communities differ from face-to-face communities?”[8] e) “What happens to face-to-face communities when they go online?” and f) “In what ways do communities constituted exclusively online differ from online communities that also meet face-to-face?” However, these questions are not all equally easy to answer; their answerability depends on the data available for investigation. Thus, for example, a)-d) and f) require an independent determination of virtual community, e.g., in terms of participants’ perceptions; b), c), and e) require longitudinal data; and d) and e) require face-to-face data (see discussion of “data” below).

In addition, particular data samples will generally exhibit characteristics that invite more specific questions to be asked about them. The question raised in the previous section—“[t]o what extent does participation in these two environments constitute ‘community’ (as opposed to being simply ‘people interacting online’)?”—is a straightforward application of question (a) to the Linguist List and the ILF data samples. But these samples, by their nature, also give rise to questions about virtual community and professional development (e.g., “What is the nature of virtual community in professional development environments, and how does it differ from virtual community in structured learning environments / unstructured social environments / etc.?”). Furthermore, the two environments contrast according to a number of technological and social dimensions, as summarized in Table 1.[9] Additional questions can be asked to focus on the contributing effects of a particular dimension to online behavior (e.g., “Is a multimodal environment more conducive to virtual community than a text-only environment?”; or “How does the self-presentation of the group ‘owners’ (e.g., as peers or as experts) affect the likelihood that a group will develop community characteristics?”).

 

Table 1. Dimensions of contrast between the Linguist List and the ILF

| Linguist List | ILF |
| --- | --- |
| Text-only | Multimodal (text + video + limited audio and graphics) |
| Messages come to subscriber (“push” technology) | Member must go to site to post messages (“pull” technology) |
| Archives stored separately | Past messages appear alongside current ones |
| Public (by subscription) | Semi-public (by registration; password required; limited membership) |
| Pre-existing face-to-face “community” (meets at annual professional meeting) | Loosely defined pre-existing “community” (most members have never met face-to-face) |
| Relatively homogeneous population of users (academic linguists at universities) with similar access opportunities | Heterogeneous population of users (pre-service teachers; in-service teachers; ILF researchers) with differential access |
| Founders’ goals were specific, limited in scope (i.e., information exchange & discussion) | Creators’ goals were broad, ambitious (i.e., create intentional community; foster inquiry learning) |
| Moderators present themselves as peers, “facilitators” (but exercise behind-the-scene control over postings) | ILF development team members have higher status (but post messages themselves, and do not control postings) |
| Discussion is on topics selected by participants | Discussion is often focused around artifacts (video clips; instructional technology; lesson plans, etc.) |

 

The comparison of the two groups in Table 1 in fact suggests too many possible questions about the variables that condition virtual community. Ideally, two data samples that are compared should differ according to only one dimension, such that if differences in behavior are found between the samples, they can plausibly be attributed to that dimension of variation. If, however, it turns out that either the Linguist List or the ILF exhibits more “community” behaviors than the other, to what should the difference be attributed: (multi)modality? ease of posting messages? ease of access to the group’s history? availability of face-to-face interaction? the intentions/behavior of the group’s founders? etc. Causal indeterminacy is a common problem in research that analyzes naturally occurring behavior.[10] The experimental research paradigm controls for this by holding all variables constant except for the variable that is hypothesized to condition the experimental result. For examples of experimental research that makes use of CMDA methods, see Condon & Cech (1996a, 1996b, 2001).

 

Data Selection

In CMDA, as in other empirical social science approaches, a data sample must be selected that is appropriate to the study. By “appropriate” is meant that the sample should be of a nature and size to answer the research question(s); if the research question involves a comparison, more than one sample may be required. Each of these considerations is discussed below. For the purposes of this discussion, it is assumed that the data of interest are produced naturally (i.e., by online discourse participants for their own purposes), and logged or culled from online archives by the researcher, rather than elicited experimentally.

            It is often impossible to examine all the phenomena of relevance to a particular research question; this is especially true in CMDA, for which a vast amount of textual data is available in the form of online interactions. (Even in groups with relatively low participation, such as the ILF in its first year, the total amount of text quickly adds up to more than can easily be analyzed by a human coder using micro-linguistic methods.) For this reason, the researcher must usually select a sample from the totality of the available data. In CMDA, this is rarely done randomly, since random sampling sacrifices context, and context is important in interpreting discourse analysis results. Rather, data samples tend to be motivated (e.g., selected according to theme, time, phenomenon, individual or group), or samples of convenience (i.e., what the researcher happens to have access to at the time). Some advantages and disadvantages of these various sampling techniques are summarized in Table 2.

 

Table 2. CMDA data sampling techniques

 

Sampling technique | Advantages | Disadvantages
------------------ | ---------- | -------------
Random (e.g., each message selected or not by a coin toss) | representativeness; generalizability | loss of context & coherence; requires complete data set to draw from
By theme (e.g., all messages in a particular thread) | topical coherence; a data set free of extraneous messages | excludes other activities that occur at the same time
By time (e.g., all messages in a particular day/week/month) | rich in context; necessary for longitudinal analysis | may truncate interactions, and/or result in very large samples
By phenomenon (e.g., only instances of joking; conflict negotiation) | enables in-depth analysis of the phenomenon (useful when phenomenon is rare) | loss of context; no conclusions possible re: distribution
By individual or group (all messages posted by an individual or members of a demographic group, e.g., women, students) | enables focus on individual or group (useful for comparing across individuals or groups) | loss of context (especially temporal sequence relations); no conclusions possible re: interaction
Convenience (whatever data are available to hand) | convenience | unsystematic; sample may not be best suited to the purposes of the study

 

Of the techniques in Table 2, temporal sampling preserves the richest context. If a long enough continuous time period is captured, the sample will most likely include coherent threads, thereby incorporating the advantages of thematic sampling as well. Analogously, a thematic sample is typically organized by time, enabling some longitudinal observations to be made. Because of their multiple advantages, these two sample types are favored in CMDA research. In addition, it is possible to break a sample of any type down by individual or group, thereby achieving additional focus while avoiding the disadvantages of individual or group sampling. (For example, an extended thread was isolated for analysis from the Linguist List, then broken down by gender of participants, in Herring, 1992, 1996b.)
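To make the contrast between sampling techniques concrete, the following Python sketch applies random, thematic, and temporal selection to a small set of message records, and then breaks a temporal sample down by author. It is only an illustration: the record fields (id, thread, author, date), the toy data, and all function names are assumptions introduced here, not part of any CMDA toolkit.

```python
import random
from datetime import datetime

# Hypothetical message log; field names and values are illustrative assumptions.
messages = [
    {"id": 1, "thread": "syntax", "author": "A", "date": datetime(2001, 3, 1)},
    {"id": 2, "thread": "jobs", "author": "B", "date": datetime(2001, 3, 2)},
    {"id": 3, "thread": "syntax", "author": "B", "date": datetime(2001, 3, 3)},
    {"id": 4, "thread": "jobs", "author": "A", "date": datetime(2001, 3, 5)},
    {"id": 5, "thread": "syntax", "author": "A", "date": datetime(2001, 3, 10)},
    {"id": 6, "thread": "jobs", "author": "B", "date": datetime(2001, 3, 12)},
]

def sample_random(msgs, rate=0.5, seed=0):
    """Random sample: each message is kept or dropped by a seeded coin toss."""
    rng = random.Random(seed)
    return [m for m in msgs if rng.random() < rate]

def sample_by_theme(msgs, thread):
    """Thematic sample: all messages belonging to one thread."""
    return [m for m in msgs if m["thread"] == thread]

def sample_by_time(msgs, start, end):
    """Temporal sample: all messages with start <= date < end, in time order."""
    selected = [m for m in msgs if start <= m["date"] < end]
    return sorted(selected, key=lambda m: m["date"])

def break_down_by(msgs, key):
    """Split an existing sample into sub-samples (e.g., by author or group)."""
    groups = {}
    for m in msgs:
        groups.setdefault(m[key], []).append(m)
    return groups

# One continuous week, then broken down by author without losing the
# surrounding temporal context of the full sample.
week = sample_by_time(messages, datetime(2001, 3, 1), datetime(2001, 3, 8))
by_author = break_down_by(week, "author")
```

Note that the temporal sample keeps messages from both threads in their original order, whereas the random sample gives no guarantee that any thread survives intact.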

            The richest possible context is required for the purposes of analyzing virtual community, as are data that can show change over time, if questions about the inception, evolution, and demise of virtual communities are to be addressed. The sample should include, as much as is possible, the typical activities carried out on the site. These considerations suggest intermittent time-based sampling (e.g., several weeks at a time at intervals throughout a year) as particularly appropriate.[11] Ideally, in any analysis of virtual community, textual analysis would be supplemented by ongoing participant observation.[12]
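As a rough illustration of intermittent time-based sampling, the sketch below generates collection windows of a fixed length at regular intervals across a study period. The two-week window, the 13-week interval, and the function names are assumptions chosen for the example; an actual study would set them according to the rhythm of activity on the site.

```python
from datetime import date, timedelta

def intermittent_windows(start, end, window_days, every_days):
    """Return (window_start, window_end) pairs: a window of `window_days`
    opening every `every_days` days, clipped to the period [start, end)."""
    windows = []
    current = start
    while current < end:
        window_end = min(current + timedelta(days=window_days), end)
        windows.append((current, window_end))
        current += timedelta(days=every_days)
    return windows

def in_any_window(day, windows):
    """True if `day` falls inside one of the sampling windows."""
    return any(lo <= day < hi for lo, hi in windows)

# Two weeks of data every 13 weeks over (most of) one year.
windows = intermittent_windows(
    date(2001, 1, 1), date(2001, 12, 31), window_days=14, every_days=91
)
for lo, hi in windows:
    print(lo, "to", hi)
```

Messages would then be retained only if their posting date satisfies in_any_window, which keeps each window internally continuous while still spanning the year.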

            The ILF environment imposes some limitations on sampling, as well as suggesting alternative sampling possibilities. Discussions take place in different parts of the ILF site, making it difficult to capture a representative overall time-based sample; rather, samples must be collected from individual “rooms” and collated, if a single sample is required. Moreover, discussions in the “classroom” portion of the ILF site are organized around videos of teachers using inquiry methods in their classrooms, with one discussion forum attached to each video (Herring et al., 2002). This configuration suggests new categories of data sampling: by room, and by artifact (in this case, video). A sampling technique based on units of interaction determined by the site design (and/or by participants’ actual usage) has the advantage of allowing discourse patterns to emerge that are internally coherent to such units, whereas if data are combined across units, those patterns might be less apparent.

            How much data is required to conduct a successful CMDA study? There is no simple answer to this question. The data should be sufficient to address the research question, such that tests of statistical significance could meaningfully be conducted on the key findings (regardless of whether or not the researcher actually conducts such tests). What counts as a sufficient amount of data will depend, therefore, on the frequency of occurrence of the analytical phenomenon in the data sample, the number of coding categories employed to describe the phenomenon, and the number of external factors that are allowed to vary (e.g., modality; topic of discussion; participant gender). Two general rules of thumb are 1) the more infrequent the phenomenon in the data, the larger the sample should be, and 2) the more variables considered in the analysis, the larger the sample should be. This is so that 1) enough instances of the phenomenon are available to analyze, and 2) when the sample is broken down into sub-samples for purposes of comparison, there are still enough instances in each category to allow for statistical testing.[13] Since it is often difficult to know all of this in advance, a recommended practice is to start with a pilot study based on a small amount of data, and expand the sample size as necessary in a larger study, according to the tendencies revealed in the pilot study.
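The two rules of thumb can be turned into a back-of-the-envelope calculation, sketched below in Python. The sketch rests on assumptions that are not from this chapter: a minimum of five instances per cell (a common convention for chi-square tests) and an even spread of instances across cells, which is optimistic. The function name and parameters are hypothetical, and the result is a lower bound to inform a pilot study, not a substitute for one.

```python
def messages_needed(messages_per_instance, n_categories, n_subsamples,
                    min_per_cell=5):
    """Rough lower bound on sample size, in messages.

    messages_per_instance: how many messages contain one instance of the
        phenomenon on average (larger = rarer phenomenon).
    n_categories: number of coding categories for the phenomenon.
    n_subsamples: number of sub-samples compared (e.g., 2 genders).
    min_per_cell: assumed minimum instances per category x sub-sample cell.
    """
    if min(messages_per_instance, n_categories, n_subsamples, min_per_cell) < 1:
        raise ValueError("all arguments must be at least 1")
    cells = n_categories * n_subsamples
    return cells * min_per_cell * messages_per_instance

# Rule 1: the rarer the phenomenon, the larger the sample.
print(messages_needed(10, n_categories=4, n_subsamples=2))  # 400
print(messages_needed(50, n_categories=4, n_subsamples=2))  # 2000

# Rule 2: the more variables (sub-samples), the larger the sample.
print(messages_needed(10, n_categories=4, n_subsamples=4))  # 800
```

Because real distributions are rarely even, the rarest category in the pilot data is usually the better guide to how far the sample must be expanded.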

            A related issue concerns the number of samples required for purposes of comparative analysis. Above we noted that some CMDA research questions presuppose a comparison with face-to-face discourse. While it may be legitimate to draw a comparison with previous research on face-to-face communication in interpreting one’s results (see “interpretation” below), no key results should be founded on such a comparison, unless the researcher can ensure that the face-to-face study was carried out using comparable methods (e.g., because it was conducted by the researcher herself, or because the same methods that were applied in the face-to-face study were applied to the computer-mediated data). Otherwise, a comparable face-to-face sample is normally required. What the researcher hopes to find are cases in which the same people are communicating about the same topics, for the same purposes, both face-to-face and via CMC. Unfortunately, this situation rarely occurs naturally. Left to their own devices, people tend to use different modalities for different communicative purposes; moreover, CMC enables certain behaviors that would be difficult or impossible offline,[14] and vice versa. Data collected in experimental settings are superior to naturally-occurring data for the purposes of comparing CMC with face-to-face (and traditional written) communication (see, e.g., Condon & Cech, 1996a, 1996b, 2001). However, since evidence of community is highly unlikely to surface in laboratory settings, given that experimental subjects typically have no past (or anticipated future) interaction (Walther, 1996), empirical comparison of face-to-face and online community is difficult. This may be one question for which interpretive, rather than strictly empirical, answers will have to suffice for the present time (cf. Etzioni, 1999).

            Multiple CMC samples (or sub-samples) may also be required in order to carry out a single study, depending on the research question. These are usually easier to collect, but care should be taken to hold constant as many dimensions of variation as possible, to maximize the interpretability of the results. Our two professional development samples in fact vary according to too many dimensions to enable straightforward comparison, as noted above. A better example of contrasting samples is Paolillo’s (in press) comparison of a(n asynchronous) Usenet newsgroup and a (synchronous) IRC channel frequented by the same participant demographic group (and to some extent, the same individuals): expatriate South Asians. When differences are found in language choice in the two samples, they can plausibly be attributed to differences in synchronicity between the two CMC modes.

Dividing a larger sample into sub-samples by demographic group, topic, or other category is another means to ensure that the sub-samples share all but one feature. Applying this principle to research on virtual community, we might, for example, compare the behaviors of individuals within a single group who are known to interact face-to-face with other group members, with those individuals who do not, to test the hypothesis that face-to-face contact enhances involvement in online community (cf. Diani, 2000). Or we might consider participant behavior by role or status in relation to hypothesized community behaviors. In the case of the Linguist List, the behavior of professors might be compared with that of students, or U.S. linguists with non-U.S. linguists; in the ILF, pre-service teachers might be compared with in-service teachers, and teachers with researchers, to determine if higher status groups are more invested in the “community” than lower status groups.[15]
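A sub-sample comparison of this kind amounts to computing, for each group, the proportion of messages that exhibit a coded behavior. The Python sketch below shows one way this might be done; the roles, the coded feature, and the data are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical coded data: (participant role, message shows the coded
# behavior, e.g., an act of support). Values are invented for illustration.
coded = [
    ("professor", True), ("professor", True),
    ("professor", False), ("professor", False),
    ("student", True), ("student", False),
    ("student", False), ("student", False),
]

def rate_by_group(coded_messages):
    """Proportion of messages per group that exhibit the coded behavior."""
    totals, hits = Counter(), Counter()
    for group, has_feature in coded_messages:
        totals[group] += 1
        if has_feature:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

rates = rate_by_group(coded)
print(rates)
```

Because both sub-samples come from the same list, other dimensions (modality, archive access, moderation) are held constant, so a difference in rates can more plausibly be attributed to role.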

 

Operationalization of Key Concepts

The coding and counting approach to CMDA research described in this chapter requires that key concepts be operationalizable (and operationalized) in empirically measurable terms. This entails defining the concepts unambiguously, such that another researcher, examining the same data, could in principle reproduce the identification of a given token as an exemplar of the concept.[16] Equally or more important, it is necessary to define a concept in concrete, textual terms in order to be able to code it consistently. In the case of highly abstract concepts, this necessarily entails a reduction (and a risk of distortion) of the concept; content analysis is sometimes criticized on these grounds (cf. Bauer, 2000). At the same time, it is the requirement of operationalization, more than any other single requirement, that lends CMDA its rigor and makes it a useful tool for getting an empirical grasp on otherwise slippery or intractable concepts.
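One common way to check whether a second researcher can reproduce coding decisions is to compute a chance-corrected agreement statistic over two coders' labels for the same tokens. The sketch below uses Cohen's kappa, a standard measure that this chapter does not itself prescribe; the category labels and data are invented for illustration.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders' labels on the same sequence of tokens."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(codes_a)
    # Observed agreement: share of tokens given the same label by both coders.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Expected agreement by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used one and the same category throughout
    return (observed - expected) / (1 - expected)

# Ten messages coded by two coders; they disagree on one message.
coder_1 = ["support"] * 5 + ["other"] * 5
coder_2 = ["support"] * 4 + ["other"] * 6
print(round(cohens_kappa(coder_1, coder_2), 2))
```

Low agreement on a category is usually a sign that its textual definition needs to be tightened before full-scale coding begins.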

            Concepts vary in the degree to which they are inherently operationalizable. This can be represented as a continuum, as in Figure 1. In a previous section, it was suggested that a researcher should avoid asking questions about concepts that are too far towards the subjective, abstract end of the continuum. In fact, such questions are often the most interesting to ask, but in order to address them quantitatively using CMDA, they must be defined in terms of textual phenomena that can be directly observed, coded, and counted. Thus, for example, concepts of widespread interest in CMC research such as affect, democracy, depth (of discussion), empowerment, learning, trust, etc. can be operationalized by identifying discourse behaviors (plausibly) characteristic of each phenomenon and then articulating interpretive links between those behaviors and the larger concepts. (We will see how this might be done for the concept of virtual community below.) Alternatively, it might be necessary to supplement CMDA with other methods in order to make a meaningful demonstration that the evidence addresses the concept. For example, it is unlikely that CMC evidence alone could make a definitive case for changes in offline states of affairs; such a demonstration would normally require offline evidence, observational or self-reported.

 

Figure 1.  Continuum of operationalizability

 

More operationalizable                                                                       Less operationalizable

<------------------------------------------------------------------------------------------------------>

external, directly observable behavior                                           internal, subjective states

concrete, bounded, measurable                                        abstract, ambiguous, generalized

directly related to coding categories                    not obviously related to coding categories

 

            “Community” is an inherently abstract concept. It also has a subjective component, especially when it is applied to online contexts, where it is always, in some sense, a metaphorical extension of the literal meaning of community as “grounded in a shared physical space” (cf. Jones, 1995a). Accordingly, definitions of community (and virtual community) abound, although Wellman’s (2001) tripartite characterization of community as providing “sociability, support, and identity” constitutes a useful point of departure. More specifically, six sets of criteria can be identified from the literature on virtual community (e.g., Haythornthwaite et al., 2000; Jones, 1995a, 1995b; Reid, 1991, 1994, 1998; Riel, this volume):

            1)  active, self-sustaining participation; a core of regular participants

            2)  shared history, purpose, culture, norms and values

            3)  solidarity, support, reciprocity

            4)  criticism, conflict, means of conflict resolution

            5)  self-awareness of group as an entity distinct from other groups

            6)  emergence of roles, hierarchy, governance, rituals

Criteria 1) and 4) relate to “sociability”; criteria 3) and 6) (loosely) to “support”, and criteria 2) and 5) to “identity.”[17]

            These six criteria suggest concrete ways in which the notion of “virtual community” might be broken down into component behaviors that can be objectively assessed.

1) Participation can be measured over time, and core participants identified on the basis of frequency of posting and rate of response received to messages posted (Herring, in press b), or via text-based social network analysis (Paolillo, 2001; cf. Koku & Wellman, this volume).

2) Shared history can be assessed through the availability and use of archives (Millen, 2000). Culture is indexed through the use of group-specific abbreviations, jargon, and language routines (Baym, 1995a; Cherny, 1999; Jacobs-Huey, in press; Kendall, 1996), as well as through choice of language, register, and dialect (Georgakopoulou, in press; Paolillo, 1996). Norms and values are revealed through an examination of netiquette statements (Herring, 1996a), FAQs (Voth, 1999) and verbal reactions to violations of appropriate conduct (McLaughlin et al., 1995; Weber, in press).

3) Solidarity can be measured through the use of verbal humor (Baym, 1995b); support through speech act analysis focusing, e.g., on acts of positive politeness (Herring, 1994); and reciprocity through analysis of turn initiation and response (Rafaeli & Sudweeks, 1997).

4) Criticism and conflict can be analyzed through speech acts violating positive politeness (Herring, 1994). Conflict resolution might usefully be considered as an interactive sequence of acts (cf. Condon & Cech, 1996b on decision-making sequences); it also lends itself to ethnographic analysis (e.g., Cherny, 1999).

5) A group’s self-awareness can be manifested in its members’ references to the group as a group, and in “us vs. them” language, particularly in statements to the effect, “We do things this way here” (implying an awareness that they might be done differently elsewhere; Weber, in press). (See also “norms” above.)

6) Evidence of roles and hierarchy can be adduced through participation patterns (see “participation” above) and speech act analysis (e.g., Herring & Nix, 1997, which considers the different acts performed by group leaders and non-leaders). The study of governance and ritual would appear to require an ethnographic approach in which a group’s practices are observed over time and described in terms of their meanings to participants (Cherny, 1999; Jacobson, 1996; Kolko & Reid, 1998). Note, however, that the reification of cultural practices in the form of governance and ritual appears to represent a relatively advanced stage of community (see, e.g., Dibbell’s 1993 account of how this happened in LambdaMOO); thus it probably should not be taken as part of the basic definition of virtual community.
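The participation measures named under criterion 1) lend themselves to a simple computational sketch. The Python example below counts each author's postings and the share of those postings that received at least one reply, then applies thresholds to pick out candidate core participants. The data layout, the definition of response rate, and the thresholds are all assumptions made for illustration; they are not the specific measures used in the studies cited.

```python
from collections import Counter

# Hypothetical thread data: (message id, author, id of message replied to).
posts = [
    (1, "ann", None),
    (2, "bob", 1),
    (3, "ann", 2),
    (4, "cat", 1),
    (5, "dan", None),
    (6, "ann", 4),
]

def participation_profile(posts):
    """Per author: number of posts, and share of them that drew a reply."""
    author_of = {mid: author for mid, author, _ in posts}
    posted = Counter(author for _, author, _ in posts)
    # Set of messages that received at least one reply.
    answered = {parent for _, _, parent in posts if parent is not None}
    got_reply = Counter(author_of[mid] for mid in answered if mid in author_of)
    return {
        author: {
            "posts": posted[author],
            "response_rate": got_reply[author] / posted[author],
        }
        for author in posted
    }

def core_participants(profile, min_posts, min_response_rate):
    """Authors meeting both (assumed) thresholds, in alphabetical order."""
    return sorted(
        author for author, stats in profile.items()
        if stats["posts"] >= min_posts
        and stats["response_rate"] >= min_response_rate
    )

profile = participation_profile(posts)
core = core_participants(profile, min_posts=2, min_response_rate=0.3)
print(core)
```

Tracking these figures separately for successive time slices would show whether the core is stable, which bears on the "self-sustaining" part of the criterion.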

Some of the above features are more useful than others as potential indicators of virtual community on the Linguist List and the ILF. Certain features occur rarely or not at all in either group: language routines, code switching, humor, and governance and ritual. Their relative absence is due to a variety of circumstances, for example the professional (serious) focus of the groups, and the fact that their members are proficient in written English.[18] Other features occur only or nearly exclusively on the Linguist List, e.g., criticism, conflict, and netiquette statements.[19] Conversely, such features as participation patterns, reciprocity, indicators of group self-awareness, and evidence of roles and hierarchy are evident in both and might usefully be assessed as community indicators for these environments.
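For an indicator such as group self-awareness, a first pass over the data might simply flag messages containing candidate self-references for later manual coding. The Python sketch below does this with a small pattern list; the patterns and function name are assumptions, and the output is a crude screening measure, since forms like "we" or "us" often do not refer to the group at all (and "us" will also match "US").

```python
import re

# Assumed candidate markers of group self-reference; a real study would
# derive and validate its own list from the data.
GROUP_REFERENCE = re.compile(
    r"\b(we|us|our|ours|this list|this group)\b", re.IGNORECASE
)

def group_reference_rate(messages):
    """Share of messages with at least one candidate group self-reference."""
    if not messages:
        return 0.0
    flagged = [m for m in messages if GROUP_REFERENCE.search(m)]
    return len(flagged) / len(messages)

sample = [
    "We do things this way here.",
    "Has anyone read the new grammar?",
    "On this list, flames are discouraged.",
    "Thanks for the reference.",
]
print(group_reference_rate(sample))
```

Each flagged message would still need to be read in context before being counted as evidence of self-awareness.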

 

Analytical Methods

Analytical methods in CMDA are drawn from discourse analysis and other language-related paradigms, adapted to address the properties of computer-mediated communication. In principle, nearly any language-related method could be so adapted; in practice, this chapter focuses on methods of linguistic discourse analysis, these being the methods with which the author is most familiar. These include approaches traditionally used to analyze written text and spoken conversation, approaches to discourse as social interaction, and critical (socio-political) approaches.

            Given that we have already identified content analysis as the basic methodological apparatus of CMDA, the question might arise as to what the more specific linguistic approaches add to the research endeavor. In fact, it is possible to conduct a perfectly responsible CMDA analysis without drawing on any more specific paradigm than language-focused content analysis. For example, one could let the phenomenon of interest emerge out of a sample of computer-mediated data and devise coding categories on the basis of the observed phenomenon, as in the grounded theory approach (Glaser & Strauss, 1967). This approach is especially well suited to analyzing new and as yet relatively undescribed forms of CMC, in that it allows the researcher to remain open to the possibility of discovering novel phenomena, rather than making the assumption in advance that certain categories of phenomena will be found.

However, grounded theory is less useful for evaluating specific research hypotheses, or for making systematic comparisons across data samples. For these purposes, the CMDA researcher can profit from the structure, experience, and understandings available through specific discourse analysis paradigms. Such paradigms define issues of theoretical interest, a set of discourse phenomena about which much may already be known in other modalities and contexts, and discovery procedures for revealing the patterns and constraints that characterize the phenomena. Table 3 summarizes this information for five discourse analysis paradigms commonly invoked in CMDA research.

 

Table 3. Five discourse analysis paradigms

 

Issues

Phenomena

Procedures

Text Analysis

(cf. Longacre, 1996)

classification, description, “texture” of texts

genres, schematic organization, reference, salience, cohesion, etc.

identification of structural regularities within and across texts

Conversation Analysis

(cf. Psathas, 1995)

interaction as a jointly negotiated accomplishment

turn-taking, sequences, topic development, etc.

close analysis of the mechanics of interaction; unit is the turn

Pragmatics

(cf. Levinson, 1983)

language as an activity—“doing things” with words

speech acts, relevance, politeness, etc.

interpretation of speakers’ intentions from discourse evidence

Interactional Sociolinguistics

(cf. Gumperz, 1982; Tannen, 1993)

role of culture in shaping and interpreting interaction

verbal genres, discourse styles, (mis)communication, framing, etc.

analysis of the socio-cultural meanings indexed through interaction

Critical Discourse Analysis

(cf. Fairclough, 1992)