Medical Outcomes Study Short Form 36 (SF-36)

Evidence Reviewed as of before: 19-08-2008

Author(s)*: Lisa Zeltzer, MSc OT

Editor(s): Nicol Korner-Bitensky, PhD OT; Elissa Sitcoff, BA BSc; Maxim Ben Yakov, BSc PT

Purpose

In-Depth Review

Purpose of the measure

The Medical Outcomes Study 36-item Short-Form Health Survey is a widely used, generic, patient-report measure created to assess health-related quality of life (HRQOL) in the general population. It was developed as part of the Medical Outcomes Study (a two-year study of patients with chronic conditions) (Ware & Sherbourne, 1992). Today, the SF-36 is the most commonly used generic instrument for measuring quality of life (de Haan, 2002). The SF-36 can be used with, but is not limited to, persons with stroke.

Available versions

The SF-36 was published in 1992 by Ware and Sherbourne, and further developed and validated in 1993 and 1994 respectively (Ware & Sherbourne, 1992; McHorney, Ware & Raczek, 1993; McHorney, Ware, Lu & Sherbourne, 1994). In 1996, Version 2.0 of the SF-36 (SF-36v2) was introduced to correct for deficiencies identified in the original version. Changes include a few wording alterations; for example, “downhearted and blue” in a question on mental health symptoms is now “downhearted and depressed”. SF-36v2 is now considered “the international version” of the SF-36 (Andresen & Meyers, 2000). The original SF-36 questions had variable numbers and formats for response categories, and these have been increased and/or standardized among scales and questions. Role Functioning items now have five levels of responses rather than two. This may increase the responsiveness of the scales. Early reports of tests of this new version have been positive (Jenkinson, Stewart-Brown, Petersen & Paice, 1999). Versions 1.0 and 2.0 of the SF-36 are available with two recall periods: the standard 4-week recall, and the acute 1-week recall period.

Features of the measure

Items:

Items of the SF-36 are divided into eight different domains:

Physical component:

Physical functioning (10 items)
Role limitations due to physical problems (4 items)
Bodily pain (2 items)
General health perceptions (5 items)

Mental component

Social functioning (2 items)
General mental health (5 items)
Role limitations due to emotional problems (3 items)
Vitality (4 items)

Other

Health transition (1 question): The respondent is asked to rate their current health status compared to their health status one year ago. This question remains separate from the 8 subscales and is not scored.

There are 11 questions in the SF-36, with 36 items in total. With the exception of the general change in health status question, subjects are asked to respond with reference to the past 4 weeks. An acute version of the SF-36 refers to problems in the past week only (McDowell & Newell, 1996).

Scoring:

The SF-36 does not lend itself to the generation of an overall summary score. This is because information within the individual responses is lost in the total scale score (since the total score can be achieved in a variety of ways from individual item responses) (Dorman et al., 1999). The recommended scoring system for the SF-36 is a weighted Likert system for each item. Items within subscales are totaled to provide a summed score for each subscale or dimension. Each of the 8 summed scores is linearly transformed onto a scale from 0 (negative health) to 100 (positive health) to provide a score for each subscale. A physical component score (PCS) and mental component score (MCS) can be derived from the scale items. However, these summary scores should be interpreted with caution. Hobart et al. (2002) examined the use of this two-dimensional model and found that these two scales accounted for only 60% of the variance in SF-36 scores. This finding suggests that there is a significant loss of information when this two-dimensional model is used.
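The linear transformation of a summed subscale score onto the 0 to 100 range can be sketched as follows. This is a minimal illustration only, not the official scoring algorithm: the function name and the example item range are hypothetical, and it assumes the item responses have already been recoded and summed according to the scoring manual.

```python
def transform_subscale(raw_sum: float, lowest_possible: float, highest_possible: float) -> float:
    """Linearly rescale a summed subscale score onto 0 (negative health) to 100 (positive health).

    Assumes raw_sum was obtained after any item recoding, so that a higher
    sum already means better health. The bounds are the lowest and highest
    sums that the subscale's items can produce.
    """
    if highest_possible <= lowest_possible:
        raise ValueError("highest_possible must be greater than lowest_possible")
    if not lowest_possible <= raw_sum <= highest_possible:
        raise ValueError("raw_sum lies outside the possible range")
    return (raw_sum - lowest_possible) / (highest_possible - lowest_possible) * 100.0


# Hypothetical example: a 10-item subscale whose items each score 1 to 3,
# so the summed score can range from 10 to 30.
example_score = transform_subscale(raw_sum=20, lowest_possible=10, highest_possible=30)
print(example_score)
```

Because the transformation is applied to each subscale separately, the eight resulting scores share a common range but are not combined into a single total, which is consistent with the caution above about overall summary scores.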

Subscales:

The SF-36 has 8 subscales:

Physical Functioning,
Role Limitations due to Physical Problems,
Bodily Pain,
General Health Perceptions,
Vitality,
Social Functioning,
Role Limitations due to Emotional Problems,
General Mental Health.

Equipment:

Only the test and a pencil are required. Computer administered and telephone voice recognition interactive systems of administration of the SF-36 are currently being evaluated (SF-36 Health Survey Update: John E. Ware, Jr.).

Training:

No training is required for administration of the SF-36. The SF-36 is suitable for self-administration, computerized administration, or administration by a trained interviewer in person or by telephone, to persons age 14 and older (Ware & Sherbourne, 1992).

Time:

The SF-36 is considered simple to administer and takes an average of 10 minutes to complete (Andresen & Meyers, 2000). The SF-36 has been studied for use by a proxy; however, administration by proxy is not recommended for patients with stroke, as agreement has been found to be poor in this patient population (Segal & Schall, 1994; Dorman, Slattery, Farrell, & Dennis, 1998). Instead, a stroke-specific quality of life measure such as the Stroke Impact Scale, which has been evaluated successfully for use by proxy respondents, may be a more appropriate measure to be administered by proxy. Another reliable measure of health status for stroke patients by proxy is the Health Utilities Index (HUI), which has been reported to have adequate to excellent agreement between patients with stroke and their proxies (Mathias, Bates, Pasta, Cisternas, Feeny & Patrick, 1997).

The SF-36 can also be completed as a mail survey. As a self-completed, mailed questionnaire, it has been shown to have reasonably high response rates (83% – Brazier et al., 1992, O’Mahoney, Rodgers, Thomson, Dobson, & James, 1998; 75% – 83% Dorman et al., 1998; 85% – Dorman et al., 1999; 82% overall and 69% for those over age 85 – Walters et al., 2001). Data are typically more complete when interviewer administration is used, although low completion rates are not limited to self-completion or postal administration. Andresen et al. (1999) administered the SF-36 to nursing home residents by face-to-face interview and reported that only 1 in 5 residents were able to complete it. It is possible that data completeness is indicative of respondent acceptance and understanding of the survey as relevant to them (O’Mahoney et al., 1998; Andresen et al., 1999). Hayes et al. (1995) identified that the most common items missing on the self-completed questionnaire referred to work or to vigorous activity. Older respondents recognized these questions as relevant to much younger people and not pertinent to their own situation. The authors suggested modifications to some of the questions, which may increase acceptability to older populations.

Alternative forms of the SF-36

SF-12 (Ware, Kosinski, & Keller, 1996)

The SF-12 was developed as an abbreviated version of the SF-36 for use in large surveys of general and specific populations as well as large longitudinal studies of health outcomes. It can be self-administered, or administered via interview, telephone, or computer. The SF-12 takes 5 minutes or less to complete (Nemeth, 2006). The SF-12v2 was later developed to correspond to the SF-36v2 and has demonstrated the same improvements as observed with the SF-36v2 (Ware, Kosinski, Turner-Bowker & Gandek, 2002). Versions 1.0 and 2.0 of the SF-12 are available with two recall periods: the standard 4-week recall, and the acute 1-week recall period.

SF-8 (QualityMetric, Incorporated)

The SF-8, a new generic eight-item assessment, generates a health profile consisting of eight scales and two summary measures describing HRQOL. The SF-8 uses one question to measure each of the eight SF-36 domains. The development, validation and norming of the new SF-8, including standard (4-week recall), acute (1-week recall), and 24-hour recall versions, is documented in the SF-8 manual, “How to Score and Interpret Single-Item Health Status Measures: A Manual for Users of the SF-8 Health Survey” (Ware, Kosinski, Dewey & Gandek, 2001). The SF-8 Health Survey can be self-administered, computer-administered, or given by a trained interviewer in person or by telephone to persons aged 14 and older. It takes approximately 1-2 minutes to complete and it has been translated and validated for use in more than 30 countries (accessed July 12, 2006).

SF-6D (Brazier, Usherwood, Harper, & Thomas, 1998; Brazier, Roberts, & Deverill, 2002)

The SF-6D is a preference-based scoring system that uses six subscales from the SF-36 to allow for calculations of utilities from SF-36 and SF-36v2 responses. The eight dimensions from SF-36 were reduced to six by omitting General Health Perceptions and combining Role Limitations-Physical and Role Limitations-Emotional. Good reliability and validity have been reported for the SF-6D (Petrou & Hockley, 2005; Brazier, Roberts, Tsuchiya & Busschbach, 2004).

For a fee, all versions of the SF Health Survey can be scored online via QualityMetric’s website (accessed July 12, 2006).

Client suitability

Can be used with:

Individuals with stroke.

The SF-36 is the most widely used measure to assess HRQOL in patients with stroke; however, its suitability in this patient population has been contentious:

Hobart, Williams, Moran, and Thompson (2002) reported that, in their sample of 177 post-stroke patients, five of the eight SF-36 subscales had limited validity as outcome measures, and that the reporting of physical and mental summary scores was not supported. The authors questioned the use of the SF-36 in patients with stroke.
de Haan (2002) reported that when the results of the relatively small study of Hobart et al. (2002) were taken in conjunction with the findings of previous research, there was insufficient evidence to question the reliability and validity of the SF-36 subscales in stroke.

Should not be used in:

Patients who cannot understand written or spoken language. Make sure the patient is fluent in the language used in the survey.
More severely affected stroke survivors who need a proxy to complete the survey (Dorman et al., 1998). Instead, a stroke-specific quality of life measure such as the Stroke Impact Scale, which has been evaluated successfully for use by proxy respondents, may be a more appropriate measure to be administered by proxy. Another more reliable measure of health status for stroke patients by proxy is the Health Utilities Index (HUI), which has been reported to have moderate to high inter-rater agreement between stroke patients and proxies (Mathias et al., 1997).
Patients with aphasia. For these patients, a stroke-specific quality of life measure developed specifically for patients with aphasia, such as the Stroke and Aphasia Quality of Life Scale (SAQOL-39), should be used (Hilari, Byng, Lamping, & Smith, 2003).
The SF-36 should not be used to document individual patient change. Dorman, Slattery, Farrell, Dennis, and Sandercock (1998) found that although the SF-36 can function effectively as a discriminatory measure for assessing health-related quality-of-life outcomes in groups of patients after stroke, the SF-36 may not be adequate for serial assessments of individual patients, unless large differences over time are expected. Thus, the SF-36 should be used for large group comparisons only.

In what languages is the measure available?

The SF-36 is available in a number of languages. In 1991, the International Quality of Life Assessment launched a project aimed at translating, validating and norming the SF-36 health survey. The project, which is based at the Health Assessment Lab in Boston, has sponsored investigators from 14 countries: Australia, Belgium, Canada, Denmark, France, Germany, Italy, Japan, The Netherlands, Norway, Spain, Sweden, the United Kingdom (English version), and the United States (English and Spanish versions). In addition, the SF-36 has been translated for use in more than 40 other countries, including: Argentina, Armenia, Austria, Bangladesh, Brazil, Bulgaria, Cambodia, Chile, China, Colombia, Costa Rica, Croatia, Czech Republic, Finland, Greece, Guatemala, Honduras, Hong Kong, Hungary, Iceland, Israel, Korea, Latvia, Lithuania, Mexico, New Zealand, Peru, Poland, Portugal, Romania, Russia, Singapore, Slovak Republic, South Africa, Switzerland, Taiwan, Tanzania, Turkey, the United Kingdom (Welsh), the United States (Chinese, Japanese, Vietnamese), Uruguay, Venezuela, and Yugoslavia. There are more than 500 publications that use translations or English-language adaptations of the SF-36. For information about the availability of SF-36 translations, visit https://www.qualitymetric.com/health-surveys-old/the-sf-36v2-health-survey/.

Summary

What does the tool measure?	Health related quality of life
What types of clients can the tool be used for?	The SF-36 is a generic measure that can be used with, but is not limited to, persons with stroke.
Is this a screening or assessment tool?	Assessment
Time to administer	The SF-36 is considered simple to administer and takes an average of 10 minutes to complete.
Versions	SF-12; SF-8; SF-6D
Other Languages	The SF-36 is available in a number of languages. There are more than 500 publications that use translations or English-language adaptations of the SF-36. For information about the availability of SF-36 translations, visit www.sf-36.org
Measurement Properties
Reliability	Internal consistency: Out of 10 studies examining the internal consistency of the SF-36, five reported excellent internal consistency (except for the subscales of Social Functioning in three studies and General Health in one study, which were considered adequate). Two studies reported adequate to excellent internal consistency. Three studies reported poor to excellent internal consistency. Test-retest: Out of the five studies examining test-retest reliability of the SF-36, three reported adequate to excellent test-retest reliability. One study reported adequate test-retest reliability. One reported poor to excellent test-retest reliability. Inter-rater: No studies have examined the inter-rater reliability of the SF-36.
Validity	Criterion: Predictive: Subscales of the SF-36 have been found to be predictive of death, hospitalizations, physician visits, and the burden of depression among depressed elderly persons. Construct: Convergent: Adequate correlations between the SF-36 Physical Health subscale and the Activities of Daily Living Index; the SF-36 Social Functioning subscale and social isolation on the Nottingham Health Profile; the General Health subscale and the EuroQol overall HRQOL rating; the SF-36 Bodily Pain subscale and all EuroQol domains; and the Role Functioning-Emotional subscale with the EuroQol psychological domain. Excellent correlation between the Physical Health scores from the SF-36 and the Geriatric Depression Scale; the Vitality subscale on the SF-36 and energy subscale on the Nottingham Health Profile; and the Bodily Pain subscale on the SF-36 with the EuroQol pain domain. Known groups: SF-36 scores discriminated between patients diagnosed with one or more chronic physical problems and healthy age-matched controls; individuals older than 75 and younger than 75; groups based on setting (general practice versus hospital outpatients); migraine sufferers and controls; groups based on recent visits to their family doctor, hospital inpatient stays and longstanding illness; patients with stroke and their age and gender matched controls.
Floor/Ceiling Effects	Of the 8 studies examined, 6 reported that the SF-36 had significant floor and ceiling effects, 1 reported significant ceiling effects only, and 1 reported significant floor effects only.
Does the tool detect change in patients?	Out of 3 studies examined, 1 reported that the SF-36 had a large ability to detect change; 1 reported moderate to large ability to detect change (except for the Social Functioning and Mental Health dimensions, which both had small effect sizes); and 1 reported small (Role Limitations-Emotional, Mental component summary score) to large (Bodily Pain, Physical component summary score) ability to detect change. To our knowledge, no studies have examined the ability of the SF-36 to detect change in patients with stroke.
Acceptability	The SF-36 cannot be used with patients who cannot understand written or spoken language, severely affected patients who need a proxy to complete it, or patients with aphasia.
Feasibility	The SF-36 is simple to administer and requires no training or special equipment. It is suitable for self-administration, computerized administration, or administration by a trained interviewer in person or by telephone, to persons age 14 and older.
How to obtain the tool?	All versions of the SF-36 can be viewed by visiting the website: www.qualitymetric.com

Psychometric Properties

Overview

Extensive psychometric testing has been conducted on the SF-36. However, little research has been conducted specifically in a post-stroke population. For the purposes of this review, we conducted a literature search to identify all relevant publications on the psychometric properties of the SF-36. We then selected articles for review from high-impact journals and from a variety of authors. The creators of the SF-36 have performed many of the psychometric studies that exist on the survey; however, we preferentially reviewed studies carried out by authors who were not involved in the development of the SF-36.

Floor and Ceiling Effects

Lai, Perera, Duncan, and Bode (2003) administered the Stroke Impact Scale and the SF-36 to 278 stroke subjects approximately 90 days after stroke. In comparison to the Stroke Impact Scale-16 (characterizes physical functioning), the SF-36 Physical Functioning subscale had major floor effects (floor effects of 37% and 100% were observed for patients with a modified Rankin scale grade 4 or 5, respectively). Further, in contrast to the Stroke Impact Scale-Participation (characterizes social functioning), the SF-36 Social Functioning subscale had major ceiling effects (ceiling effects up to 60% for modified Rankin scale grade 0).

Anderson et al. (1996) examined the SF-36 in a cohort of 90 long-term (1-year) stroke survivors. The validity of the SF-36 was assessed by comparing patients’ scores on the SF-36 with those obtained for the Barthel Index, the 28-item General Health Questionnaire, and the Adelaide Activities Profile. Large ceiling effects were reported for the SF-36 Role Limitations-Physical (53%), Bodily Pain (43%), Social Functioning (67%) and Role Limitations-Emotional (72%) subscales. No floor effects exceeding 7% were reported for the SF-36, and scores for the SF-36 Physical Functioning subscale were more uniformly distributed than Barthel Index scores, suggesting the SF-36 has lower floor and ceiling effects than the Barthel Index.

Brazier et al. (1996) tested the psychometric properties of the SF-36 and the EuroQol on an elderly female population (n=380) aged 75 and older, and compared these scales to the Office of Population Census and Surveys Disability Survey. Patients were administered the scales at baseline and again six months later. Major floor effects (in excess of 25%) were reported for the Role Limitations-Physical and Role Limitations-Emotional subscales.

Hobart et al. (2002) examined SF-36 data from 177 people after stroke. Notable floor effects were observed for the Role Limitations-Physical (59.1%), Role Limitations-Emotional (63.1%), Social Functioning (29.9%), and Bodily Pain (25.6%) subscales. Notable ceiling effects were also observed for the Role Limitations-Emotional (63.1%), Social Functioning (29.9%) and Bodily Pain (25.6%) subscales.

O’Mahoney et al. (1998) examined the suitability of the SF-36 for assessing quality of life in older patients with stroke. Floor effects were high for the Role Limitations-Physical (54%) and Role Limitations-Emotional (35%) subscales and for the Social Functioning (17%) and Physical Functioning (18%) subscales. Ceiling effects were also substantial for the Role Limitations-Physical (16%), Role Limitations-Emotional (51%), Social Functioning (18%) and Bodily Pain (25%) subscales.

Weinberger, Oddone, Samsa and Landsman (1996) administered the SF-36 three times over a 4-week period to 172 veterans receiving care in a General Medicine Clinic. Telephone, face-to-face, and self-administration modes of administering the SF-36 were compared. For face-to-face administration of the SF-36, notable floor effects were observed for the Role Limitations-Physical (43.8%) and Role Limitations-Emotional (30.3%) subscales. Notable ceiling effects were observed for the Social Functioning (31.5%), Role Limitations-Physical (14.6%), and Role Limitations-Emotional (47.2%) subscales. For telephone administration, significant floor effects were observed for the Role Limitations-Physical (53.2%) and Role Limitations-Emotional (34.0%) subscales. Significant ceiling effects were observed for the Role Limitations-Emotional (36.2%) subscale only. Self-administration of the SF-36 resulted in significant floor effects for the Role Limitations-Physical (47.1%) and Role Limitations-Emotional (25.0%) subscales. Further, notable ceiling effects were observed for the Social Functioning (27.8%), Role Limitations-Physical (14.7%), and Role Limitations-Emotional (52.8%) subscales.

Walters, Munro and Brazier (2001) administered the SF-36 to a community-dwelling population over the age of 65. Substantial floor (30.9-61%) and ceiling effects across all age groupings (65-69, 70-74, 75-79, 80-84, and 85+) were observed for the Role Functioning-Physical (floor effects: 30.9%-60% and ceiling effects: 11.7%-38.6%) and Role Functioning-Emotional (floor effects: 25.6%-50.4% and ceiling effects: 32.2%-53.2%) subscales. Substantial ceiling effects were also noted for the Social Functioning and Bodily Pain subscales (15%-46.7% and 14.1%-21.1%, respectively).

Andresen, Gravitt, Aydelotte, and Podgorski (1999) administered the SF-36 to 97 nursing home residents and reported substantial floor effects of 26.8% and 29.5% for the Physical Functioning and Role Limitations-Physical subscales, respectively. Substantial ceiling effects of 36.1%, 49.5% and 21.6% were reported for the Social Functioning, Role Limitations-Emotional, and Bodily Pain subscales, respectively.
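
The floor and ceiling percentages reported in the studies above are conventionally the share of respondents who obtain the lowest or highest possible score on a subscale. The following is a minimal Python sketch of that calculation, not the method of any particular study. It assumes subscale scores have already been transformed to the usual 0-100 range; the function name and the sample scores are hypothetical.

```python
def floor_ceiling_effects(scores, min_score=0.0, max_score=100.0):
    """Return (floor %, ceiling %) for one subscale.

    scores: transformed subscale scores, one per respondent.
    min_score / max_score: lowest and highest possible scores
    (assumed here to be 0 and 100).
    """
    if not scores:
        raise ValueError("at least one score is required")
    n = len(scores)
    # Floor effect: share of respondents at the minimum possible score.
    floor_pct = 100.0 * sum(1 for s in scores if s == min_score) / n
    # Ceiling effect: share of respondents at the maximum possible score.
    ceiling_pct = 100.0 * sum(1 for s in scores if s == max_score) / n
    return floor_pct, ceiling_pct


# Hypothetical scores for eight respondents on one subscale.
example_scores = [0, 0, 50, 100, 75, 100, 100, 25]
print(floor_ceiling_effects(example_scores))  # (25.0, 37.5)
```

Subscales built from few items with few response options have only a handful of possible scores, so a large share of respondents can land on the minimum or maximum.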

Reliability studies have demonstrated excellent internal consistency, with Cronbach’s alpha generally exceeding 0.80 for all scales except Social Functioning. Alpha for Social Functioning may sometimes be lower because the subscale contains only 2 items (Ware, Snow, Kosinski & Gandek, 1993; Brazier et al., 1992; Lyons, Perry, & Littlepage, 1994; McHorney, Ware, Lu, & Sherbourne, 1994; Ruta, Garratt, Wardlaw, & Russell, 1994). Test-retest reliability evaluations have also suggested that the SF-36 scores can generally be reproduced (Brazier et al., 1992; Beaton, Hogg-Johnson, & Bombardier, 1997).
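
For readers unfamiliar with how Cronbach’s alpha is obtained, the sketch below applies the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. It is an illustration only and does not reproduce the analysis of any cited study; the function name and the sample responses are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one subscale.

    responses: list of rows, one per respondent; each row holds that
    respondent's scores on the k items of the subscale.
    """
    if len(responses) < 2:
        raise ValueError("at least two respondents are required")
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # Sample variance of each item across respondents.
    item_variances = [variance(column) for column in zip(*responses)]
    # Sample variance of each respondent's summed score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical two-item subscale answered by four respondents.
two_item_responses = [[1, 2], [2, 2], [4, 5], [5, 4]]
print(round(cronbach_alpha(two_item_responses), 2))
```

With only two items, alpha depends on a single inter-item covariance, which is one reason a two-item subscale such as Social Functioning can show lower values than longer subscales.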

Brazier et al. (1992) found considerable evidence for the reliability of the SF-36. Internal consistency was excellent, with Cronbach’s alpha exceeding 0.85, and reliability coefficients exceeded 0.75 for all dimensions of the scale with the exception of the Social Functioning subscale (alpha = 0.73). To assess test-retest reliability, Brazier et al. (1992) calculated correlation coefficients, which ranged from adequate (0.60 for Social Functioning) to excellent (0.81 for Physical Functioning).

Jenkinson, Coulter and Wright (1993) mailed the SF-36 to a large community sample to explore the questionnaire’s internal consistency and validity. Cronbach’s alpha was excellent, exceeding 0.80, for all subscales of the SF-36 except the Social Functioning subscale, which had adequate internal consistency (alpha = 0.76). For the Social Functioning dimension, this result was considered acceptable given the small number of items (2 items using a 5-point scale).

Jenkinson, Wright and Coulter (1994) mailed the SF-36 to 13,042 randomly selected subjects between the ages of 16 and 64 years. The internal consistency of the SF-36 was found to range from adequate to excellent (alpha ranged from 0.76 for Social Functioning to 0.90 for Physical Functioning). Internal consistency was then calculated separately for five subgroups of overall self-rated general health (poor, fair, good, very good, excellent). All alpha values were adequate, exceeding 0.70, except for the Social Functioning subscale, which was poor (exceeded 0.50). Given the small number of items in this domain, this result is considered acceptable.

Brazier et al. (1996) calculated the reliability of the SF-36 in 380 women over the age of 75. Spearman’s rank correlation coefficients were calculated between initial assessment and first follow-up scores for those who said their health had not changed. Coefficients ranged from poor (r = 0.28 for Social Functioning) to adequate (r = 0.70 for Vitality) over a retest period of 6 months. These results suggest that the SF-36 has only adequate test-retest reliability in the elderly. Brazier et al. (1996) also examined the internal consistency of the SF-36 and reported excellent internal consistency (alpha ≥ 0.80) for all subscales except Social Functioning (0.56) and General Health (0.66), which showed poor internal consistency.

Andresen et al. (1999) administered the SF-36 to 97 nursing home residents and then re-administered it after 1 week. Test-retest intraclass correlation coefficients (ICCs) ranged from adequate to excellent (from 0.55 to 0.82). Further, the ICCs for both the physical summary and mental summary scores were excellent (ICC = 0.82 and 0.79, respectively).

Essink-Bot, Krabbe, Bonsel, and Aaronson (1997) administered the SF-36, the Nottingham Health Profile, the COOP/WONCA charts (The Dartmouth Primary Care Cooperative Information Project/World Organization of National Colleges, Academies, and Academic Associations of General Practices/Family Physicians), and the EuroQol to migraine sufferers. The scales of the SF-36 yielded internal consistency estimates ranging from adequate (alpha = 0.76 for General Health) to excellent (0.91 for Physical Functioning). The mean alpha coefficient was considered excellent (alpha = 0.84). The internal consistency of the SF-36 subscales exceeded that of the Nottingham Health Profile scales.

Walters, Munro and Brazier (2001) reported excellent internal consistency (Cronbach’s alpha ≥ 0.80) for all subscales of the SF-36 except for the Social Functioning subscale (alpha = 0.79) when the survey was administered by mail to a sample of 9,897 subjects aged 65-104 years.

McHorney, Ware, Lu and Sherbourne (1994) evaluated data from 3,445 patients from the Medical Outcomes Study (MOS) and replicated data across 24 subgroups differing in socio-demographic characteristics, diagnosis, and disease severity. Across patient groups, all scales passed tests for item-internal consistency (97% passed). Reliability coefficients ranged from a low of 0.65 to a high of 0.94 across scales (median = 0.85) and varied somewhat across patient subgroups.

Weinberger et al. (1996) tested whether the SF-36 is influenced by method of administration (face-to-face interview, self-administration and telephone interview) in 172 veterans receiving care at a General Medical Clinic. All patients were asked to complete the SF-36 three times over a 4-week period. Cronbach’s alpha coefficients indicated that items in all eight SF-36 domains were highly internally consistent, regardless of the mode of administration; however, they showed large variation over short intervals. Specifically, of 24 computed Cronbach’s alphas (i.e., eight scales times three modes of administration), only one was below 0.70 (Social Function via telephone administration), whereas 17 exceeded 0.80. Cronbach’s alphas did not differ significantly by method of administration. Test-retest correlations ranged from r = 0.55 (Physical Role Function by telephone administration) to r = 0.94 (Physical Function by self-administration).

Hagen, Bugge, and Alexander (2003) examined the reliability of the SF-36 in patients in the early post-stroke period. The SF-36 was administered at 1, 3 and 6 months after stroke onset. The internal consistency of the eight subscales at all three time-points was good except for 1-month Vitality (alpha = 0.68) and 3-month General Health (alpha = 0.67), which were considered poor.

Dorman et al. (1998) assessed the test-retest reliability and the internal consistency of the SF-36 in 2,253 patients with stroke. ICCs ranged from poor (0.28 for Mental Health) to excellent (0.80 for Social Functioning). Internal consistency of the SF-36 was excellent (ranging from 0.81 for Social Functioning to 0.96 for Emotional Role Functioning). Dorman et al. concluded that although the SF-36 can function effectively as a discriminatory measure for assessing health-related quality-of-life outcomes in groups of patients after stroke, the level of test-retest reliability reported in stroke populations indicates that the SF-36 may not be adequate for serial assessments of individual patients, unless large differences over time are expected. Thus, the SF-36 should be used for large group comparisons only.

Furthermore, test-retest reliability was negatively affected by the use of proxy respondents in this study. While the use of a proxy may be the only means by which to include data from more severely affected stroke survivors, the subjective nature of the SF-36 may make proxy use difficult or even inadvisable.
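
The concern about serial assessment of individual patients can be made concrete with a small Python sketch. It is not part of Dorman et al.'s analysis: it assumes the classical test theory formulas for the standard error of measurement (SD × √(1 − reliability)) and the 95% minimal detectable change (1.96 × √2 × SEM), treats the reported ICCs as the reliability coefficient, and uses a hypothetical subscale standard deviation of 20 points.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), from classical test theory."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def minimal_detectable_change(sd, reliability, z=1.96):
    """Smallest individual change that exceeds measurement error at ~95% confidence."""
    return z * math.sqrt(2.0) * standard_error_of_measurement(sd, reliability)


# Hypothetical standard deviation of a 0-100 subscale score (an assumption).
HYPOTHETICAL_SD = 20.0

# Lowest and highest ICCs reported by Dorman et al. (1998).
for label, icc in [("Mental Health", 0.28), ("Social Functioning", 0.80)]:
    mdc = minimal_detectable_change(HYPOTHETICAL_SD, icc)
    print(f"{label}: ICC = {icc:.2f}, MDC95 = {mdc:.1f} points")
```

With the assumed standard deviation, an individual's score would need to change by roughly 47 points at an ICC of 0.28, versus roughly 25 points at an ICC of 0.80, before the change could be distinguished from measurement error. Group means are far less affected, because random error averages out across patients.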

Hobart, Williams, Moran and Thompson (2002) argue that the SF-36 has limited reliability, as the General Health Perceptions and Social Functioning scales generate low reliability scores and have limited convergent and discriminant validity. However, de Haan (2002) argues that Hobart et al.’s conclusions can be challenged. The reliability of only one scale (General Health Perceptions) fell marginally below (Cronbach’s alpha = 0.68) the authors’ predefined criterion of alpha = 0.70. Although it is often recommended that coefficient values should be above 0.80, de Haan points out that coefficients above 0.70 are generally regarded as acceptable for scales when assessing outcome on a group level.

Anderson, Laubscher and Burns (1996) administered the Australian version of the SF-36 to 90 individuals at one year post-stroke. The authors concluded that the SF-36 has satisfactory internal consistency; however, alphas ranged from 0.60 for the Vitality scale (indicating poor internal consistency) to 0.90 for Physical Functioning, Bodily Pain and Role Limitations-Emotional (excellent internal consistency). The Cronbach’s alphas of four subscales of the SF-36 fell below 0.80 (General Health, Vitality, Social Functioning and Mental Health).

Validity

Criterion:

Predictive:
McHorney (1996) examined data from the Medical Outcomes Study. The General Health Perceptions subscale was found to be most predictive of death (the death rate of patients in the lowest quartile for the SF-36 General Health scale was three times greater than for patients with scores in the highest quartile), followed by the Physical Functioning subscale. Baseline Physical Functioning, Role Limitations-Physical, and Pain subscales were most predictive of hospitalizations. Moreover, Pain, General Health and Vitality subscales were most predictive of physician visits.

Beusterien, Steinwald, & Ware (1996) found that the SF-36 Mental Health subscale and mental component summary measure were strongly associated with severity of depression in cross-sectional analyses. These results suggest that the SF-36 is useful for estimating the burden of depression among depressed elderly persons.

Rumsfeld et al. (1999) tested whether the physical and mental component summary scores from the preoperative SF-36 health status survey predicted mortality in 3,956 patients following coronary artery bypass graft (CABG) surgery. The physical component summary of the preoperative SF-36 was found to be a statistically significant risk factor for 6-month mortality following CABG surgery. In multivariate analysis, a 10-point lower SF-36 physical component summary score had an odds ratio (OR) of 1.39 for predicting mortality. The SF-36 mental component summary score was not associated with 6-month mortality in multivariate analyses (OR = 1.09). Thus, preoperative patient self-report of the physical component of the SF-36 health status may be helpful for risk stratification and clinical decision making for patients undergoing CABG surgery.
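
To show how an odds ratio reported per 10-point difference might be read for other score differences, the following Python sketch rescales it. This is an illustration only: it assumes the score enters a logistic model as a linear term, so that the odds ratio compounds multiplicatively, which is a common convention but is not described in the summary of Rumsfeld et al. given here. The function name is hypothetical.

```python
def scaled_odds_ratio(odds_ratio, per_points, difference):
    """Odds ratio for `difference` points, given an odds ratio per `per_points` points.

    Assumes log-odds are linear in the score (an assumption, not a reported finding).
    """
    if odds_ratio <= 0 or per_points <= 0:
        raise ValueError("odds_ratio and per_points must be positive")
    return odds_ratio ** (difference / per_points)


# OR of 1.39 per 10-point lower physical component summary score (from the text).
print(round(scaled_odds_ratio(1.39, 10, 20), 2))  # 1.93 for a 20-point lower score
print(round(scaled_odds_ratio(1.39, 10, 5), 2))   # 1.18 for a 5-point lower score
```

Under that linearity assumption, a 20-point lower score corresponds to 1.39 squared rather than 1.39 doubled, so the odds ratio grows multiplicatively with the size of the score difference.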

Construct:

Walters et al. (2001) reported significant relationships in expected directions to support construct validity among older adults. Scores in all scales were reported to decrease as age increased. Women reported worse health than men on all scales even after adjusting for age. Respondents who had recently visited their physician reported poorer health on all scales and people living alone had lower scores except on general health.

Ware, Kosinski, and Keller (1994) examined the construct validity of the 8 subscales of the SF-36. Physical Functioning was shown to be the best all-around measure of physical health (r = 0.85), and Mental Health was the most valid measure of mental health (r = 0.87). Interestingly, Mental Health was one of the poorest measures of the physical component (r = 0.17) and Physical Functioning was the poorest measure of the mental component (r = 0.12). The Vitality (r = 0.47 for the physical health component and r = 0.65 for the mental health component) and General Health (r = 0.69 for the physical health component and r = 0.37 for the mental health component) subscales had excellent or adequate validity for both components.

Construct (in patients with stroke):

Wilkinson et al. (1997) interviewed 106 people less than 75 years old and their caregivers following a first-ever stroke. Rank correlation coefficients of the Barthel Index with the SF-36 subscales in first-ever stroke patients ranged from poor (r = 0.22 for Role Limitation-Emotional subscale) to excellent (0.81 for Physical Functioning subscale).

Convergent/Discriminant:
Convergent validity of the SF-36 is generally strongly supported in comparison to similar domains of condition-specific measures (Fielder, Denholm, Lyons, & Fielder, 1996; Nortvedt, Riise, Myhr, & Nyland, 1999; The Counseling Versus Antidepressants in Primary Care Study Group, 1999; Benninger, Ahuja, Gardner, & Grywalski, 1998; Buchwald et al., 1996; Anderson, Laubscher, & Burns, 1996) and other generic HRQOL measures (Andresen et al., 1999; Andresen, Rothenberg, & Kaplan, 1998; Rothwell, McDowell, Wong, & Dorman, 1997). Discriminant validity is usually rated highly for the SF-36 (e.g. Andresen et al., 1999; The Canadian Burden of Illness Study Group, 1998; Buchwald, Pearlman, Umali, Schmaling, & Katon, 1996; Komaroff et al., 1996; O’Neill & Kelly, 1996), although some studies disagree (e.g. Colantonio, Dawson, & McLellan, 1998; Lalonde, Clarke, Joseph, Mackenzie, & Grover, 1999; Myers & Wilks, 1999).

Andresen et al. (1999) administered the SF-36, the Geriatric Depression Scale and the Mini-Mental State Examination to 97 nursing home residents. Activities of daily living and medication intake data were recorded. Convergent validity between the SF-36 Physical Health subscale and the Activities of Daily Living Index was adequate (r ranged from -0.37 to -0.43). These correlations are negative because a high score on the SF-36 indicates positive health status, whereas a high score on the Activities of Daily Living Index indicates dependence. Physical health scores from the SF-36 correlated more strongly with Geriatric Depression Scale scores than with Activities of Daily Living Index scores (-0.63 vs. 0.01). However, the Role Limitations-Physical subscale correlated more strongly with Geriatric Depression Scale scores than with Activities of Daily Living scores. The Social Functioning, Role Limitations-Emotional, Vitality and Mental Health subscales all correlated more strongly with Geriatric Depression Scale scores than with Activities of Daily Living scores.

Brazier et al. (1992) reported correlations of -0.41 (Social Functioning vs. social isolation) to -0.68 (Vitality vs. energy) between similar scales on the SF-36 and the Nottingham Health Profile. Correlations between less clearly related dimensions ranged from -0.18 (Physical Functioning vs. emotional reactions) to -0.53 (Social Functioning vs. emotional reactions). These correlations are negative because a high score on the SF-36 indicates positive health status, whereas a high score on the Nottingham Health Profile indicates poorer perceived health status.

Dorman et al. (1999) reported that the SF-36 Physical Functioning subscale correlated most closely with the mobility, self-care and activities domains of the EuroQol (r = 0.57, 0.65 and 0.63, respectively) and less strongly with the EuroQol psychological domain (r = 0.34). The SF-36 Bodily Pain subscale correlated with the EuroQol pain domain (r = 0.66) and adequately correlated with all EuroQol domains. Role Functioning-Emotional correlated most closely with the EuroQol psychological domain (r = 0.43), and correlated least with the EuroQol self-care domain (r = 0.24). The SF-36 Mental Health subscale was not closely related to the psychological domain (r = 0.21) or to the physical EuroQol domains (r = 0.06 to 0.10). The SF-36 General Health subscale correlated adequately with the EuroQol overall HRQOL rating (r = 0.66).

Known Groups:
Patients diagnosed with ≥ 1 chronic physical problem had lower scores on all dimensions of the SF-36 except Mental Health, in comparison to healthy age-matched controls. The SF-36 scores were distributed as expected for sex, age, social class and use of health services (Brazier et al., 1992).

The SF-36 was found to discriminate between age groups (<75 years versus 75+) on the Physical Functioning, Vitality and Change in Health subscales and between groups based on setting (general practice versus hospital outpatients) on the Physical Functioning and Role Functioning-Physical subscales (Hayes et al., 1995).

Essink-Bot et al. (1997) reported that the SF-36 discriminated between migraine sufferers and controls on all subscales, although discrimination was poor (ROC/AUC = 0.54 – 0.67). The SF-36 also discriminated between groups of migraine sufferers based on absence from work (0 vs. ≥ 0.5 days; ROC/AUC ranged from poor, 0.61, to adequate, 0.79).
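
To make the ROC/AUC values above easier to interpret, the following is a minimal sketch of how an area under the ROC curve can be computed for a single subscale. It uses the standard equivalence between the AUC and the probability that a randomly chosen control scores higher than a randomly chosen case, with ties counted as half. The function name and the scores are hypothetical and purely illustrative; the sketch does not reproduce the analysis of Essink-Bot et al. (1997), whose exact procedure is not described here.

```python
def auc(cases, controls):
    """Area under the ROC curve for one subscale.

    Computed as the proportion of (case, control) pairs in which the
    control has the higher score; ties count as half a pair.
    Assumes lower scores indicate poorer health, as on the SF-36.
    """
    wins = 0.0
    for case in cases:
        for control in controls:
            if control > case:
                wins += 1.0
            elif control == case:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical 0-100 subscale scores (illustrative only, not study data).
migraine = [45.0, 60.0, 55.0, 70.0]
controls = [65.0, 60.0, 80.0, 50.0]

area = auc(migraine, controls)
print(f"AUC = {area:.2f}")
```

An AUC of 0.5 means the subscale separates the groups no better than chance, and 1.0 means perfect separation, which is why values in the 0.54 to 0.67 range are described as poor.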

Brazier et al. (1996) reported that SF-36 scores distinguished groups based on recent visits to their family doctor, hospital inpatient stays and longstanding illness.

Known Groups (in patients with stroke):
Anderson et al. (1996) administered the Australian version of the SF-36 to 90 stroke survivors (1-year post-stroke). Validity was assessed by comparing patients’ scores on the SF-36 with those obtained for the Barthel Index, the 28-item General Health Questionnaire, and the Adelaide Activities Profile, an instrument developed from the Frenchay Activities Index. Construct validity was demonstrated by significant differences across all eight SF-36 scales for patients with identified health problems. For patients dependent in activities of daily living, the difference in mean scores was greatest for the Physical Functioning and General Health scales, whereas for patients with emotional health problems, the strongest associations were with the Social Functioning, Role Limitations-Emotional, and Mental Health subscales.

Mayo et al. (2002) interviewed persons with first-ever stroke and a population-based sample of community-dwelling individuals without stroke by telephone at 6-month intervals for 2 years of follow-up. SF-36 scores successfully discriminated those with stroke from their age- and gender-matched controls.

Cross-diagnostic:

Dallmeijer et al. (2007) examined the unidimensionality and differential item functioning of the Physical Functioning subscale of the SF-36 using Rasch analysis in patients with stroke, multiple sclerosis, and amyotrophic lateral sclerosis (ALS). All items of the Physical Functioning subscale, except one for the ALS group (the bathing/dressing item), formed a unidimensional scale, supporting the use of a sum score as a measure of Physical Functioning within these diagnostic groups. The pooled analysis showed inadequate fit to the Rasch model for the ‘walking several hundred meters’ item. Of the other 9 items, 5 showed differential item functioning for stroke vs. multiple sclerosis and ALS, while no differential item functioning was found between multiple sclerosis and ALS. Thus, when comparing the data of patients with stroke with those of patients with multiple sclerosis and/or ALS, adjustments for differential item functioning are necessary.

Responsiveness

Harwood and Ebrahim (2000) examined the sensitivity to change of the SF-36 in 81 patients before and after hip replacement. Eighty-nine percent of patients reported improvements three months after surgery. The largest changes were seen on the SF-36 Pain (large effect sizes of 1.2 at 3 months and 1.5 at 6-12 months), Physical Function (large effect sizes of 1.1 at 3 months and 1.3 at 6-12 months) and Role Limitation-Physical (large effect sizes of 0.8 at 3 months and 1.2 at 6-12 months) scales, suggesting that some of the SF-36 dimensions are very sensitive to change.

Brazier, Walters, Nicholl and Kohler (1996) tested the sensitivity of the SF-36, the EuroQol and the Office of Population Census and Surveys Disability Survey in an elderly female population. These measures were administered by interview in a hospital clinic at baseline. A random subsample of respondents was retested six months later. Sensitivity of the instruments was quantified by estimating effect sizes for hypothesized changes in health status. There was some evidence of greater sensitivity to lower levels of morbidity in the SF-36. A hypothesized change from having a long-standing illness to having no long-standing illness was associated with moderate to large effect sizes across dimensions of the three instruments, except the Social Functioning (ES = 0.41) and Mental Health (ES = 0.31) dimensions of the SF-36, which both had small effect sizes. The effect sizes for differences in instrument scores between the age groups were small (in the range 0.00-0.50), with the highest for Physical Functioning. The SF-36 was rated as more sensitive to change than the EuroQol for older adult women.

In a study by Mossberg and McFarland (2001), 6 outpatient rehabilitation clinics incorporated the SF-36 into everyday practice. Ninety patients completed the SF-36 health status questionnaire before initiating treatment and again at discharge. Only nonsurgical patients without comorbidities were enrolled. Effect sizes for the SF-36 (from admission to outpatient rehabilitation to discharge) ranged from small (0.48 for Role Limitations-Emotional) to large (1.38 for Bodily Pain). The physical component summary score effect size was large (ES = 0.80) and the mental component summary score effect size was small (ES = 0.45).
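
As an illustration of how the effect sizes reported above can be read, here is a minimal sketch that computes an effect size as a standardized mean change (mean change divided by the standard deviation of baseline scores) and attaches the small/moderate/large labels used in this review. The formula is one common definition and is an assumption here, since the exact calculation used in each cited study is not stated. The function names, the cut-offs (Cohen-style bands of 0.2, 0.5 and 0.8) and the scores are illustrative and are not data from any study.

```python
from statistics import mean, stdev


def effect_size(baseline, followup):
    """Standardized mean change: (mean follow-up - mean baseline) / SD of baseline."""
    return (mean(followup) - mean(baseline)) / stdev(baseline)


def label(es):
    """Cohen-style bands (assumed): >= 0.8 large, >= 0.5 moderate, >= 0.2 small."""
    size = abs(es)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "negligible"


# Hypothetical 0-100 subscale scores for five patients (illustrative only).
admission = [30.0, 45.0, 50.0, 40.0, 35.0]
discharge = [40.0, 50.0, 60.0, 45.0, 45.0]

es = effect_size(admission, discharge)
print(f"ES = {es:.2f} ({label(es)})")
```

Under these assumed bands, the values quoted above fall where the text places them: 0.45 and 0.48 are small, while 0.80 and 1.38 are large.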

The SF-36 is increasingly being used in stroke studies (Anderson, Laubscher & Burns, 1996; Duncan et al., 1997) and in stroke clinical trials. However, the psychometric properties of the SF-36 soon after stroke are not well known, as most of the current data are from patients one year or more after the stroke (e.g. Anderson et al., 1996; Duncan et al., 1997). We did not identify any studies on the responsiveness of the SF-36 in patients with stroke.

Muller-Nordhorn et al. (2004) examined the responsiveness to change of the SF-12 in patients with stroke or transitory ischemic attack. Patients (n=558) were administered the SF-12 at baseline (referring to status prior to the event) and after 12 months. In patients with stroke, standardized response means (SRMs) were small for the physical component summary scale of the SF-12 (SRM 0.49) and moderate for the mental component summary scale of the SF-12 (SRM 0.52). In patients with transitory ischemic attack, SRMs were below 0.2 for the physical component summary scale of the SF-12 and small for the mental component summary scale of the SF-12 (SRM 0.34). SRMs increased with stroke severity as indicated by the National Institutes of Health Stroke Scale score. Thus, the SF-12 summary scales show a small to moderate responsiveness to change in patients after stroke. Responsiveness to change was higher in patients with greater stroke severity.
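
For readers unfamiliar with the standardized response mean, the sketch below shows the usual definition: the mean of the individual change scores divided by the standard deviation of those change scores. The function name and the scores are hypothetical, and the sketch is not a reconstruction of the analysis by Muller-Nordhorn et al. (2004). The sign of the result simply reflects the direction of change, so studies often report the magnitude only.

```python
from statistics import mean, stdev


def standardized_response_mean(baseline, followup):
    """SRM: mean of paired change scores divided by the SD of those change scores."""
    changes = [after - before for before, after in zip(baseline, followup)]
    return mean(changes) / stdev(changes)


# Hypothetical summary scores for six patients (illustrative only):
# status before the event and again at 12 months.
baseline = [50.0, 42.0, 38.0, 55.0, 47.0, 44.0]
followup = [46.0, 43.0, 30.0, 49.0, 46.0, 38.0]

srm = standardized_response_mean(baseline, followup)
print(f"SRM = {srm:.2f} (magnitude {abs(srm):.2f})")
```

Unlike an effect size standardized by the baseline standard deviation, the SRM is standardized by the variability of the change itself, so the two indices can differ for the same data.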

The observation that patients with stroke had scores similar to patients with transient ischemic attacks raises questions about the ability of the SF-36 to discriminate and to be responsive to clinical changes in patients with stroke (Duncan et al., 1997). Currently, no evaluative stroke-specific HRQOL instrument is available, and it remains to be seen whether generic HRQOL instruments such as the SF-36 are sufficiently responsive to be useful in clinical trials. More information regarding the responsiveness of the SF-36 will be known when a number of ongoing stroke trials are completed (Williams, 1998).

References

Aaronson, N. K., Muller, M., Cohen, P. D. A., Essink-Bot, M. L., Fekkes, M., Sanderman, R., Sprangers, M. A., Velder, A., Verrips, E. (1998). Translation, validation and norming of the Dutch language version of the SF-36 health survey in community and chronic disease populations. J Clin Epidemiol, 51, 1055-1068.
Anderson, C., Laubscher, S., Burns, R. (1996). Validation of the Short Form 36 (SF-36) Health Survey Questionnaire among stroke patients. Stroke, 27(10), 1812-1816.
Andresen, E. M., Meyers, A. R. (2000). Health-related quality of life outcomes measures. Arch Phys Med Rehabil, 81(12), S30-45.
Andresen, E. M., Gwendell, W., Gravitt, G. W., Aydelotte, M. E., Podgorski, C. A. (1999). Limitations of the SF-36 in a sample of nursing home residents. Age and Ageing, 28, 562-566.
Andresen, E. M., Fouts, B. S., Romeis, J. C., Brownson, C. A. (1999). Performance of health-related quality-of-life instruments in a spinal cord injured population. Arch Phys Med Rehabil, 80, 877-884.
Andresen, E. M., Rothenberg, B. M., Kaplan, R. M. (1998). Performance of a self-administered mailed version of the Quality of Well-Being (QWB-SA) questionnaire among older adults. Med Care, 36, 1349-1360.
Beaton, D. E., Hogg-Johnson, S., Bombardier, C. (1997). Evaluating changes in health status: Reliability and responsiveness of five generic health status measures in workers with musculoskeletal disorders. J Clin Epidemiol, 50(1), 79-93.
Benninger, M. S., Ahuja, A. S., Gardner, G., Grywalski, C. (1998). Assessing outcomes for dysphonic patients. J Voice, 12, 540-550.
Beusterien, K. M., Steinwald, B., Ware, J. E. (1996). Usefulness of the SF-36 Health Survey in measuring health outcomes in the depressed elderly. J Geriatr Psychiatry Neurol, 9(1), 13-21.
Beck, A. T., Rial, W. Y., Rickets, K. (1974). Short form of Depression Inventory: Cross-validation. Psychological Reports, 34(3), 1184-1186.
Brazier, J., Roberts, J., Tsuchiya, A., Busschbach, J. (2004). A comparison of the EQ-5D and SF-6D across seven patient groups. Health Econ, 13, 873-884.
Brazier, J., Usherwood, T., Harper, R., Thomas, K. (1998). Deriving a preference-based single index from the UK SF-36 Health Survey. J Clin Epidemiol, 51, 1115-1128.
Brazier, J. E., Walters, S. J., Nicholl, J. P., Kohler, B. (1996). Using the SF-36 and EuroQol on an elderly population. Quality of Life Research, 5, 195-204.
Brazier, J., Roberts, J., Deverill, M. (2002). The estimation of a preference-based measure of health from the SF-36. J Health Econ, 21, 271-292.
Brazier, J. E., Harper, R., Jones, N. M. B. et al. (1992). Validating the SF-36 health survey questionnaire: new outcome measure for primary care. BMJ, 305, 160-164.
Buchwald, D., Pearlman, T., Umali, J., Schmaling, K., Katon, W. (1996). Functional status in patients with chronic fatigue syndrome, other fatiguing illnesses, and healthy individuals. Am J Med, 101, 364-370.
Ciconelli, R. M. (1997). Translation and validation to the Portuguese of the Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36) [doctoral thesis]. Federal University of São Paulo, São Paulo, Brazil.
Colantonio, A., Dawson, D. R., McLellan, B. A. (1998). Head injury in young adults: long-term outcome. Arch Phys Med Rehabil, 79, 550-558.
Dallmeijer, A. J., de Groot, V., Roorda, L. D., Schepers, V. P. M., Lindeman, E., van den Berg, L. H., Beelen, A., Dekker, J. (2007). Cross-diagnostic validity of the SF-36 physical functioning scale in patients with stroke, multiple sclerosis and amyotrophic lateral sclerosis: A study using Rasch analysis. J Rehabil Med, 39, 163-169.
de Haan, R. J. (2002). Measuring quality of life after stroke using the SF-36. Stroke, 33, 1176-1177.
Dorman, P., Slattery, J., Farrell, B., Dennis, M., Sandercock, P. (1998). Qualitative comparison of the reliability of health status assessments with the EuroQol and SF-36 Questionnaires After Stroke. Stroke, 29, 63-68.
Dorman, P. J., Dennis, M., Sandercock, P. (1999). How do scores on the EuroQol relate to scores on the SF-36 after stroke? Stroke, 30(10), 2146-2151.
Duncan, P. W., Samsa, G. P., Weinberger, M., Goldstein, L. B., Bonito, A., Witter, D. M., Enarson, C., Matchar, D. (1997). Health status of individuals with mild stroke. Stroke, 28, 740-745.
Essink-Bot, M. A., Krabbe, P. F., Bonsel, G. J., Aaronson, N. K. (1997). An empirical comparison of four generic health status measures: The Nottingham Health Profile, the Medical Outcomes Study 36-Item Short-Form Health Survey, the COOP/WONCA Charts, and the EuroQol Instrument. Med Care, 35(5), 522-537.
Fielder, H., Denholm, S. W., Lyons, R. A., Fielder, C. P. (1996). Measurement of health status in patients with vertigo. Clin Otolaryngol, 21, 124-126.
Fukuhara, S., Ware, J. E., Kosinski, M., Wada, S., Gandek, B. (1998). Psychometric and clinical tests of validity of the Japanese SF-36 Health Survey. J Clin Epidemiol, 51, 1045-1053.
Hagen, S., Bugge, C., Alexander, H. (2003). Psychometric properties of the SF-36 in the early post-stroke phase. Journal of Advanced Nursing, 44(5), 461-468.
Harwood, R. H., Ebrahim, S. (2000). A comparison of the responsiveness of the Nottingham extended activities of daily living scale, London handicap scale, and SF-36. Disability & Rehabilitation, 22(17), 786-793.
Hayes, V., Morris, J., Wolfe, C., Morgan, M. (1995). The SF-36 Health Survey Questionnaire: Is it suitable for use with older adults? Age and Ageing, 24, 120-125.
Hilari, K., Byng, S., Lamping, D. L., Smith, S. C. (2003). Stroke and Aphasia Quality of Life Scale-39 (SAQOL-39): Evaluation of acceptability, reliability, and validity. Stroke, 34, 1944-1950.
Hobart, J. C., Williams, L. S., Moran, K., Thompson, A. J. (2002). Quality of life measurement after stroke: Uses and abuses of the SF-36. Stroke, 33, 1348-1356.
Jenkinson, C., Coulter, A., Wright, L. (1993). Short form 36 (SF36) health survey questionnaire: Normative data for adults of working age. BMJ, 306(6890), 1437-1440.
Jenkinson, C., Wright, L., Coulter, A. (1994). Criterion validity and reliability of the SF-36 in a population sample. Quality of Life Research, 3(1), 7-12.
Jenkinson, C., Stewart-Brown, S., Petersen, S., Paice, C. (1999). Assessment of the SF-36 version 2 in the United Kingdom. J Epidemiol Community Health, 53(1), 46-50.
Komaroff, A. L., Fagioli, L. R., Doolittle, T. H., Gandek, B., Gleit, M. A., Gueriero, R. T., et al. (1996). Health status in patients with chronic fatigue syndrome and in general population and disease comparison groups. Am J Med, 101, 281-290.
Lai, S-M., Perera, S., Duncan, P. W., Bode, R. (2003). Physical and social functioning after stroke: Comparison of the Stroke Impact Scale and Short Form-36. Stroke, 34, 488-493.
Lalonde, L., Clarke, A. E., Joseph, L., Mackenzie, T., Grover, S. A. (1999). Comparing the psychometric properties of preference-based and nonpreference-based health-related quality of life in coronary heart disease. Qual Life Res, 8, 399-409.
Lyons, R. A., Perry, H. M., Littlepage, B. N. C. (1994). Evidence for the validity of the Short-Form 36 Questionnaire (SF-36) in an elderly population. Age and Ageing, 23, 182-184.
Mathias, S. D., Bates, M. M., Pasta, D. J., Cisternas, M. G., Feeny, D., Patrick, D. L. (1997). Use of the Health Utilities Index with stroke patients and their caregivers. Stroke, 28, 1888-1894.
Mayo, N. E., Wood-Dauphinee, S., Cote, R., Durcan, L., Carlton, J. (2002). Activity, Participation, and Quality of Life 6 Months Poststroke. Arch Phys Med Rehabil, 83, 1035-1042.
McDowell, I., Newell, C. (1996). Measuring Health. A Guide to Rating Scales and Questionnaires. 2nd ed. New York: Oxford University Press.
McHorney, C. A. (1996). Measuring and monitoring general health status in elderly persons: Practical and methodological issues in using the SF-36 health survey. The Gerontologist, 36(5), 571-583.
McHorney, C. A., Ware, J. E. Jr., Raczek, A. E. (1993). The MOS 36-Item Short-Form Health Survey (SF-36): II Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care, 31, 247-263.
McHorney, C. A., Ware, J. E. Jr., Lu, J. F., Sherbourne, C. D. (1994). The MOS 36-item Short-Form Health Survey (SF-36): III Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care, 32, 40-66.
Mossberg, K., McFarland, C. (2001). A patient-oriented health status measure in outpatient rehabilitation. Am J Phys Med Rehabil, 80(12), 896-902.
Muller-Nordhorn, J., Nolte, C. H., Rossnagel, K., Jungehulsing, G. J., Reich, A., Roll, S., Villringer, A., Willich, S. N. (2004). Responsiveness to change of the SF-12 in patients with cerebrovascular disease. Biometrical Journal, 46(S1), 50.
Myers, C., Wilks, D. (1999). Comparison of Euroqol EQ-5D and SF-36 in patients with chronic fatigue syndrome. Qual Life Res, 8, 9-16.
Nemeth, G. (2006). Health related quality of life outcome instruments. European Spine Journal, 15(1), S44-S51.
Nortvedt, M. W., Riise, T., Myhr, K. M., Nyland, H. I. (1999). Quality of life in multiple sclerosis: measuring the disease effects more broadly. Neurology, 53, 1098-1103.
O’Mahony, P. G., Rodgers, H., Thomson, R. G., Dobson, R., James, O. F. W. (1998). Is the SF-36 suitable for assessing health status of older stroke patients? Age and Ageing, 27, 19-22.
O’Neill, P., Kelly, P. (1996). Postal questionnaire study of disability in the community associated with psoriasis. Br Med J, 313, 919-921.
Petrou, S., Hockley, C. (2005). An investigation into the empirical validity of the EQ-5D and SF-6D based on hypothetical preferences in a general population. Health Econ, 14, 1169-1189.
Ren, X. S., Amick, B., Zhou, L., et al. (1998). Translation and Psychometric Evaluation of a Chinese Version of the SF-36 Health Survey in the U.S. J Clin Epidemiol, 51(11), 1129.
Rothwell, P. M., McDowell, Z., Wong, C. K., Dorman, P. J. (1997). Doctors and patients don’t agree: cross sectional study of patients’ and doctors’ perceptions and assessments of disability in multiple sclerosis. British Med J, 314, 1580-1583.
Rumsfeld, J. S., MaWhinney, S., McCarthy, M., Shroyer, A. L., VillaNueva, C. B., O’Brien, M., Moritz, T. E., Henderson, W. G., Grover, F. L., Sethi, G. K., Hammermeister, K. E. (1999). Health-related quality of life as a predictor of mortality following coronary artery bypass graft surgery. Participants of the Department of Veterans Affairs Cooperative Study Group on Processes, Structures, and Outcomes of Care in Cardiac Surgery. JAMA, 281(14), 1298-1303.
Ruta, D. A., Garratt, A. M., Wardlaw, D., Russell, I. T. (1994). Developing a valid and reliable measure of health outcome for patients with low back pain. Spine, 19, 1887-1896.
Segal, M. E., Schall, R. R. (1994). Determining functional/health status and its relation to disability in stroke survivors. Stroke, 25, 2391-2397.
The Canadian Burden of Illness Study Group. (1998). Burden of illness of multiple sclerosis: part II: quality of life. Can J Neurol Sci, 25, 31-38.
The Counselling Versus Antidepressants in Primary Care Study Group. (1999). How disabling is depression? Evidence from a primary care sample. Br J Gen Pract, 49(439), 95-98.
Walters, S. J., Munro, J. F., Brazier, J. E. (2001). Using the SF-36 with older adults: A cross-sectional community-based survey. Age and Ageing, 30, 337-343.
Ware, J. E., Kosinski, M., Dewey, J. E., Gandek, B. (2001). How to Score and Interpret Single-Item Health Status Measures: A Manual for Users of the SF-8 Health Survey. Lincoln RI: QualityMetric Incorporated.
Ware, J. E., Kosinski, M., Keller, S. D. (1994). SF-36 Physical and Mental Health Summary Scales: A User’s Manual. Boston, MA: The Health Institute.
Ware, J. E. Jr., Sherbourne, C. D. (1992). The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care, 30, 473-483.
Ware, J. Jr., Kosinski, M., Keller, S. D. (1996). A 12-item short-form health survey: Construction of scales and preliminary tests of reliability and validity. Med Care, 34(3), 220-233.
Ware, J. E., Snow, K. K., Kosinski, M., Gandek, B. (1993). SF-36® Health Survey Manual and Interpretation Guide. Boston, MA: New England Medical Center, The Health Institute.
Ware, J. E., Kosinski, M., Turner-Bowker, D. M., Gandek, B. (2002). SF-12v2: How to score version 2 of the SF-12 Health Survey. Lincoln RI: QualityMetric Incorporated.
Weinberger, M., Oddone, E. Z., Samsa, G. P., Landsman, P. B. (1996). Are health-related quality-of-life measures affected by the mode of administration? J Clin Epidemiol, 49(2), 135-140.
Wilkinson, P. R., Wolfe, C. D., Warburton, F. G., Rudd, A. G., Howard, R. S., Ross-Russell, R. W., Beech, R. (1997). Longer term quality of life and outcome in stroke patients: Is the Barthel Index alone an adequate measure of outcome? Quality in Health Care, 6, 125-130.
Williams, L. S. (1998). Health-Related Quality of Life Outcomes in Stroke. Neuroepidemiology, 17, 116-120.

See the measure

How to obtain the SF-36

Permission to use the SF-36 should be obtained from the Medical Outcomes Trust, which oversees the standardized administration of the SF-36 and provides updates on administration and scoring (McDowell & Newell, 1996). Various computer applications are available to assist in scoring the SF-36, including free Excel templates that can be downloaded from the Internet.

All versions of the SF-36 can be viewed on the website www.qualitymetric.com.

Samples of the various versions of the SF-36 are also available on this website.