Action Research Arm Test (ARAT)

Evidence Reviewed as of before: 09-06-2011

Author(s)*: Sabrina Figueiredo, BSc

Editor(s): Lisa Zeltzer, MSc OT; Nicol Korner-Bitensky, PhD OT; Elissa Sitcoff, BA BSc

Purpose

The Action Research Arm Test (ARAT) is an evaluative measure to assess specific changes in limb function among individuals who sustained cortical damage resulting in hemiplegia (Lyle, 1981). It assesses a client’s ability to handle objects differing in size, weight and shape and therefore can be considered to be an arm-specific measure of activity limitation (Platz, Pinkowski, Kim, di Bella, & Johnson, 2005).

In-Depth Review

Purpose of the measure

The Action Research Arm Test (ARAT) is an evaluative measure to assess specific changes in limb function among individuals who sustained cortical damage resulting in hemiplegia (Lyle, 1981). It assesses a client’s ability to handle objects differing in size, weight and shape and therefore can be considered to be an arm-specific measure of activity limitation (Platz, Pinkowski, Kim, di Bella, & Johnson, 2005).

Available versions

The ARAT was developed by Ronald Lyle in 1981 by adapting the Upper Extremity Function Test (UEFT) (Carroll, 1965). The UEFT administration and scoring were simplified, the time required to administer the test was shortened, and items were grouped into hierarchical (Guttman) scales (Lang, Wagner, Dromerick, & Edwards, 2006). Because more specific and detailed instructions were needed regarding the client’s position, scoring and test administration, Yozbatiran, Der-Yeghiaian, and Cramer (2008) proposed a standardized approach to the ARAT.

Features of the measure

Items:

The ARAT consists of 19 items grouped into four subscales: grasp, grip, pinch, and gross movement. Each subscale constitutes a hierarchical Guttman scale, which means that all items are ordered according to ascending difficulty. In the ARAT, if the client succeeds in completing the most difficult item in a subscale, this suggests he/she will succeed in the easier items for that same subscale. Similarly, failure on an item suggests the client will be unable to complete the remaining more challenging items in the subscale.

According to the rules defined by Lyle (1981), the client must first try to perform the most difficult task in a subscale. If the maximum score (score = 3) is obtained for this task, then the maximum score for the entire subscale should be assigned, and the evaluator should move to the next subscale to be administered. When the client is unable to complete the most difficult item (scoring between 0 and 2), the easiest item in that subscale should be performed. If the client fails completely (score = 0) when performing the easiest task, then the intermediate items must not be tested, the entire subscale should be scored as zero, and the evaluator should move to the next subscale. However, if the client succeeds at the easiest task either partially (score = 1 or 2) or completely (score = 3), then all the other tasks in that same subscale should be tested before moving to the next subscale. Following these rules, the items administered will range from a minimum of 4 to a maximum of 19 (van der Lee, Roorda, & Lankhorst, 2002).
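The administration rules above can be sketched in code. The following is a minimal illustration in Python, not an official scoring tool: the names `SUBSCALE_SIZES`, `administer_subscale` and `administer_arat`, and the `rate` callback (which stands in for the evaluator observing and scoring one item), are assumptions introduced here. Items are assumed to be indexed from easiest (0) to hardest within each subscale.

```python
from typing import Callable, Dict, List, Tuple

# Number of items per subscale, as reported for the ARAT (Lyle, 1981).
# The dictionary order follows the testing sequence.
SUBSCALE_SIZES: Dict[str, int] = {
    "grasp": 6,
    "grip": 4,
    "pinch": 6,
    "gross_movement": 3,
}


def administer_subscale(n_items: int, rate_item: Callable[[int], int]) -> List[int]:
    """Apply the skip rules to one subscale.

    Items are indexed 0 (easiest) to n_items - 1 (hardest); this indexing is
    an assumption of the sketch. rate_item(i) is a hypothetical callback that
    returns the observed score (0-3) for item i.
    """
    hardest = n_items - 1
    hardest_score = rate_item(hardest)
    if hardest_score == 3:
        # Full score on the hardest item: credit the whole subscale.
        return [3] * n_items

    easiest_score = rate_item(0)
    if easiest_score == 0:
        # Complete failure on the easiest item: whole subscale scores zero.
        return [0] * n_items

    # Otherwise every remaining (intermediate) item is tested.
    scores = [0] * n_items
    scores[0] = easiest_score
    scores[hardest] = hardest_score
    for i in range(1, hardest):
        scores[i] = rate_item(i)
    return scores


def administer_arat(
    rate: Callable[[str, int], int]
) -> Tuple[Dict[str, List[int]], int]:
    """Run the four subscales in order; return item scores and items tested."""
    administered = 0
    results: Dict[str, List[int]] = {}
    for name, n_items in SUBSCALE_SIZES.items():

        def rate_item(i: int, name: str = name) -> int:
            # Count every item that is actually presented to the client.
            nonlocal administered
            administered += 1
            return rate(name, i)

        results[name] = administer_subscale(n_items, rate_item)
    return results, administered
```

In this sketch, a client who scores 3 on the hardest item of every subscale is tested on only 4 items, while a client who scores 2 on every item is tested on all 19, which matches the range quoted above.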

The ARAT must be administered in a formal setting, since a specially designed table and chair are required (see equipment section for more information). For the starting position, the client should be seated in a chair with a firm back and no armrests. The client’s trunk should be in contact with the back of the chair at all times during the test. Instructions about the required seating posture should be provided to the client before the test begins, and the client should be reminded to maintain this position whenever it is not respected. The client’s feet should be in contact with the floor throughout testing (van der Lee, DeGroot, Beckerman, Wagenaar, Lankhorst, & Bouter, 2001a; Yozbatiran et al., 2008). Both hands should be tested, beginning with the non- or less-affected hand, in order to practice and register baseline scores. Should the client be unable to understand the instructions for the required task, the evaluator should demonstrate the task and allow the client to try it as a trial (Yozbatiran et al., 2008). To facilitate recording the time for each task, the client’s hands should start and finish the task with palms down on the table. However, for the gross movement tasks, the client’s hands should be placed pronated on their lap (Lyle, 1981; Yozbatiran et al., 2008).

In the grasp and pinch subscales, testing materials are lifted 37 cm from the surface of the table to the top of the shelf. In the grip subscale, testing materials are moved from one side of the table to the other. Finally, in the gross movement subscale, the client is requested to place the hand being tested either behind his/her head, on top of his/her head, or to his/her mouth (Lyle, 1981; Hsieh, Hsueh, Chiang, & Lin, 1998; Hsueh, Lee, & Hsieh, 2002a). The proper sequence for testing is 1) grasp subscale, 2) grip subscale, 3) pinch subscale, 4) gross movement subscale (Lyle, 1981). The ARAT comes with simple instructions to guide the evaluator on scoring and administering the test (Lyle, 1981).

Scoring:

The ARAT is scored on a four-level ordinal scale (0-3) (Lyle, 1981).

0 = cannot perform any part of the test;
1 = performs the test partially;
2 = completes the test, but takes an abnormally long time;
3 = performs the test normally.

To facilitate scoring, time limits have been suggested (Wagenaar, Meijer, van Wierinen, Kuik, Hazenberg, Lindeboom, Wichers, & Rijswijk, 1990; Yozbatiran et al., 2008). Incorporating these time limits into Lyle’s scoring definitions, the scoring system becomes:

0 = cannot perform any part of the test;
1 = performs the test partially;
2 = completes the test, but takes an abnormally long time, varying from 5 to 60 seconds;
3 = performs the test normally in less than 5 seconds.

If a client takes more than 60 seconds to perform an item, the evaluator should interrupt after 60 seconds and a score of 1 is given on that specific item.
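The time-based scoring rule can be expressed as a small function. This is an illustrative sketch only: the function name `score_item` and its parameters are hypothetical, and it reduces "performs the test normally" to the time criterion alone, whereas an evaluator would also judge how the movement is performed.

```python
def score_item(performed_any_part: bool, completed: bool, seconds: float) -> int:
    """Return a 0-3 item score from the suggested time limits.

    All parameter names are assumptions of this sketch:
    performed_any_part -- the client managed at least part of the task
    completed          -- the client finished the task
    seconds            -- time taken (the evaluator interrupts at 60 s)
    """
    if not performed_any_part:
        return 0
    if not completed or seconds > 60:
        # Partial performance, or interrupted after 60 seconds.
        return 1
    if seconds < 5:
        # Completed in less than 5 seconds.
        return 3
    # Completed, but took between 5 and 60 seconds.
    return 2
```

For example, under these assumptions a task completed in exactly 5 seconds scores 2, and a task still unfinished at 60 seconds scores 1.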

The subscale scores range according to the number of items on each subscale, as follows:

Subscales on the ARAT	Number of items per subscale	Score ranges per subscale
Grasp subscale	6 items	Score 0-18
Grip subscale	4 items	Score 0-12
Pinch subscale	6 items	Score 0-18
Gross Movement subscale	3 items	Score 0-9

The total score on the ARAT ranges from 0 to 57, with the lowest score indicating that no movements can be performed, and the highest score indicating normal performance. Thus, higher scores indicate better performance (Lang et al., 2006; van der Lee et al., 2002). The ARAT score is a continuous measure, with no categorical cutoff scores. Therefore, the score obtained on the ARAT does not allow clients to be classified into categories such as normal, mildly limited, or severely limited.
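As a worked check of these ranges, the sketch below derives each subscale maximum from its item count (items x 3) and sums validated subscale scores into a total between 0 and 57. The names `ITEMS_PER_SUBSCALE`, `MAX_ITEM_SCORE`, `subscale_max` and `total_score` are assumptions made for illustration.

```python
from typing import Dict

MAX_ITEM_SCORE = 3  # each item is scored 0-3

# Item counts per subscale (Lyle, 1981).
ITEMS_PER_SUBSCALE: Dict[str, int] = {
    "grasp": 6,
    "grip": 4,
    "pinch": 6,
    "gross_movement": 3,
}


def subscale_max(name: str) -> int:
    """Maximum score for a subscale: number of items times 3."""
    return ITEMS_PER_SUBSCALE[name] * MAX_ITEM_SCORE


def total_score(subscale_scores: Dict[str, int]) -> int:
    """Validate the four subscale scores and return their sum (0-57)."""
    if set(subscale_scores) != set(ITEMS_PER_SUBSCALE):
        raise ValueError("expected one score for each of the four subscales")
    for name, score in subscale_scores.items():
        if not 0 <= score <= subscale_max(name):
            raise ValueError(
                f"{name} score {score} is outside 0-{subscale_max(name)}"
            )
    return sum(subscale_scores.values())
```

The subscale maxima are 18 + 12 + 18 + 9 = 57, which is the upper end of the total score range.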

Time:

The time required to complete the ARAT will depend on the number of items administered. Based on its hierarchical design, the ARAT was constructed to save testing time. Thus, no more than 7-10 minutes should be required to assess a client with stroke (De Weerdt & Harrison, 1985). However, if all 19 items are performed, the ARAT usually takes 20 minutes to administer (van der Lee et al., 2002). In one study by Hsieh and colleagues (1998), the ARAT took, on average, 8 minutes to administer to clients with stroke.

Subscales:

The ARAT is divided into four subscales: grasp, grip, pinch, and gross movement.

The grasp and pinch subscales have 6 items each, the grip subscale has 4 items, and the gross movement subscale has 3 items (Lyle, 1981).

Equipment:

Standardized equipment is required to administer the ARAT. It can be ordered only from representatives in the Netherlands. The cost of this equipment is approximately 850 Euros ($1200 CAD), with an additional delivery fee of 179 Euros ($252 CAD).

The complete ARAT kit consists of:

A specially designed table of 92 cm x 45 cm x 83 cm high, with a shelf of 93 cm x 10 cm, positioned 37 cm above the main surface of the table (Lyle, 1981; Hsueh et al., 2002a).
A chair with a backrest and no armrests, with the seat 44 cm above floor level (Lyle, 1981; Hsueh et al., 2002a).
Wooden blocks (cubes) with sides of 2.5, 5, 7.5 and 10 cm (Lyle, 1981; Hsueh et al., 2002a).
A cricket ball 7.5 cm in diameter (Lyle, 1981; Hsueh et al., 2002a).
Two alloy tubes: one 2.25 cm in diameter x 11.5 cm long, the second 1.0 cm in diameter x 16 cm long (Lyle, 1981; Hsueh et al., 2002a).
A washer and bolt, which is a type of screw with its anchor (Lyle, 1981; Hsueh et al., 2002a).
Two glasses (Lyle, 1981; Hsueh et al., 2002a).
A marble 1.5 cm in diameter (Lyle, 1981; Hsueh et al., 2002a).
A ball bearing 6 mm in diameter (Lyle, 1981; Hsueh et al., 2002a).
A stopwatch (Wagenaar et al., 1990; Yozbatiran et al., 2008).
Paper and pencil for the evaluator.

Training:

None typically reported.

Alternative forms of the Action Research Arm Test

None.

Client suitability

Can be used with:

The ARAT was constructed for assessing recovery of upper limb function following cortical damage (Lyle, 1981).
Clients with stroke.

Should not be used in:

When administering the ARAT to clients with a finger amputation, the pinch subscale should be scored as 0, as should all other tasks that require movement of an amputated body part (Yozbatiran et al., 2008).

In what languages is the measure available?

There are no official translations of the ARAT.

Nevertheless, some peer-reviewed publications from the Netherlands and Taiwan have used the ARAT as an outcome measure, which may indicate that instructions have been informally translated to other languages (Hsieh et al., 1998; Hsueh et al., 2002a; van der Lee et al., 2002).

Summary

What does the tool measure?	The ARAT measures specific changes in limb function among individuals who sustained cortical damage resulting in hemiplegia.
What types of clients can the tool be used for?	The ARAT can be used with, but is not limited to, clients with stroke.
Is this a screening or assessment tool?	Assessment
Time to administer	An average of 7 to 10 minutes.
Versions	There are no alternative versions.
Other Languages	There are no official translations.
Measurement Properties
Reliability	Internal consistency: One study examined the internal consistency of the ARAT and reported excellent internal consistency using Cronbach’s alpha. Test-retest: Three studies have examined the test-retest reliability of the ARAT. All reported excellent test-retest reliability using ICCs. Intra-rater: Four studies have examined the intra-rater reliability of the ARAT and reported excellent intra-rater reliability using Spearman rho correlation, intraclass correlation coefficients (ICC) and weighted kappa. Inter-rater: Seven studies examined the inter-rater reliability of the ARAT and reported excellent inter-rater reliability using Spearman rho correlation, ICC and weighted kappa.
Validity	Criterion: Concurrent: One study has examined the concurrent validity of the ARAT and reported adequate to excellent correlations with the Box and Block Test (BBT) and the Nine-Hole Peg Test (NHPT) at pre- and post-treatment. Predictive: No studies have examined the predictive validity of the ARAT. Construct: Convergent: Seven studies examined convergent validity of the ARAT and reported excellent correlations between the ARAT and the Brunnstrom-Fugl-Meyer test; the upper extremity subscale of the Motor Assessment Scale; the Motricity Index; the upper extremity movement of the Modified Motor Assessment Chart; the BBT; the motor function subscore of the Fugl-Meyer test; the Hemispheric Stroke Scale; upper extremity strength and grasp speed. Adequate correlations were reported between the ARAT and the passive joint motion/joint pain subscore of the Fugl-Meyer test, the Functional Independence Measure and spasticity. Poor correlations were reported between the ARAT and the sensation score of the Fugl-Meyer test, the Ashworth scale, the Modified Barthel Index, the National Institutes of Health Stroke Scale, light touch sensation and pain.
Floor/Ceiling Effects	– One study examined the floor/ceiling effects of the ARAT in clients with acute stroke and reported that at earlier phases of the stroke, floor effects were poor. At discharge from the acute rehabilitation ward, ceiling effects on the ARAT were adequate. – One study examined the floor/ceiling effects of the ARAT in stroke clients with mild to moderate hemiparesis and reported adequate floor and ceiling effects.
Sensitivity / Specificity	No studies have examined the specificity of the ARAT.
Does the tool detect change in patients?	Six studies have examined the responsiveness of the ARAT and reported that the ARAT has a moderate to large Standardized Response Mean, a moderate to large effect size and a large responsiveness ratio; it is therefore able to detect change in clients with stroke.
Acceptability	When administering the ARAT to clients with upper extremity amputations, attention is required when scoring (i.e., a score of 0 is given).
Feasibility	The administration of the ARAT is quick and simple, but requires standardized equipment.
How to obtain the tool?	Information on the ARAT can be obtained from the studies by Lyle (1981), Hsieh et al. (1998), van der Lee et al. (2002), Rabadi & Rabadi (2006), and Yozbatiran et al. (2008) and at the website: http://www.aratest.eu/Index_english.htm. Standardized equipment can be purchased from the following website: http://www.aratest.eu/ or from http://www.saliarehab.com/

Psychometric Properties

Overview

We conducted a literature search to identify all relevant publications on the psychometric properties of the Action Research Arm Test (ARAT) in individuals with stroke. We identified twelve studies. The ARAT appears to be floor effects.

Floor/Ceiling Effects

Hsueh and Hsieh (2002b) examined floor and ceiling effects for the ARAT and the Upper Extremity Motor Assessment Scale (Carr, Shepherd, Nordholm, & Lynne, 1985) in 48 clients with acute stroke. Participants were assessed at admission and discharge from an acute rehabilitation ward. At admission, the ARAT total score demonstrated a poor floor effect, with 52.1% of participants scoring 0. All subscales were also classified as having a poor floor effect: 72.9% of participants were unable to perform the pinch subscale, 70.8% were unable to perform the grasp and grip subscales and 52.1% were unable to complete the gross movement subscale. At discharge, the ARAT total score demonstrated an adequate ceiling effect, with only 7% of participants scoring the maximal value. When analyzing ARAT’s subscales individually, the gross movement subscale presented the poorest ceiling effect, with 29.2% of participants obtaining the maximum score, followed by the grasp subscale with 27% of participants. The grip and pinch subscales had the best classification, with adequate ceiling effects of 18.8% and 16.7%, respectively.

By comparison, at admission the Upper Extremity Motor Assessment Scale had 58% of participants scoring the minimal value, indicating a poor floor effect. However, at discharge the Upper Extremity Motor Assessment Scale demonstrated a smaller ceiling effect than the ARAT, with only 4.3% of participants obtaining the maximum score.
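The percentages reported above are the share of participants who obtained the lowest or highest possible score. A minimal sketch of that calculation is given below. It assumes an ARAT total score range of 0 to 57 (19 items scored 0 to 3) and a 20% cutoff between "adequate" and "poor"; the cutoff is consistent with the classifications reported above but is an assumption, as it is not stated in this section. The scores are hypothetical and are not data from Hsueh and Hsieh (2002b).

```python
def floor_ceiling(scores, min_score=0, max_score=57):
    """Return (floor %, ceiling %): share of scores at the minimum and maximum."""
    n = len(scores)
    floor_pct = 100 * sum(s == min_score for s in scores) / n
    ceiling_pct = 100 * sum(s == max_score for s in scores) / n
    return floor_pct, ceiling_pct


def classify(pct, cutoff=20.0):
    """Label an effect; the 20% cutoff is an assumption, not taken from the studies."""
    return "adequate" if pct <= cutoff else "poor"


# Hypothetical admission scores for 10 clients (illustration only).
admission = [0, 0, 0, 0, 0, 3, 12, 25, 40, 57]
floor_pct, ceiling_pct = floor_ceiling(admission)
print(floor_pct, classify(floor_pct))      # 50.0 poor
print(ceiling_pct, classify(ceiling_pct))  # 10.0 adequate
```

The same function could be applied to a subscale by passing that subscale's own minimum and maximum scores.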

Reliability

Internal Consistency:
Nijland et al. (2010) investigated the internal consistency of the ARAT in 40 patients with stroke with mild to moderate hemiparesis. Internal consistency of the ARAT, as calculated using Cronbach’s Coefficient Alpha, was excellent (α = 0.98).
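For readers unfamiliar with the statistic, Cronbach's alpha compares the sum of the individual item variances with the variance of the total score. The sketch below uses the standard formula, α = k/(k−1) × (1 − Σ item variances / variance of total score), on hypothetical scores for a three-item scale; it is an illustration only and does not reproduce the analysis or data of Nijland et al. (2010).

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """item_scores: one list per client, each holding that client's item scores."""
    k = len(item_scores[0])                       # number of items
    items = list(zip(*item_scores))               # regroup the scores by item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical scores for 4 clients on 3 items, each item scored 0 to 3.
scores = [
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 3],
    [3, 3, 3],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 3))  # 0.975
```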

Test-retest:
Note: From the descriptions provided of the following studies it appears that some authors called the testing test-retest reliability while others called the same analysis intra-rater reliability.

Lyle (1981) examined test-retest reliability in 20 individuals who sustained cortical damage, either from stroke or traumatic brain lesion. The mean age was 53 years, ranging from 26 to 72 years. Participants were re-assessed after a 1-week interval by the same rater and under the same conditions. The test-retest reliability, as calculated using Pearson correlation, was excellent (r = 0.98).
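As an illustration of how a test-retest coefficient of this kind is obtained, the sketch below computes a Pearson correlation between two sets of total scores taken one week apart. The scores are hypothetical and are not Lyle's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ARAT total scores for 6 clients, assessed one week apart.
first_session = [5, 12, 20, 33, 41, 57]
second_session = [6, 12, 22, 31, 43, 57]
print(round(pearson_r(first_session, second_session), 3))
```

A correlation coefficient measures association rather than agreement: if every client scored a fixed number of points higher at the second session, r would be unchanged.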

Hsueh, Lee, and Hsieh (2002a) evaluated test-retest reliability of the ARAT administered on a regular table, instead of the table specially designed for this test, in 61 individuals with sub-acute stroke and a mean age of 63 years. Participants were re-assessed after a two-day interval by the same rater. The test-retest reliability, as calculated using the Intraclass Correlation Coefficient (ICC), was excellent for the total score (ICC = 0.99) as well as for the grasp, grip, pinch and gross movement subscales (ICC = 0.99, 0.98, 0.96 and 0.95, respectively).

Platz, Pinkowski, van Wijck, Kim, di Bella, and Johnson (2005) estimated test-retest reliability for the ARAT, the Box and Block Test (Cromwell, 1965; Mathiowetz, Volland, Kashman, & Weber, 1985a), and the Fugl-Meyer Test upper extremity items (including items from the Motor function, Sensation and Passive Joint Motion/Joint pain subscores) (Fugl-Meyer, Jääskö, Leyman, Olsson, & Steglind, 1975) in 23 participants with upper extremity paresis from stroke, multiple sclerosis, or traumatic brain injury. Each participant’s most affected arm was re-assessed 1 week later by the same rater. The test-retest reliability of the ARAT total score, as calculated using ICCs and Spearman rho correlation, was excellent (ICC = 0.96 and rho = 0.96). Furthermore, test-retest reliabilities for each subscale were all excellent: grasp (ICC = 0.94 and rho = 0.96), grip (ICC = 0.94 and rho = 0.95), pinch (ICC = 0.89 and rho = 0.89) and gross movement (ICC = 0.97 and rho = 0.97).
Note: These results apply only to the most affected upper limb.

Intra-rater:
Wagenaar, Meijer, van Wierinen, Kuik, Hazenberg, Lindeboom, Wichers and Rijswijk (1990) evaluated intra-rater reliability in seven patients with acute stroke. The timeframe for assessments was not provided by the authors. Intra-rater reliability, as calculated using Spearman rho correlation, was excellent (rho = 0.99).
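Spearman rho, used here and in several of the studies that follow, is a correlation computed on ranks rather than on raw scores. A minimal sketch, assuming tied scores receive the average of their ranks, is shown below; the two sets of ratings are hypothetical and are not data from Wagenaar et al. (1990).

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Hypothetical total scores given by one rater to 7 clients on two occasions.
first_rating = [0, 4, 10, 18, 30, 44, 57]
second_rating = [0, 5, 9, 20, 29, 45, 57]
print(spearman_rho(first_rating, second_rating))  # 1.0 (identical rank order)
```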

Van der Lee, DeGroot, Beckerman, Wagenaar, Lankhorst, and Bouter (2001a) estimated intra-rater reliability in 20 patients with chronic stroke and a median age of 62 years. Participants were evaluated by the same rater at three points in time. At the baseline assessment participants were videotaped. The second assessment was 4-27 months following the first assessment, and the final assessment was 4-6 weeks after. Scoring of the last two assessments was based on the videotape recorded at baseline. Intra-rater reliability results were analyzed between the first two assessments, where scoring sources were different (live vs. videotape), and between the last two assessments, where scoring sources were the same (videotape only). Intra-rater reliability, as calculated using ICC and Spearman rho correlation, was excellent (ICC = 0.99 and rho = 0.99), independent of scoring sources. Intra-rater reliability, as calculated using weighted kappa, was also excellent: scoring with the same information source resulted in kappa = 1.00, versus only a slightly lower kappa when scoring from two different information sources (kappa = 0.94). The gross movement subscale showed the lowest weighted kappa value (kappa = 0.83), suggesting that this subscale had the lowest agreement level.
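To make the weighted kappa values above more concrete, the following is a minimal sketch of how a weighted kappa can be computed for two sets of ordinal ratings. It assumes linear disagreement weights and item scores on a 0-3 ordinal scale; the review does not state which weighting scheme the study used, and the function name and the ratings below are hypothetical illustrations, not data from Van der Lee et al. (2001a).

```python
def weighted_kappa(ratings_1, ratings_2, categories):
    """Cohen's weighted kappa with linear disagreement weights (an assumption).

    ratings_1, ratings_2: ordinal scores given to the same items on two occasions
    (or by two raters). categories: the ordered list of possible scores.
    Undefined (division by zero) if all ratings fall in a single category.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_1)

    # Observed joint proportions for each pair of categories.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_1, ratings_2):
        observed[index[a]][index[b]] += 1 / n

    # Marginal proportions for each set of ratings.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed disagreement vs. disagreement expected by chance.
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at maximum distance
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]

    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical item scores (0-3) from a first and a second scoring occasion.
first_scoring = [0, 1, 2, 3, 3, 2, 1, 0]
second_scoring = [0, 1, 2, 3, 2, 2, 1, 0]
print(weighted_kappa(first_scoring, second_scoring, [0, 1, 2, 3]))
```

Because the weights grow with the distance between the two scores, a one-point disagreement lowers kappa less than a three-point disagreement would, which is why weighted kappa is often preferred for ordinal scales.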

Yozbatiran, Der-Yeghiaian, and Cramer (2008) examined intra-rater reliability in 8 clients with chronic stroke. Participants were re-assessed by the same rater and under the same conditions with a 1-week interval. Intra-rater reliability for the total score, as calculated using ICC and Spearman rho correlation, was excellent (ICC = 0.99 and rho = 0.99). Additionally, the same excellent level of intra-rater reliability was found for the grasp, grip, pinch, and gross movement subscales (ICC = 0.98 and rho = 0.93; ICC = 0.97 and rho = 0.93; ICC = 0.99 and rho = 0.98; ICC = 0.93 and rho = 0.91, respectively).

Nijland et al. (2010) investigated the psychometric properties of the ARAT and Wolf Motor Function Test in 40 patients with stroke with mild to moderate hemiparesis. Eighteen patients participated in the reproducibility testing of the ARAT and were assessed twice by the same observer approximately 10 days apart. Intra-rater reliability, as analyzed using the ICC, was found to be excellent (ICC = 0.97).

Inter-rater:
Lyle (1981) examined inter-rater reliability in 20 individuals who had sustained cortical damage, either from stroke or traumatic brain injury. The mean age was 53 years, ranging from 26 to 72 years. Participants were assessed independently by two different raters. Agreement between raters, as calculated using Pearson correlation, was excellent (r = 0.99).

Hsieh, Hsueh, Chiang, and Lin (1998) assessed inter-rater reliability in 50 clients with stroke. Their mean age was 65 years. Participants were evaluated independently, on three different days, by three raters. ICC for the total score showed excellent agreement (ICC = 0.98). Agreement between raters was also excellent for the grasp, grip, pinch and gross movement subscales (ICC = 0.98; ICC = 0.96; ICC = 0.96; ICC = 0.95, respectively).
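As an illustration of how an ICC for several raters can be obtained, the sketch below computes one common form, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a participants-by-raters table. Which ICC form the studies in this section used is not stated in this review, so the choice of ICC(2,1) is an assumption, and the function name and the total scores shown are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: list of rows, one row per participant, one column per rater.
    """
    n, k = len(scores), len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Partition the total sum of squares into participants, raters and error.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-participant mean square
    ms_cols = ss_cols / (k - 1)                # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical ARAT total scores for five participants, each scored by three raters.
totals = [
    [57, 56, 57],
    [40, 41, 39],
    [22, 22, 23],
    [10, 9, 10],
    [3, 3, 4],
]
print(icc_2_1(totals))
```

When raters differ by only a point or two while participants span most of the scale, as in this made-up table, the between-participant variance dominates and the ICC approaches 1.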

Van der Lee et al. (2001a) estimated inter-rater reliability in 20 patients with chronic stroke and a median age of 62 years. Participants were videotaped and scored independently by two raters. Inter-rater reliability, as calculated using ICC, weighted kappa, and Spearman rho correlation, was excellent (ICC = 0.98; kappa = 0.93; rho = 0.99). With respect to the individual subscales, the gross movement subscale had the lowest weighted kappa value (kappa = 0.87), suggesting this subscale has the lowest agreement between raters.

Hsueh, Lee, and Hsieh (2002a) evaluated inter-rater reliability of the ARAT performed at a regular table, instead of the table specially designed for this test, in 61 individuals with sub-acute stroke and a mean age of 63 years. Participants were re-assessed with a two-day interval by three different raters. ICC showed excellent agreement for the total score (ICC = 0.99) as well as for the grasp, grip, pinch and gross movement subscales (ICC = 0.99; ICC = 0.98; ICC = 0.96; ICC = 0.94, respectively).

Platz et al. (2005) analyzed inter-rater reliability of the ARAT, the Box and Block Test and the Fugl-Meyer Test upper extremity items (including items from the Motor Function, Sensation and Passive Joint Motion/Joint Pain subscores) in 44 individuals with upper limb paresis either from stroke, multiple sclerosis, or traumatic brain injury. Participants had the most affected arm videotaped and scored independently by two raters. Inter-rater reliability for the ARAT total score, as calculated using the ICC and Spearman rho correlation, was excellent (ICC = 0.99 and rho = 0.99). Additionally, scores were provided for each subscale, and inter-rater reliability was excellent for the grasp (ICC = 0.99 and rho = 0.99), grip (ICC = 0.96 and rho = 0.95), pinch (ICC = 0.99 and rho = 0.99) and gross movement (ICC = 0.98 and rho = 0.98) subscales.
Note: These results apply only to the most affected upper limb.

Yozbatiran et al. (2008) evaluated inter-rater reliability in 9 clients with chronic stroke. Participants were scored simultaneously and independently by two raters. Inter-rater reliability for the total score, as calculated using the ICC and Spearman rho correlation, was excellent (ICC = 0.99 and rho = 0.96). The same excellent level of inter-rater reliability was found for the grasp, grip, pinch and gross movement subscales (ICC = 0.99 and rho = 1; ICC = 0.99 and rho = 0.99; ICC = 0.99 and rho = 0.98; ICC = 0.97 and rho = 0.93, respectively).

Nijland et al. (2010) investigated the psychometric properties of the ARAT and Wolf Motor Function Test in 40 patients with stroke with mild to moderate hemiparesis. Eighteen patients participated in the reproducibility testing of the ARAT and were assessed in random order by two observers, within one week. Inter-rater reliability, as analyzed using the ICC, was found to be excellent (ICC = 0.92).

Validity

Content:

Lyle (1981) generated the 19 ARAT items from the 33 items of the Upper Extremity Function Test (UEFT; Carroll, 1965). Items were removed if they showed a low inter-item correlation, were redundant (confirmed through a very high inter-item correlation, above r = 0.9), or were extremely difficult to perform. Nevertheless, ARAT items were not based on a theoretical model (Finch, Brooks, Stratford, & Mayo, 2002).

Criterion:

Concurrent:
No gold standard exists against which to compare the ARAT.

Lin, Chuang, Wu, Hsieh and Chang (2010) compared the concurrent validity of the ARAT, Box and Block Test (BBT) and Nine-Hole Peg Test (NHPT) for evaluating hand dexterity in 59 patients with stroke. The Fugl-Meyer Assessment of Sensorimotor Recovery After Stroke (FMA), Motor Activity Log (MAL) and Stroke Impact Scale (SIS) were also administered to assess the concurrent validity of the ARAT, BBT and NHPT. Using the Spearman rank correlation coefficient, the ARAT, BBT and NHPT were found to have adequate to excellent correlations at pre-treatment (ranging from rho = -0.55 to -0.80) and post-treatment (ranging from rho = -0.57 to -0.71). In addition, the ARAT and BBT were found to have adequate correlations with the FMA, MAL and SIS (ranging from rho = 0.31 to 0.59); however, the NHPT had only poor to adequate correlations with the FMA and MAL (ranging from rho = -0.16 to -0.33) and adequate to excellent correlations with the SIS (ranging from rho = -0.58 to -0.66). When considering the results of both the responsiveness and validation components of the study, the ARAT and BBT are believed to be more appropriate than the NHPT for evaluating dexterity.

Predictive:
No studies have examined the predictive validity of the ARAT.

Construct:

Convergent/Discriminant:
DeWeerdt and Harrison (1985) evaluated the convergent validity of the ARAT by comparing it to the Fugl-Meyer test (Fugl-Meyer et al., 1975) in 53 clients with acute stroke. Their mean age was 68 years. Correlations were calculated at two points in time after stroke onset using the Spearman correlation coefficient. Excellent correlations were found between the ARAT and Fugl-Meyer test at 2 months (rho = 0.91) and at 8 months (rho = 0.94) post-stroke.

Wagenaar, Meijer, van Wieringen, Kuik, Hazenberg, Lindeboom, Wichers and Rijswijk (1990) evaluated the convergent validity of the ARAT by comparing it to the Sollerman test (Jacobson-Sollerman & Sperling, 1977) in seven patients with acute stroke. An excellent correlation, as calculated using Spearman rho, was found (rho = 0.94).
Note: The Sollerman test measures hand grip function using 20 different daily life activities requiring hand movements.

Hsieh et al. (1998) assessed convergent validity of the ARAT by comparing it to the Upper Extremity portion of the Motor Assessment Scale (Carr et al., 1985), the arm subscale of the Motricity Index (Demeurisse, Demol, & Robaye, 1980), and the upper extremity movements of the Modified Motor Assessment Chart (Lindmark & Hamrin, 1988) in 50 clients with stroke. The mean age of clients was 65 years. Correlations were calculated using Pearson correlation coefficients. Excellent correlations were found between the ARAT and the Upper Extremity part of the Motor Assessment Scale (r = 0.96), the Motricity Index (r = 0.87) and the upper extremity movements of the Modified Motor Assessment Chart (r = 0.94).

Platz et al. (2005) tested convergent validity of the ARAT by comparing it to the Box and Block Test (Cromwell, 1965; Mathiowetz et al., 1985a), the Fugl-Meyer Test upper extremity items (including items from the Motor Function, Sensation and Passive Joint Motion/Joint Pain subscores) (Fugl-Meyer et al., 1975), the Motricity Index (Demeurisse et al., 1980), the Ashworth Scale (Ashworth, 1964), the Hemispheric Stroke Scale (Adams, Meador, Sethi, Grotta, & Thomson, 1986) and the Modified Barthel Index (Collin, Wade, Davies, & Horne, 1988) in 56 participants with upper extremity paresis either from stroke (n=37), multiple sclerosis (n=14), or traumatic brain injury (n=5). Correlations were calculated using the Spearman correlation coefficient. Excellent correlations were found between the ARAT and the Box and Block Test (rho = 0.95), the Motor Function subscore of the Fugl-Meyer Test (rho = 0.92), the Motricity Index (rho = 0.81), and the Hemispheric Stroke Scale (rho = -0.66). Adequate correlations were found between the ARAT and the Passive Joint Motion/Joint Pain subscore of the Fugl-Meyer Test (rho = 0.42). Poor correlations were found between the ARAT and the Sensation subscore of the Fugl-Meyer Test (rho = 0.29), the Ashworth Scale (rho = -0.29) and the Modified Barthel Index (rho = 0.04).
Note: Negative correlations are observed because a high score on the ARAT indicates normal performance, whereas a low score on the Hemispheric StrokeAlso called a "brain attack" and happens when brain cells die because of inadequate blood flow. 20% of cases are a hemorrhage in the brain caused by a rupture or leakage from a blood vessel. 80% of cases are also know as a "schemic stroke", or the formation of a blood clot in a vessel supplying blood to the brain. Scale and the Ashworth Scale indicates normal performance.
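The sign convention described in the note can be illustrated with a small sketch. It computes a Spearman rank-order coefficient by ranking each variable and then correlating the ranks. The scores below are hypothetical values invented for illustration (they are not data from Platz et al., 2005), and the function names are assumptions.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five clients (illustrative only).
arat = [57, 40, 22, 10, 3]       # higher = better arm function
impairment = [0, 1, 1, 3, 4]     # higher = more impairment (reverse-keyed)

rho = spearman_rho(arat, impairment)
rho_flipped = spearman_rho(arat, [-s for s in impairment])
print(round(rho, 2), round(rho_flipped, 2))
```

Reversing the direction of one scale flips the sign of rho but leaves its magnitude unchanged, which is why the strength of a negative correlation is judged on its absolute value.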

Lang, Wagner, Dromerick, and Edwards (2006) evaluated the convergent validity of the ARAT in 50 individuals with acute to subacute stroke (mean age 63 years) attending an acute neurology stroke service, at three points in time: admission (day 0); post intervention (day 14); and 90 days post-stroke (day 90). The ARAT was compared to measures of sensorimotor impairment (e.g. light touch sensation, pain, elbow joint spasticity, upper extremity strength), to kinematic measures (e.g. reach and grasp), to the Functional Independence Measure (FIM) (Keith, Granger, Hamilton, & Sherwin, 1987), and to the National Institutes of Health Stroke Scale (NIHSS) (Brott, Adams, Olinger, Marler, Barsan, Biller, et al., 1989). At day 0, excellent correlations were found between the ARAT and upper extremity strength (r = 0.60) and grasp speed (r = 0.60). Adequate correlations were found between the ARAT and grasp efficiency (r = 0.42), reach efficiency (r = -0.38), reach speed (r = 0.40), and the FIM upper extremity score (r = 0.38). Poor correlations were found between the ARAT and the NIHSS (r = -0.15), light touch sensation (r = 0.15), pain (r = 0.10), elbow joint spasticity (r = -0.28) and the FIM total score (r = 0.20). At day 14, excellent correlations were found between the ARAT and grasp efficiency (r = 0.60) and the FIM upper extremity score (r = 0.62). Adequate correlations were found between the ARAT and elbow spasticity (r = 0.49), upper extremity strength (r = 0.42), reach efficiency (r = -0.58), grasp speed (r = 0.36) and the FIM total score (r = 0.52). Poor correlations were found between the ARAT and the NIHSS (r = -0.24), light touch sensation (r = -0.20), and pain (r = -0.12). At day 90, an excellent correlation was found between the ARAT and upper extremity strength (r = 0.60). Adequate correlations were found between the ARAT and elbow spasticity (r = -0.42), reach efficiency (r = -0.42), reach speed (r = 0.50), grasp efficiency (r = -0.48), grasp speed (r = 0.38) and the FIM upper extremity (r = 0.42) and total scores (r = 0.40). Poor correlations were found between the ARAT and the NIHSS (r = -0.29), light touch sensation (r = 0.00), and pain (r = 0.22). In summary, the findings of this study suggest that the NIHSS, light touch sensation, and pain do not relate to the ARAT. The relationship between the ARAT and FIM scores is stronger early on post-stroke and stabilizes by the ninetieth day.

Rabadi and Rabadi (2006) examined the convergent validity of the ARAT by comparing it to the Fugl-Meyer Assessment (Fugl-Meyer et al., 1975) at admission and discharge from an acute stroke rehabilitation unit in 104 inpatients with acute stroke and a mean age of 72 years. The correlation between the ARAT and the Fugl-Meyer Assessment was excellent both at admission (rho = 0.77) and discharge (rho = 0.87).

Yozbatiran et al. (2008) estimated the convergent validity of the ARAT by comparing it to the arm motor Fugl-Meyer Assessment (Fugl-Meyer et al., 1975) score in 12 clients with chronic stroke and a mean age of 61 years. An excellent correlation (r = 0.94) was found between the ARAT and the arm motor Fugl-Meyer score.

Known groups:
No studies have examined the known groups validity of the ARAT.

Responsiveness

Van der Lee, Beckerman, Lankhorst, and Bouter (2001a) evaluated the responsiveness of the ARAT and the Fugl-Meyer Assessment (Fugl-Meyer et al., 1975) in 22 clients with chronic stroke (mean age 58 years) receiving intensive forced use treatment. Participants were assessed two weeks pre- and two weeks post-treatment. A responsiveness ratio was calculated. Compared to the Fugl-Meyer Assessment, the ARAT had a greater responsiveness ratio (2.03 for the ARAT vs. 0.41 for the Fugl-Meyer), suggesting that the ARAT is more sensitive to change.
Note: The responsiveness ratio is a variant of effect size, and higher values indicate better responsiveness.

Van der Lee, Roorda, Beckerman, and Lankhorst (2002) estimated the responsiveness of a modified version of the ARAT in 63 participants with chronic stroke. In this study, researchers did not follow Lyle's standardized instructions. Instead, they administered all 19 ARAT items to verify any possible effect of this format on its psychometric properties. A responsiveness ratio was calculated. Compared to the hierarchical version proposed by Lyle, performing all 19 items was found to improve the measure's responsiveness, with a responsiveness ratio of 1.7 compared to 1.2 with Lyle's version.
Note: The responsiveness ratio can be considered an estimate of effect size normalized to the variability in a stable population, and higher values indicate better responsiveness.

Hsueh and Hsieh (2002b) analyzed the responsiveness of the ARAT and the upper extremity section of the Motor Assessment Scale (Carr et al., 1985) in 48 participants with acute stroke and a mean age of 62 years. Participants were assessed at two points in time: admission and discharge from the acute rehabilitation centre. The ARAT total score demonstrated a moderate effect size of 0.52, while the Motor Assessment Scale total score demonstrated a small effect size of 0.45.
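As a rough sketch of how such effect sizes can be computed and labelled, the example below assumes one common definition (mean change divided by the standard deviation of the baseline scores) and Cohen's conventional benchmarks of 0.2, 0.5 and 0.8. The source does not state which formula or cut-offs the study used, so both are assumptions, and the scores are hypothetical.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of baseline scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def label(es):
    """Label an effect size using Cohen's conventional benchmarks."""
    magnitude = abs(es)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.2:
        return "small"
    return "negligible"


# Hypothetical ARAT totals at admission and discharge (illustrative only).
admission = [10, 22, 5, 30, 41, 18]
discharge = [18, 25, 15, 38, 45, 20]

es = effect_size(admission, discharge)
print(round(es, 2), label(es))
```

Under these benchmarks the reported values of 0.52 and 0.45 fall on either side of the 0.5 cut-off, which matches the "moderate" and "small" labels in the paragraph above.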

Lang et al. (2006) examined the responsiveness of the ARAT in 50 participants with acute to subacute stroke (mean age 63 years) receiving constraint-induced movement therapy (CIMT). Assessments were performed at three points in time: baseline, immediately post-treatment, and 2.5 months post-treatment. Effect sizes and responsiveness ratios were calculated. ARAT total and subscale scores at the first follow-up evaluation were similar, with moderate to large effect sizes (ARAT total score = 1.01; grasp subscore = 1.04; pinch subscore = 0.85; grip subscore = 1.01; and gross movement subscore = 0.72). The second follow-up evaluation demonstrated large effect sizes, each higher than at the first evaluation (ARAT total score = 1.39; grasp subscore = 1.22; pinch subscore = 1.49; grip subscore = 1.32; and gross movement subscore = 0.98). The responsiveness ratio for the ARAT total score was 5.2 at the first follow-up evaluation and 7.0 at the second. These two responsiveness estimates suggest that the ARAT is a sensitive tool for detecting change even months after stroke onset.
Note: The responsiveness ratio is a variant of effect size, and higher values indicate better responsiveness.

Rabadi and Rabadi (2006) assessed the responsiveness of the ARAT and the Fugl-Meyer Assessment (Fugl-Meyer et al., 1975) in 104 participants with acute stroke and a mean age of 72 years, undergoing inpatient rehabilitation. Participants were evaluated at admission and discharge from acute care. The Standardized Response Mean (SRM) was used to calculate responsiveness. Of these two upper extremity tests, the ARAT was less sensitive than the Fugl-Meyer Assessment (SRM = 0.68 and 0.74, respectively). However, since the difference between the SRMs for these two measures was minimal, the tests can be considered equally sensitive to change during inpatient acute rehabilitation. This result is contrary to the one presented by Van der Lee et al. (2002). The discrepancy may be due to differences in the age and stroke severity of the two study populations.
Note: The SRM is a variant of effect size, calculated by dividing the mean change by the standard deviation of the change scores. Higher values indicate better responsiveness.
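The SRM is the mean change divided by the standard deviation of the change scores. A minimal sketch of that arithmetic follows, using hypothetical admission and discharge scores that are not taken from any of the studies above; the function name is an assumption.

```python
from statistics import mean, stdev


def standardized_response_mean(before, after):
    """Mean of the change scores divided by their standard deviation."""
    changes = [post - pre for pre, post in zip(before, after)]
    return mean(changes) / stdev(changes)


# Hypothetical ARAT totals for six clients (illustrative only).
admission = [10, 22, 5, 30, 41, 18]
discharge = [18, 25, 15, 38, 45, 20]

srm = standardized_response_mean(admission, discharge)
print(round(srm, 2))
```

Because the denominator is the spread of the change scores rather than of the baseline scores, the SRM is large when most clients change by a similar amount, even if the baseline scores vary widely.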

Lin, Chuang, Wu, Hsieh and Chang (2010) evaluated the responsiveness of the ARAT, the Box and Block Test (BBT), and the Nine-Hole Peg Test (NHPT) for evaluating hand dexterity in 59 patients with subacute stroke (< 6 months) and Brunnstrom stage IV to VI for proximal and distal upper extremity function. Patients were randomly assigned to receive constraint-induced therapy, bilateral arm training or control treatment, and received 2 hours of therapy, 5 days per week for 3 weeks. Assessments were performed at baseline and at 3 weeks. Using the Standardized Response Mean (SRM) to calculate responsiveness, the ARAT, BBT and NHPT were all found to have a moderate SRM (0.79, 0.74 and 0.64, respectively), indicating sensitivity for detecting change in hand dexterity. Considering both the responsiveness and validation results of the study, the ARAT and BBT are believed to be more appropriate than the NHPT for evaluating dexterity.

References

Adams, R.J., Meador, K.J., Sethi, K.D., Grotta, J.C., & Thomson, D.S. (1986). Graded neurologic scale for the use in acute hemispheric stroke treatment protocols. Stroke, 18, 665-669.
Ashworth, B. (1964). Preliminary trial of carisoprodol in multiple sclerosis. Practitioner, 192, 540-542.
Brott, T. G., Adams, H. P., Olinger, C. P., Marler, J. R., Barsan, W. G., Biller, J., Spilker, J., Holleran, R., Eberle, R., Hertzberg, V., Rorick, M., Moomaw, C. J., & Walker, M. (1989). Measurements of acute cerebral infarction: a clinical examination scale. Stroke, 20, 864-870.
Carroll, D. (1965). A quantitative test of upper extremity function. Journal of Chronic Diseases, 18, 479-491.
Carr, J.H., Shepherd, R.B., Nordholm, L., & Lynne, D. (1985). Investigation of a new motor assessment scale for stroke patients. Physical Therapy, 65, 175-180.
Collin, C., Wade, D.T., Davies, S., & Horne, V. (1988). The Barthel ADL Index: a reliability study. International Disability Studies, 10, 61-63.
Cromwell, F.S (1965). Occupational therapists manual for basic skills assessment: primary prevocational evaluation. Pasadena, (CA): Fair Oaks Printing; 29-31.
Demeurisse, G., Demol, O., & Robaye, E. (1980). Motor evaluation in vascular hemiplegia. European Neurology, 19(6), 382-389.
De Weerdt, W.J.G., & Harrison, M.A. (1985). Measuring recovery of arm hand function in stroke patients: a comparison of the Brunnstrom-Fugl-Meyer test and the Action Research Arm Test. Physiotherapy Canada, 37, 65-70.
Finch, E., Brooks, D., Stratford, P.W., & Mayo, N.E. (2002). Physical Outcome Measures: A guide to enhance physical outcome measures. Ontario, Canada: Lippincott, Williams, & Wilkins.
Fugl-Meyer, A.R., Jääskö, L., Leyman, I., Olsson, S., & Steglind, S. (1975). The post-stroke hemiplegic patient 1. A method for evaluation of physical performance. Scandinavian Journal of Rehabilitation Medicine, 7, 13-31.
Gowland, C., Van-Hullenaar, S., Torresin, W., et al. (1995). Chedoke-McMaster Stroke Assessment: development, validation, and administration manual. Hamilton, (ON), Canada: School of Rehabilitation Science, McMaster University.
Heller, A., Wade, D.T., Wood, V.A., Sunderland, A., Hewer, R., & Ward, E. (1987). Arm function after stroke: measurement and recovery over the first three months. Journal of Neurology, Neurosurgery & Psychiatry, 50(6), 714-719.
Hsieh, C.L., Hsueh, I.P, Chiang, F., & Lin, P. (1998). Inter-rater reliability and validity of the action research arm test in stroke patients. Age and Ageing, 27, 107-113.
Hsueh, I.P, Lee, M.M., & Hsieh, C.L. (2002a). The action research arm test: Is it necessary for patients being tested to sit at a standardized table? Clinical Rehabilitation, 16, 382-388.
Hsueh, I.P. & Hsieh, C.L. (2002b). Responsiveness of two upper extremity function instruments for stroke inpatients receiving rehabilitation. Clinical Rehabilitation, 16, 617-624.
Jacobson-Sollerman, X & Sperling, Y. (1977). Grip function of the healthy hand in a standardized hand function test. A study of the Rancho Los Amigos test. Scandinavian Journal of Rehabilitation Medicine, 9(3), 123-129.
Keith, R.A, Granger, C.V., Hamilton, B.B., & Sherwin, F.S. (1987). The Functional Independence Measure: a new tool for rehabilitation. In: Eisenberg, M.G. & Grzesiak, R.C. (Ed.), Advances in clinical rehabilitation (pp. 6-18). New York: Springer Publishing Company.
Kellor, M., Frost, J., Silberberg, N., Iversen, I., & Cummings R. (1971). Hand strength and dexterity. American Journal of Occupational Therapy, 25, 77-83.
Lang, C.E., Wagner, J.M., Dromerick, A.W., & Edwards, D.F. (2006). Measurement of upper extremity function early after stroke: properties of the action research arm test. Archives of Physical Medicine and Rehabilitation, 87, 1605-1610.
Lin, K-C., Chuang, L-L., Wu, C-Y., Hsieh, Y-W. & Chang, W-Y. (2010). Responsiveness and validity of three dexterous function measures in stroke rehabilitation. Journal of Rehabilitation Research and Development, 47(6), 563-572.
Lindmark, B. & Hamrin, E. (1988). Evaluation of function capacity after stroke as a basis for active intervention: Presentation of a modified chart for motor capacity assessment and its reliability. Scandinavian Journal of Rehabilitation Medicine, 20, 103-109.
Lyle, R.C. (1981). A performance test for assessment of upper limb function in physical rehabilitation treatment and research. International Journal of Rehabilitation Research, 4, 483-492.
Mathiowetz, V., Volland, G., Kashman, N., & Weber, K. (1985a). Adult norms for the box and block test of manual dexterity. American Journal of Occupational Therapy, 39, 386-391.
Mathiowetz, V., Weber, K., Kashman, N., & Volland, G. (1985b). Adult norms for the nine hole peg test of finger dexterity. Occupational Therapy Journal of Research, 5, 24-33.
Nijland, R., van Wegen, E., Verbunt, J, van Wijk, R., van Kordelaar, J. & Kwakkel, G. (2010) A comparison of two validated tests for upper limb function after stroke: The Wolf Motor Function Test and the Action Research Arm Test. Journal of Rehabilitation Medicine, 42, 694-696.
Platz, T., Pinkowski, C., van Wijck, F., Kim, I.H., di Bella, P., & Johnson, G. (2005). Reliability and validity of arm function assessment with standardized guidelines for the Fugl-Meyer Test, Action Research Arm Test and Box and Block Test: a multicentre study. Clinical Rehabilitation, 19(4), 404-411.
Rabadi, M.H. & Rabadi, F.M. (2006). Comparison of the action research arm test and the Fugl-Meyer Assessment as measures of upper-extremity motor weakness after stroke. Archives of Physical Medicine and Rehabilitation, 87, 962-966.
Van der Lee, J.H., Beckerman, H., Lankhorst, G.J., & Bouter, L.M. (2001a). The responsiveness of the Action Research Arm Test and the Fugl-Meyer Assessment Scale in chronic stroke patients. Journal of Rehabilitation Medicine, 33, 110-113.
Van der Lee, J.H., Groot, V., Beckerman, H., Wagenaar, R.C., Lankhorst, G.J., & Bouter, L.M. (2001b). The intra-rater and interrater reliability of the action research arm test: a practical test of upper extremity function in patients with stroke. Archives of Physical Medicine and Rehabilitation, 82, 14-19.
Van der Lee, J.H, Roorda, L.D., & Lankhorst, G.J. (2002). Improving the Action Research Arm Test: a unidimensional hierarchical scale. Clinical Rehabilitation, 16, 646-653.
Yozbatiran, N., Der-Yeghiaian, L., & Cramer, S.C. (2008). A standardized approach to performing the action research arm test. Neurorehabilitation & Neural Repair, 22(1), 78-90.
Wagenaar, R.C., Meijer, O.G., van Wieringen, P.C., Kuik, D.J., Hazenberg, G.J., Lindeboom, J., et al. (1990). The functional recovery of stroke: a comparison between neuro-developmental treatment and the Brunnstrom method. Scandinavian Journal of Rehabilitation Medicine, 22, 1-8.

See the measure

How to obtain the Action Research Arm Test:

The ARAT can be obtained from the studies by Lyle (1981), Hsieh et al. (1998), Van der Lee et al. (2002), Rabadi & Rabadi (2006), and Yozbatiran et al. (2008), and from the website: http://www.aratest.eu/Index_english.htm. Standardized equipment can be purchased from the website: http://www.aratest.eu/ or from http://www.saliarehab.com/.