Characteristics of a Good Test


Introduction

Just as every human being has characteristics that distinguish them from other human beings, concrete and abstract things have characteristics that distinguish them from other concrete or abstract things. Classroom testing is a procedure for measuring the performance or achievement of students. This paper contains a detailed discussion of the characteristics of a good test. Five characteristics of a test, validity, reliability, practicality, objectivity, and interpretability, are discussed in detail.

Validity

Validity is the degree to which a test measures what it is intended to measure. It is always concerned with the specific use of the results and the appropriateness of the interpretation of test scores (Swain, Pradhan & Khatoi, 2000). For example, suppose a test intended to measure students' knowledge of the importance of energy asked them only to write the advantages of heat. The students might get good marks, yet the test would not measure their knowledge of the importance of the different types of energy. The test results, or the interpretation of the scores, would therefore have low validity.

According to Linn and Gronlund (2000), validity is the adequacy and appropriateness of the interpretations of test scores and of the uses of assessment results.

For example, consider two ten-page articles by the same writer on the facilities science provides for students, one written in 1982 and the other in 2002. As a description of the present situation, the article written in 2002 is more valid than the article written in 1982.

Nature of validity

Validity refers to the rightness of the interpretation of the results of a test for a given group of learners, not to the test itself. It is a matter of degree: it does not exist on an all-or-none basis, as totally valid or totally invalid, but is expressed in categories that specify degree, such as high validity, moderate validity, or low validity. No test is valid for all purposes. Validity is always specific to a particular interpretation of scores or use of results.

There are no separate types of validity, because validity is a unitary concept based on various kinds of evidence. Construct, content, and criterion-related relationships are considerations in finding the degree of validity of a test. Validity involves an overall evaluative judgment that requires an evaluation of how the results of a test have been interpreted and used. It also requires the types of evidence that are provided to justify the interpretation of scores and uses of results (Linn & Gronlund, 2000; Swain et al., 2000).

Functions of validity

The validity of test results and the interpretation of their scores performs various functions in testing and evaluation programs in educational institutions. Linn and Gronlund (2000) state the following functions of validity.

·         The validity of a test ensures the attainment of the objectives formulated by the tester for the test. For example, if a teacher wants to see the degree of understanding of the mechanics part of secondary school physics among the students, a valid test provides accurate information about the students' degree of understanding in that part of physics.
·         It identifies strengths and weaknesses among the students regarding mastery of the content taught during the teaching and learning process. In the mechanics example above, if the test results are valid, then the students' strengths and weaknesses in understanding the content of mechanics in physics will be found accurately.
·         The validity of a test helps the teacher to communicate a true picture of students' achievement during an academic session to parents and students, and to plan sound activities for enhancing students' achievement.
·         The validity of a test performs a key role in decision-making when predicting the professions and future careers of students.
·         The validity of data-collection instruments is one of the important characteristics that guarantee an effective teaching and learning process.
 
Types of validity evidence
Rational validity
a)      Face validity

If an evaluator of a test asks a question about the reasonableness of the items of the test with regard to the background of the testee, then he is interested in the face validity of the test. That is, how do the test items look in the light of the objective of the test (Taiwo, 1995)? According to Linn and Gronlund (2000),

Face validity refers to the appearance of the test. In evaluating face validity, the task to be performed by the learner is examined superficially to judge whether the test appears to be a reasonable measure. A test should look like an appropriate measure in order to obtain the cooperation of those who are taking it. Face validity should not be considered a substitute for a more rigorous evaluation of content definitions and sampling adequacy.
There is a clear distinction between making validity claims based on a rationale of content definitions and making claims based on face validity. For example, suppose a tester wants to test the skill of finding the area of a rectangle. He or she may ask students to find the area of an A4 sheet of paper, ask a shopkeeper to find the area of a rectangular piece of cloth, and ask a hockey player to find the area of the nearest hockey ground. In these three test items the idea is the same, finding the area of a rectangle, but each item is phrased for its group in its own context.


b)     Content validity

Content validity is one of the simplest ways for a test to have sufficient validity evidence. Content validity evidence is established by a thorough examination of the test items to see whether they match the instructional objectives of the tester. When the achievement of students is to be measured and the specification of the items to be included in the test is easy, the claim of content validity is easy; in personality tests and aptitude tests, however, content validity becomes problematic (Kubiszyne & Borich, 2003).

According to Linn and Gronlund (2000), content considerations for validity get first priority when an individual's performance is intended to describe a domain of tasks which the test is supposed to represent. For example, the tester may expect the students to be able to write the plurals of 300 singular nouns. The tester then selects a sample of 30 words, and if a student writes 70% of the plurals correctly, it is inferred that the student can write 70% of the plurals of the full list of 300 words correctly.

Thus the result can be generalized from a sample of items to the whole list of singular nouns. Content validity evidence is then the degree to which the test tasks provide a relevant and representative sample of the domain of tasks about which interpretations of test results are made. To ensure content validity evidence, the tester proceeds from what has been taught, to what is to be measured, then to what should be focused on in the test, and finally to a representative sample of relevant tasks.

Rational validation of a test

Analysis and comparison are the procedures used for content-related validation of a test. The domain of the test is scanned to find out the subject matter covered and the responses the pupils are expected to make to the content, and this is compared with the domain of the achievement to be measured. A numerical value is not required for the expression of content-related validation.

It is determined by analyzing the content and tasks given in the test and the domain of outcomes to be measured, and reviewing the degree of connection between them (Swain et al., 2000). The data from this analysis and comparison are expressed in a two-way chart called a table of specifications for the validation of a test (Linn & Gronlund, 2000).
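
As an illustration, a hypothetical table of specifications for a 20-item physics test might cross content areas against instructional objectives, with each cell giving the number of items. The content areas, objectives, and counts below are assumptions for illustration, not taken from the sources cited.

    Content area     Knowledge   Understanding   Application   Total
    Motion               2             3              2           7
    Force                2             2              2           6
    Energy               2             3              2           7
    Total                6             8              6          20
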
Criterion-related validity

A valued standard for measuring performance, other than the test itself, is known as a criterion. The use of a test to predict future performance, or to find out current status against a valued measure other than the test itself, is called criterion-related validation (Swain et al., 2000).

Predictive validity evidence

Linn and Gronlund (2000) assert that predictive validity evidence refers to the degree of adequacy of a test in predicting the future behavior of an individual. This kind of validity is particularly important in aptitude tests. For example, a scholastic aptitude test is used to decide who should be admitted where. The predictive validity evidence of a test is determined by administering the test to a group of subjects and then, after a period of time has passed, measuring the subjects on whatever the test is supposed to predict. The two sets of scores are then correlated using Pearson's r, and the resulting coefficient is called a predictive validity coefficient.
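
A minimal sketch of this computation in Python is given below; the aptitude scores and the later criterion scores (say, first-year grades) are hypothetical.

# A minimal sketch of computing a predictive validity coefficient.
# All data are hypothetical: aptitude-test scores at admission and
# criterion scores (e.g., grades) collected for the same students later.
from scipy.stats import pearsonr

aptitude = [55, 62, 70, 48, 80, 66, 59, 73, 50, 77]             # scores at admission
criterion = [2.1, 2.6, 3.0, 1.9, 3.6, 2.8, 2.4, 3.2, 2.0, 3.4]  # grades a year later

r, _ = pearsonr(aptitude, criterion)
print(f"Predictive validity coefficient: r = {r:.2f}")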

Concurrent validity

The degree to which a test estimates present status or performance, and thus the relationship between two measures taken concurrently, is called concurrent validity (Swain et al., 2000).

According to Kubiszyne and Borich (2003), concurrent validity evidence of a test is determined by administering two similar tests to a group of students at the same time, or within a very short period of time, and measuring the students' current performance on what the tests are supposed to measure. The two sets of scores are then correlated using Pearson's r, and the resulting coefficient is called a concurrent validity coefficient.

Presentation of the relationship of scores in criterion validity evidence  

The relationship between the scores of two concurrent tests is presented using an expectancy table, a simple table in which the scores of the two tests are arranged. Another way of communicating the relationship between the scores is a scatter plot, in which the scores are plotted on a graph (Linn & Gronlund, 2000).
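
A minimal sketch of building a simple expectancy table follows; the paired scores, the 10-point score bands, and the pass/fail outcome categories are assumptions for illustration.

# Tally how many students in each test-score band reached each outcome.
from collections import Counter

pairs = [(72, "pass"), (65, "pass"), (58, "fail"), (81, "pass"),
         (49, "fail"), (66, "pass"), (54, "fail"), (77, "pass")]

def band(score):
    # Group raw scores into 10-point intervals such as "70-79".
    low = (score // 10) * 10
    return f"{low}-{low + 9}"

table = Counter((band(score), outcome) for score, outcome in pairs)
for (score_band, outcome), count in sorted(table.items()):
    print(f"{score_band:>6}  {outcome:<5}  {count}")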

Construct validity 

A construct is a psychological quality that is assumed to exist in order to explain some aspect of behavior among individuals (Linn & Gronlund, 2000). For example, reasoning and problem-solving are constructs.

Construct validation is the process of determining the extent to which a particular test measures the psychological construct that the tester wants to measure. Construct validity is determined by defining the domain of tasks to be measured, analyzing the response processes required by the assessment tasks, comparing the scores of known groups, comparing scores before and after a particular learning experience, and correlating the scores using the Pearson product-moment correlation (Swain et al., 2000).
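
One of the listed procedures, comparing the scores of known groups, can be sketched as follows. The groups, the scores, and the use of a t-test (which the sources do not name) are assumptions for illustration.

# Compare a group expected to possess the construct (trained pupils)
# with one expected to lack it (untrained pupils); scores are hypothetical.
from scipy.stats import ttest_ind

trained = [32, 35, 30, 38, 33, 36]
untrained = [24, 27, 22, 29, 25, 26]

t_stat, p_value = ttest_ind(trained, untrained)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A clearly higher mean for the trained group is evidence that the
# test measures the intended construct.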

Factors affecting the validity of a test

The validity of a test is ensured by considering the factors that affect it. Unclear directions affect the validity of a test because the testee may not be able to understand how to respond to certain questions. The difficulty of the reading vocabulary and sentence structure, items that are too easy or too difficult, ambiguous statements in the test, inadequate time for test-taking, items inappropriate for measuring a particular outcome, and inappropriate arrangement of the items in the test are the factors to be considered to ensure the validity of a test (Linn & Gronlund, 2000).

Related to the administration of the test, unfair aid given to examinees who ask for help, cheating by pupils during testing, unreliable scoring of essay-type answers, insufficient time to complete the test, and adverse physical and psychological conditions at the time of testing are factors which can affect the validity of a test. Among factors related to the testee, the anxiety of the student, the physical and psychological state of the pupil, and response set, that is, a consistent tendency to follow a certain pattern in responding to items, affect the validity of the test (Linn & Gronlund, 2000).

Importance of validity

A test should measure what it is supposed to measure. To construct a valid test means to make sure that the results of the test are actual results. Teachers can plan effectively on the basis of valid test results and thus improve teaching and assessment. Validity also plays a key role when decisions are made on the basis of test results, giving parents surety that the decisions they make are the right ones.

For the administration of a school, a test needs to be valid to enhance school effectiveness and to improve the school by arranging the required training for staff. Researchers are concerned about obtaining accurate results from their research tools; when a test is used as a research tool, a valid test yields accurate information and thus contributes to providing quality education (Linn & Gronlund, 2000).

Strengths and weaknesses of validity of a test

From Kubiszyne and Borich (2003), the following strengths and weaknesses may be deduced.

Strengths

      Students' learning progress is measured in various dimensions by a valid testing process. A valid test provides rich information about students' learning from different angles, and that information can be used for the desired purposes.
      The results of a valid test are truly used for the desired purposes; accurate results from a valid test will be applied appropriately to the intended purpose.
      All the content areas are covered in testing when keeping a test valid; by making a table of specifications, a truly representative sample of the content is considered for inclusion in the test.
      To ensure the validation of a test, an appropriate sample of content is selected.

Weaknesses

      If the content sample is not properly selected, all the efforts of teachers, students, and parents will go in vain.
      Prediction about a student's future is a difficult decision to make on the basis of just two tests. If a decision made by prediction turns out to be wrong, the learner will suffer from it throughout their life.
      There are chances of validity being affected by extraneous factors.
      Measuring students with two tests over a period of time may be affected by the regression effect, maturation, and other factors.

Reliability

The characteristic of a test concerning the consistency with which it yields the same result in measuring whatever it does measure is called reliability (Swain et al., 2000). Taiwo (1995) defines reliability as the consistency of measurement, that is, how consistent test scores are from one measurement to another. For example, suppose students use a stopwatch to measure the time for 15 vibrations of a pendulum and take the reading twice or thrice. If the readings are consistent, they proceed further; the stopwatch provides reliable readings.

Nature of reliability

Reliability refers to the consistency of the results obtained with a test, not to the test itself: the results obtained by a tool or test are said to be reliable, not the tool or test. It refers to a particular interpretation of test scores; for example, a test score that is reliable over a period of time may not be reliable from one test to another equivalent test. Reliability is a statistical concept. To determine consistency, a test is administered once or more than once, and the consistency is measured in terms of relative shifts. Reliability is a necessary but not a sufficient condition for validity (Linn & Gronlund, 2000).

Functions of reliability

The reliability coefficient provides the most revealing statistical index of quality that is ordinarily available. Estimates of the reliability of tests provide essential information for judging their technical quality and motivate efforts to improve them. Reliability estimation determines how much of the variability in test scores is due to measurement error and how much is due to variability in true scores (Swain et al., 2000).

Methods of determining reliability
Test-Retest Reliability

The same test is administered twice to the same group to assess the consistency of test scores over a period of time. The correlation between the two sets of scores obtained from the test and the retest is then found using the Pearson product-moment r. Test-retest reliability is best suited to attributes that are stable over time, for example intelligence. Generally, reliability will be higher when little time has passed between the two administrations (Kubiszyne & Borich, 2003).

Equivalent /Parallel-Forms method

In the parallel-forms method of determining reliability, reliability is estimated by comparing two different tests that were created using the same content, difficulty, format, and length. The two tests are administered to the same group within a short interval of time, and the scores on the two tests are correlated. This correlation provides an index of equivalence. For example, in intermediate or secondary board examinations, two question papers for a particular subject are constructed and named paper A and paper B, and sometimes a paper C is prepared; these are equivalent-forms tests (Linn & Gronlund, 2000).

Internal Consistency method

The consistency of test results across items on the same test is determined in this method. Test items that measure the same construct are compared with each other to determine the test's internal consistency: if two questions are similar and designed to measure the same thing, the test-taker should answer both the same way, which would indicate that the test has internal consistency (Swain et al., 2000). Three methods of finding the internal consistency of a test, the split-half method, the Kuder-Richardson formula 21, and inter-rater reliability, are given below.

Split-half method

Linn and Gronlund (2000) share that the split-half method of determining internal consistency employs a single administration of an even-numbered test to a sample of pupils. The test is divided into two equivalent halves, with the even-numbered items (2, 4, 6, ...) in one half and the odd-numbered items (1, 3, 5, ...) in the other, and the scores on the two halves are correlated. The full-test reliability is then estimated with the Spearman-Brown formula, given below.
                                        r_full = 2 r_half / (1 + r_half)

                        where   r_full = reliability coefficient of the full test
                                r_half = correlation coefficient between the two half-test scores
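
A minimal sketch of the split-half procedure with the Spearman-Brown correction follows; the item-level scores (1 = correct, 0 = wrong) for five pupils on an eight-item test are hypothetical.

from scipy.stats import pearsonr

# Each row is one pupil's scores on items 1..8 (hypothetical data).
items = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

# Odd-numbered items (1, 3, 5, 7) form one half, even-numbered the other.
odd_half = [sum(row[0::2]) for row in items]
even_half = [sum(row[1::2]) for row in items]

r_half, _ = pearsonr(odd_half, even_half)
r_full = (2 * r_half) / (1 + r_half)  # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, full-test reliability = {r_full:.2f}")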

Kuder-Richardson formula 21 method

Linn and Gronlund (2000) state that this is another method of determining reliability using a single administration of a test. It is known to provide a conservative estimate of the split-half type of reliability. The procedure is based on the consistency of an individual's performance from item to item and on the standard deviation of the test, such that the reliability coefficient obtained denotes the internal consistency of the test, that is, the degree to which the items of the test measure a common attribute of the testee. The formula, a simpler form of the originators' formula 20, is expressed as,
                                        r = [n / (n - 1)] [1 - M_t(n - M_t) / (n s_t^2)]

                        where   n = number of items, M_t = mean score on the test,
                                s_t^2 = variance of the test scores
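
A minimal sketch of the KR-21 computation follows; the total scores for ten pupils on a 20-item test are hypothetical.

from statistics import mean, pvariance

n = 20                                             # number of items
scores = [14, 17, 11, 19, 15, 12, 16, 18, 10, 15]  # total scores (hypothetical)

m = mean(scores)        # M_t: mean score on the test
s2 = pvariance(scores)  # s_t^2: variance of the test scores

r = (n / (n - 1)) * (1 - (m * (n - m)) / (n * s2))
print(f"KR-21 reliability estimate: r = {r:.2f}")
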
Inter-rater Reliability

In this method, two or more independent judges score the test, and the scores are compared to determine the consistency of the raters' estimates. One way to test inter-rater reliability is to have each rater score each test; for example, each rater might score items on a scale from 1 to 10, and the correlation between the two sets of ratings is then found to determine the level of inter-rater reliability. Another means of testing inter-rater reliability is to have the raters determine which category each observation falls into, and then calculate the percentage of agreement between the raters. So, if the raters agree 8 times out of 10, the test has an 80% inter-rater reliability rate (Swain et al., 2000).
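
A minimal sketch of the percent-agreement approach follows; the two raters and their categorizations of the same ten responses are hypothetical.

# Count how often the two raters place a response in the same category.
rater_a = ["pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
print(f"Inter-rater agreement: {100 * agreements / len(rater_a):.0f}%")  # 8/10 -> 80%
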
Factors affecting reliability

Factors related to the test which affect its reliability are the length of the test, the content of the test, the characteristics of the test items, and the spread of scores. If the time for taking a test is too short, the reliability of the test will be affected. If the content of the test is not representative of the whole content to be tested, the reliability of the test will be reduced.

The greater the spread of the test scores, the higher the reliability of a test. Factors related to the testee which affect the reliability of a test are the heterogeneity of the group, the test-wiseness of the students, and the motivation of the students. The time limit of the test and any opportunity for cheating given to the students are factors related to the testing procedure that affect the reliability of the test (Linn & Gronlund, 2000).

Importance of reliability of a test

According to Swain et al. (2000), the following are the importance and limitations of reliability.
      Tests are used to make important decisions; results from reliable tests therefore contribute to effective decision-making.
      Individuals are grouped into many different categories on the basis of relatively small individual differences, e.g. in intelligence. Reliable test results thus help teachers to cater to individual differences in the classroom.
      Reliable tests provide an actual index of students' achievement.
      To ensure that the assessment system is useful and fair to students, the reliability of the results of a test needs to be enhanced.
 
 Limitations

      The content of a test is not the same over a period of time, so reliability reduces.
      During parallel tests the students may become fatigued, so the reliability of the test is reduced.
      Test items intended to measure a construct may not match the construct to be measured.
      In the test-retest method, if the interval is short, the students remain familiar with the items and answers.
      If the interval of time is longer, students' maturation and learning will affect the consistency of results between test and retest.

Comparison between validity and reliability

Ø  Validity and reliability are both concerned with the interpretation and use of test results, not with the test itself.
Ø  Validity and reliability are both specific to particular uses and interpretations of test scores.
Ø  Validity is the appropriateness of the interpretation of test scores, while reliability is the consistency between two sets of scores.
Ø  Every valid test is reliable, but every reliable test is not valid (Kubiszyne & Borich, 2003).

Practicality/usability

According to Rehman (2007), usability or practicality is another important characteristic of a good test. It deals with all the practical considerations that go into the decision to use a particular test. While constructing or selecting a test, these practical considerations must be taken into account. Rehman (2007) gives the following five practical considerations.

Ease of administration

The test should be easy to administer so that the tester can administer it without difficulty. For this purpose, the test should be simple and contain clear instructions, have a small number of subtests, and require an appropriate (not too long) administration time.

Time required for administration

Appropriate time should be provided to take the test; if the time is reduced, the reliability of the test will also be reduced. A safe procedure is to allocate as much time as the test requires to provide reliable and valid results. Between 20 and 60 minutes is a fairly good time for each individual score yielded by a published test.

Ease of interpretation and application

Another important aspect of the practicality of a test is the interpretation and application of its results. If the results are misinterpreted, that will be harmful to the students; if they are misapplied or not applied at all, the test is useless.

Availability of equivalent forms

Equivalent forms of a test help to verify test scores. Retesting with an equivalent form on the same domain of learning eliminates the factor of memory among the students. The availability of equivalent forms should be kept in mind while constructing or selecting a test.

Cost of testing

A test should be economical in terms of preparation, administration, and scoring.

Importance of the practicality of a test

Teachers, particularly untrained teachers, can easily administer tests that have been constructed or selected with practicality in mind. Parents can be informed of the right test results if practical considerations have been taken care of while constructing a test, and they will use these results in decision-making about their children. Economical tests save unnecessary expenses on stationery, print materials, photocopies, and so on. True interpretations of test scores will be used by the students in their own plans and decisions (Linn & Gronlund, 2000).

Limitations

      There are chances of untrained teachers giving wrong directions to students while constructing or administering tests.
      If the time for taking a test is reduced, the reliability of the test is reduced.
      There are chances of misinterpretation and incorrect scoring in the absence of uniform criteria.
      The cost of testing is sometimes given far more weight than it deserves; tests are relatively inexpensive, and cost should not be a major consideration (Swain et al., 2000).

Objectivity

The degree to which different scorers obtain the same results when scoring a test, without the influence of their biases or beliefs, is known as objectivity. Most standardized tests of aptitude and achievement are high in objectivity. In essay-type tests requiring judgmental scoring, different persons get different results, and even the same person can get different results at different times (Linn & Gronlund, 2000). For example, suppose a student writes an answer containing all the required information for a particular question, using different headings and subheadings, and two persons check that response. One person likes answers in headings and subheadings, and the other likes answers in essay form without headings.

The person who likes headings and subheadings will assign more marks, while the other will assign fewer marks; the test lacks objectivity. The objectivity of a test is determined by carefully studying the administration and scoring procedures to see where judgment is involved or bias may occur. Objective-type items such as true/false and multiple-choice are developed to overcome the lack of objectivity in tests. In essay-type tests, objectivity may be increased by careful phrasing of the questions and by a standard set of rules for scoring (Swain et al., 2000).

Uses/ importance of objectivity

Teachers can judge and improve their own teaching and learning process by finding the real strengths and weaknesses of the learners. Scorers who are objective in scoring reach a consensus about the performance or achievement of a student in a particular area of content. Parents, looking at the true results of tests that have been scored objectively, may arrange further improvement for their children if extra input is needed. In the case of objective-type tests, the administration can use trained clerks and machines to score the tests (Swain et al., 2000).

Factors affecting objectivity

The beliefs and biases of the scorers, and the scoring style of a scorer, influence the scoring process and thus affect the objectivity of a test. Ambiguous directions in tests and the unavailability of sound criteria for scoring a test also affect the scoring, which leads to a threat to objectivity. Scoring of tests by untrained teachers also affects objectivity (Linn & Gronlund, 2000).

Strengths

      Objectivity reduces the bias of a scorer in the test results.
      The reliability of test scores is ensured.
      The scoring of essay-type tests is improved.
      Clear instructions are given on how to score the responses to items, and other related topics are shared during the scoring of the test.
      Scorers are given training to score and interpret the test results (Rehman, 2007).

Limitations

Objectivity is lacking in teacher-made tests, particularly when untrained teachers score them. Whatever measures are taken, essay-type tests still lack the objectivity of objective-type tests, so students with poor writing skills will suffer. If scoring is done by clerks, the professionalism of the teacher is challenged (Swain et al., 2000).

Interpretability

Linn and Gronlund (2000) define interpretability as the degree to which the scores of a test can be assigned a meaning, based on a criterion or a norm, for a particular purpose. The raw score is simply the number of points received on a test when the test has been scored according to the directions. For example, if a student X answered 25 items correctly on an arithmetic test, then student X has a raw score of 25. Making a raw score meaningful, by converting it into a description of the specific tasks that the student can perform, is the process of interpretation.

Criterion-referenced and standard-based interpretation

A test aimed at a specific kind of domain is suited to criterion-referenced and standards-based interpretations. Criterion-referenced and standards-based interpretation of test results is most meaningful when the test has been designed for this purpose. This involves designing a test that measures an achievement domain which is homogeneous, delimited, and clearly specified, with enough items for each interpretation, items that are neither too easy nor too difficult, items not only of the selection type but of all other types, and items directly relevant to the objectives (Linn & Gronlund, 2000).

Norm-referenced interpretation

Swain et al. (2000) assert that norm-referenced interpretation tells us how an individual compares with other persons who have taken the same test. The simplest way of comparison in the classroom is to rank the scores from highest to lowest and see where an individual's score falls. For a more meaningful and well-defined interpretation, raw scores are converted into derived scores, which are numerical reports of test performance on a score scale.
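
A minimal sketch of converting raw scores into two common derived scores, z-scores and percentile ranks, follows; the raw scores are hypothetical.

from statistics import mean, pstdev

raw = [25, 31, 28, 22, 35, 27, 30, 24, 29, 33]  # hypothetical raw scores
m, sd = mean(raw), pstdev(raw)

for score in sorted(set(raw), reverse=True):
    z = (score - m) / sd                                # distance from the mean in SD units
    pr = 100 * sum(s < score for s in raw) / len(raw)   # percentile rank
    print(f"raw {score}: z = {z:+.2f}, percentile rank = {pr:.0f}")
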
Uses of interpretability

Teachers keep records of the students over time and improve their instruction by interpreting the scores of a test. Students can see their level of performance relative to their classmates by looking at the interpretation of their test scores. Parents easily understand the actual performance of their children and decide what to do and what not to do. The administration uses the interpretation of test scores to present the position of the school in terms of students' learning. Researchers make inferences by interpreting the scores of tests used as data-collection tools (Linn & Gronlund, 2000).

Strengths

Swain et al. (2000) share the following strengths and weaknesses of the interpretability of test scores.
      More information can be presented to the audience through a small number of illustrations.
      Students' achievements can be expressed qualitatively as well as in numerical values.
      Students are measured relative to the average of the group.
      Tables of norms are already given, so interpretation becomes easy by looking at the tables.

Weaknesses

      If the tasks are not selected appropriately for the domain being measured, the scores will be misinterpreted.
      A large number of items is needed to ensure a correct interpretation, which takes time to calculate.
      If item analysis is not done properly, which means easy items are included, then low achievers will not know what they can or cannot do.
      Norms are generalized for all students from a pilot test, without taking care of individual differences across educational settings.

Conclusion

There are basically two major characteristics of a good test, validity and reliability, but in the absence of the other characteristics, practicality, objectivity, and interpretability, validity and reliability may not be ensured. All five characteristics are interconnected: if one is missing from a test, the others will definitely be affected. Teachers need to know about all these characteristics and incorporate them into their testing programs to ensure that effective teaching and learning take place in the classroom.
  
References

Kubiszyne, T., & Borich, G. (2003). Educational testing and measurement: Classroom application and practice (7th ed.). New York: John Wiley & Sons.

Linn, R. L., & Gronlund, N. E. (2000). Measurement and assessment in teaching (8th ed.). Delhi: Pearson Education.

Rehman, A. (2007). Development and validation of objective test items analysis in the subject physics for class IX in Rawalpindi city. Retrieved May 12, 2009, from International Islamic University, Department of Education Web site: http://eprints.hec.gov.pk/2518/1/2455.htm

Swain, S. K., Pradhan, C., & Khotoi, S. P. K. (2000). Educational measurement: Statistics and guidance. Ludhiana: Kalyani.

Taiwo, A. A. (1995). Fundamentals of classroom testing. New Delhi: Vikas Publishing House.
