Introduction
Just as every human being has characteristics that distinguish them from other human beings, concrete and abstract things have characteristics that distinguish them from other concrete or abstract things. Classroom testing is a procedure for measuring the performance or achievement of students. This paper contains a detailed discussion of the characteristics of a good test. The five characteristics of validity, reliability, practicality, objectivity, and interpretability are described in detail.
Validity
Validity is the degree to which a test measures what it is intended to measure. It is always concerned with the specific use of the results and the appropriateness of the interpretation of test scores (Swain, Pradhan & Khatoi, 2000). For example, in a test intended to measure students' knowledge of the importance of different types of energy, students were asked only to write the advantages of heat. The students got good marks, but the test did not measure their knowledge of the importance of the different types of energy; the test results, or the interpretation of the scores, therefore had low validity.
According to Linn and Gronlund (2000), validity is the adequacy and appropriateness of the interpretation of test scores and the uses of assessment results. For example, consider two ten-page articles by the same writer on the facilities science provides for students, one written in 1982 and another in 2002. In this case, the article written in 2002 is more valid than the article written in 1982.
Nature of validity
Validity refers to the appropriateness of the interpretation of the results of a test for a given group of learners, not to the test itself. It is a matter of degree: it does not exist on an all-or-none basis, totally valid or totally invalid, but is expressed in categories that specify degree, such as high, moderate, or low validity. No test is valid for all purposes. Validity is always specific to a particular interpretation of scores or use of results.
There are no separate types of validity; it is a unitary concept based on various kinds of evidence. Construct, content, and criterion-related relationships are considerations in determining the degree of validity of a test. Validity involves an overall evaluative judgment that requires an evaluation of how the results of a test have been interpreted and used, together with the types of evidence provided to justify those interpretations and uses (Linn & Gronlund, 2000; Swain et al., 2000).
Functions of validity
The validity of test results and the interpretation of their scores perform various functions in testing and evaluation programs in educational institutions. Linn and Gronlund (2000) state the following functions of validity.
· Validity of a test ensures the attainment of the objectives formulated by the tester for the test. For example, if a teacher wants to see the degree of understanding of the mechanics portion of secondary school physics among the students, a valid test provides accurate information about the students' degree of understanding of that portion.
· It identifies strengths and weaknesses among the students regarding mastery of the content taught during the teaching and learning process. In the mechanics example above, if the test results are valid, then students' strengths and weaknesses in understanding the mechanics content of physics will be identified appropriately.
· Validity of a test helps the teacher communicate a true picture of students' achievement during an academic session to parents and students, and plan sound activities for enhancing students' achievement.
· Validity of a test plays a key role in decision-making when truly predicting the professions and future careers of students.
· Validity of data collection instruments is one of the important characteristics that guarantee an effective teaching and learning process.
Types of validity evidence
Rational validity
a) Face validity
If an evaluator of a test asks whether the items of the test are reasonable given the background of the testee, then he or she is interested in the face validity of the test: what the test items look like in light of the objective of the test (Taiwo, 1995). According to Linn and Gronlund (2000), face validity refers to the appearance of the test. In evaluating face validity, the tasks to be performed by the learner are examined superficially to confirm that the test appears to be a reasonable measure. A test should look like an appropriate measure in order to obtain the cooperation of those who are taking it. Face validity should not be considered a substitute for a more rigorous evaluation of content definitions and sampling adequacy; there is a clear distinction between making validity claims based on a rationale of content definitions and making claims based on face validity. For example, to test the skill of finding the area of a rectangle, the tester may ask students to find the area of a sheet of A4 paper, ask a shopkeeper to find the area of a rectangular piece of cloth, and ask a hockey player to find the area of the nearest hockey ground. In these three test items the idea is the same, finding the area of a rectangle, but each is phrased in the context of its group.
b) Content validity
Content validity is one of the simplest forms of validity evidence for a test to have. Content validity evidence is established by a thorough examination of whether the test items match the instructional objectives of the tester. When student achievement is to be measured and the specification of the items to be included in the test is easy, the content validity claim is easy to make. In personality tests and aptitude tests, content validity becomes problematic (Kubiszyne & Borich, 2003).
According to Linn and Gronlund (2000), content considerations for validity get first priority when an individual's performance is intended to describe the domain of tasks that the test is supposed to represent. For example, the tester may expect the students to write the plurals of 300 singular nouns; the tester selects a sample of 30 words, and if a student writes 70% of the plurals correctly, it is inferred that the student can write 70% of the plurals of the full 300 words correctly. Performance on the sample of items is thus generalized to the whole list of singular nouns. Content validity evidence is then the degree to which the test tasks provide a relevant and representative sample of the domain of tasks about which interpretations of test results are made. To ensure content validity evidence, testers proceed from what has been taught to what is to be measured, then to what should be the focus of the test, and finally to a representative sample of relevant tasks.
Rational validation of a test
Analysis and comparison are the procedures used for content-related validation of a test. The test is scanned to find out what subject-matter content it covers and the responses through which achievement is to be measured, and this is compared with the domain of content and outcomes the pupils are expected to master. A numerical value is not required to express content-related validation. It is determined by analyzing the content and tasks given in the test and the domain of outcomes to be measured, and by reviewing the degree of connection between them (Swain et al., 2000). The data from this analysis and comparison are expressed in a two-way chart called a table of specifications (Linn & Gronlund, 2000).
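As an illustration, a table of specifications crosses content areas against the objectives to be measured, with the number of items planned for each cell. The subject, topics, and item counts below are hypothetical:

Content area        Knowledge   Understanding   Application   Total
Force and motion        2             3              3           8
Work and energy         2             2              2           6
Simple machines         2             2              2           6
Total                   6             7              7          20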
Criterion-related validity
A valued standard for measuring performance other than the test itself is known as a criterion. The use of a test to predict future performance, or to determine current status against a valued measure other than the test itself, is called criterion-related validation (Swain et al., 2000).
Predictive validity evidence
Linn and Gronlund (2000) assert that predictive validity evidence refers to the degree of adequacy of a test in predicting the future behavior of an individual. This kind of validity is particularly important in aptitude tests; for example, a scholastic aptitude test is used to decide who should be admitted where. The predictive validity evidence of a test is determined by administering the test to a group of subjects, then measuring the subjects on whatever the test is supposed to predict after some time has passed. The two sets of scores are then correlated using Pearson's r, and the resulting coefficient is called a predictive validity coefficient.
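As a minimal sketch, the predictive validity coefficient is simply the Pearson correlation between the test scores and the later criterion scores. The scores below are hypothetical, for illustration only:

```python
# Pearson correlation between aptitude-test scores and a criterion
# measured some time later (hypothetical data).
from scipy.stats import pearsonr

aptitude_scores = [52, 61, 45, 70, 58, 66, 49, 73]            # test now
criterion_scores = [2.4, 3.1, 2.2, 3.6, 2.9, 3.3, 2.5, 3.7]   # e.g., later GPA

r, _ = pearsonr(aptitude_scores, criterion_scores)
print(f"Predictive validity coefficient: r = {r:.2f}")
```

The same computation, applied to two measures taken at nearly the same time, yields the concurrent validity coefficient described next.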
Concurrent validity
The degree to which a test estimates present status or performance, and thus the relationship between two measures taken concurrently, is called concurrent validity (Swain et al., 2000). According to Kubiszyne and Borich (2003), concurrent validity evidence of a test is determined by administering two similar tests at the same time, or within a very short period, to a group of students. The students' performance on what the test is supposed to measure is thus captured at the same point in time. The two sets of scores are then correlated using Pearson's r, and the coefficient is called a concurrent validity coefficient.
Presentation of the relationship of scores in criterion validity evidence
The relationship between the scores of two concurrent tests is presented using an expectancy table, a simple two-way table in which the scores of the two tests are arranged. Another way of communicating the relationship between the scores is a scatter plot, in which the score pairs are plotted on a graph (Linn & Gronlund, 2000).
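A minimal sketch of an expectancy table, assuming pandas is available: bin the scores on each measure and cross-tabulate the bins. The data and bin edges are hypothetical:

```python
# Build an expectancy table by binning two sets of scores and
# cross-tabulating the bins (hypothetical data).
import pandas as pd

test_scores = [52, 61, 45, 70, 58, 66, 49, 73]
criterion_scores = [55, 64, 50, 72, 60, 63, 47, 75]

bins = [0, 50, 60, 70, 100]
labels = ["<=50", "51-60", "61-70", ">70"]

table = pd.crosstab(
    pd.cut(test_scores, bins=bins, labels=labels),
    pd.cut(criterion_scores, bins=bins, labels=labels),
    rownames=["Test score"],
    colnames=["Criterion score"],
)
print(table)
```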
Construct validity
A construct is a psychological quality that is assumed to exist in order to explain some aspect of behavior among individuals (Linn & Gronlund, 2000). Reasoning and problem-solving, for example, are constructs. Construct validation is the process of determining the extent to which a particular test measures the psychological construct that the tester wants to measure. Construct validity is determined by defining the domain of tasks to be measured, analyzing the response processes required by the assessment tasks, comparing the scores of known groups, comparing scores before and after a particular learning experience, and correlating the scores with other measures using the Pearson product-moment correlation (Swain et al., 2000).
Factors affecting the validity of a test
The validity of a test is ensured by attending to the factors within the test that affect it. Unclear directions affect validity because the testee may not understand how to respond to certain questions. The difficulty of the reading vocabulary and sentence structure, items that are too easy or too difficult, ambiguous statements in the test, inadequate time for taking the test, items inappropriate for measuring a particular outcome, and inappropriate arrangement of test items are the factors to be considered to ensure the validity of a test (Linn & Gronlund, 2000).
Related to the administration of the test, unfair aid to examinees who ask for help, cheating by pupils during testing, unreliable scoring of essay-type answers, insufficient time to complete the test, and adverse physical and psychological conditions at the time of testing are factors that can affect the validity of a test. Factors related to the testee include the anxiety of the student, the physical and psychological state of the pupil, and response set, a consistent tendency to follow a certain pattern in responding to the items (Linn & Gronlund, 2000).
Importance of validity
A test should contain what it is supposed to measure; to construct a valid test means to make sure that its results reflect actual achievement. Teachers can plan effectively based on valid test results, thus improving teaching and assessment. When decisions are made on the basis of test results, the validity of the test plays a key role in assuring parents that the right decision has been made. For the administration of a school, a test needs to be valid to enhance school effectiveness and to improve the school by arranging required training for the staff. Researchers are concerned with accurate results from their research tools; when a test is used as a research tool, a valid test yields accurate information and thus contributes to providing quality education (Linn & Gronlund, 2000).
Strengths and weaknesses of the validity of a test
From Kubiszyne and Borich (2003), the following strengths and weaknesses may be deduced.
Strengths
• A valid testing process measures students' learning progress in various dimensions. If a test is valid, it provides rich information about students' learning from different angles, and that information can be used for the desired purposes.
• The results of a valid test can be trusted for their intended purposes. Accurate results from a valid test will be used appropriately for the intended purpose.
• All the content areas are covered in testing, which helps to maintain the validity of a test. By making a table of specifications, truly representative content is considered for inclusion in the test.
• To ensure validation of the test, an appropriate sample of content is selected.
Weaknesses
• If the content sample is not properly selected, all the efforts of teachers, students, and parents will go in vain.
• Prediction about a student's future is a difficult decision to make on the basis of just two tests. If a decision made by prediction is wrong, the learner may suffer from that wrong decision throughout life.
• There are chances that validity will be affected by extraneous factors.
• Measuring students with two tests over a period of time may be affected by the regression effect, maturation, and other factors.
Reliability
The characteristic of a test concerning the consistency with which it yields the same result in measuring whatever it does measure is called reliability (Swain et al., 2000). Taiwo (1995) defines reliability as the consistency of measurement, that is, how consistent test scores are from one measurement to another. For example, students use a stopwatch to measure the time for 15 vibrations of a pendulum, taking the reading twice or thrice. If the readings are consistent, they proceed further: the stopwatch provides reliable readings.
Nature of reliability
Reliability refers to the consistency of the results obtained with a test, not to the test itself: the results obtained by a tool or test are said to be reliable, not the tool or test. It refers to a particular interpretation of test scores; for example, a test score that is consistent over a period of time may not be consistent from one test to another equivalent test. Reliability is a statistical concept. To determine consistency, a test is administered once or more than once, and the consistency is measured in terms of relative shifts in scores. Reliability is a necessary but not sufficient condition for validity (Linn & Gronlund, 2000).
Functions of reliability
The reliability coefficient provides the most revealing statistical index of quality that is ordinarily available. Estimates of the reliability of tests provide essential information for judging technical quality and motivate efforts to improve the tests. Reliability estimation determines how much of the variability in test scores is due to measurement error and how much is due to variability in true scores (Swain et al., 2000).
Methods of determining reliability
Test-Retest Reliability
The same test is administered twice to the same group to assess the consistency of test scores over a period of time. The correlation between the two sets of scores obtained from the test and the retest is then found using the Pearson product-moment r. Test-retest reliability is best used for attributes that are stable over time, such as intelligence. Generally, reliability will be higher when little time has passed between the two administrations (Kubiszyne & Borich, 2003).
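A minimal sketch of the computation, with hypothetical scores from two administrations of the same test:

```python
# Test-retest reliability: correlate scores from two administrations
# of the same test to the same group (hypothetical data).
from scipy.stats import pearsonr

first_administration = [14, 18, 11, 20, 16, 13, 19, 15]
second_administration = [15, 17, 12, 19, 16, 14, 20, 14]

r, _ = pearsonr(first_administration, second_administration)
print(f"Test-retest reliability: r = {r:.2f}")
```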
Equivalent/Parallel-Forms method
In the parallel-forms method, reliability is estimated by comparing two different tests that were created with the same content, difficulty, format, and length. The two tests are administered to the same group within a short interval of time, and the scores on the two tests are correlated. This correlation provides an index of equivalence. For example, in intermediate or secondary board examinations, two question papers for a particular subject are constructed and named paper A and paper B (sometimes a paper C is prepared as well); these are equivalent forms of a test (Linn & Gronlund, 2000).
Internal Consistency method
In this method, the consistency of test results across items on the same test is determined. Test items measuring the same construct are compared with each other to determine the test's internal consistency. When questions are similar and designed to measure the same thing, a test taker should answer them in the same way, which indicates that the test has internal consistency (Swain et al., 2000). Three methods for finding the internal consistency of a test, known as the split-half method, the Kuder-Richardson formula 21, and inter-rater reliability, are given below.
Split-half method
Linn and Gronlund (2000) share that the split-half method of determining internal consistency employs a single administration of a test with an even number of items to a sample of pupils. The test is divided into two equivalent halves, with the even-numbered items (2, 4, 6, …) in one half and the odd-numbered items (1, 3, 5, …) in the other. The scores on the two halves are then correlated, and the full-test reliability is estimated using the Spearman-Brown formula given below.
r_full = 2 × r_half / (1 + r_half)
where r_full = reliability coefficient of the full test and r_half = correlation coefficient between the two half-tests.
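A minimal sketch under these definitions, with hypothetical item scores (1 = correct, 0 = incorrect):

```python
# Split-half reliability with the Spearman-Brown correction
# (hypothetical item-level data; rows = pupils, columns = items).
from scipy.stats import pearsonr

item_scores = [
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0],
]

odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5
even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6

r_half, _ = pearsonr(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction
print(f"Half-test r = {r_half:.2f}, estimated full-test reliability = {r_full:.2f}")
```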
Kuder-Richardson formula 21 method
Linn and Gronlund (2000) state that this is another method of determining reliability using a single administration of a test. It is known to provide a conservative estimate of split-half reliability. The procedure is based on the consistency of an individual's performance from item to item and on the standard deviation of the test, such that the reliability coefficient obtained denotes the internal consistency of the test, that is, the degree to which the items of the test measure a common attribute of the testee. The formula, a simpler form of the originator's formula 20, is expressed as:
r = [n / (n − 1)] × [1 − Mt(n − Mt) / (n × s²t)]
where n = number of items, Mt = mean score on the test, and s²t = variance of the test scores.
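A minimal sketch of the formula applied to hypothetical total scores from a 20-item test:

```python
# Kuder-Richardson formula 21 from total scores on an n-item test
# (hypothetical data).
from statistics import mean, pvariance

n = 20                                    # number of items
scores = [6, 18, 9, 16, 11, 19, 14, 12]  # total score per pupil

m_t = mean(scores)        # mean total score (Mt)
s2_t = pvariance(scores)  # variance of total scores (s^2_t)

r = (n / (n - 1)) * (1 - (m_t * (n - m_t)) / (n * s2_t))
print(f"KR-21 reliability estimate: r = {r:.2f}")
```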
Inter-rater Reliability
In this method, two or more independent judges score the test, and the scores are then compared to determine the consistency of the raters' estimates. One way to test inter-rater reliability is to have each rater assign a score to each test; for example, each rater might score items on a scale from 1 to 10. The correlation between the two sets of ratings is then found to determine the level of inter-rater reliability. Another way is to have the raters determine which category each observation falls into and then calculate the percentage of agreement between the raters. So, if the raters agree 8 times out of 10, the test has an 80% inter-rater reliability rate (Swain et al., 2000).
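A minimal sketch of the percentage-of-agreement approach, with hypothetical category judgments:

```python
# Inter-rater agreement as percentage agreement between two raters'
# category judgments (hypothetical data).
rater_a = ["pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
percent = 100 * agreements / len(rater_a)
print(f"Inter-rater agreement: {percent:.0f}%")  # 8 of 10 -> 80%
```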
Factors affecting reliability
Factors related to the test that affect its reliability are the length of the test, the content of the test, the characteristics of the test items, and the spread of scores. If the time allowed for taking a test is short, the reliability of the test will be affected. If the content of the test is not representative of the whole content to be tested, the reliability of the test will be reduced. The greater the spread of test scores, the higher the reliability of the test. Factors related to the testee that affect the reliability of a test are heterogeneity of the group, test-wiseness of the students, and motivation of the students. The time limit for the test and cheating opportunities given to the students are factors related to the testing procedure that affect the reliability of the test (Linn & Gronlund, 2000).
Importance of reliability of a test
According to Swain et al. (2000), the following are the importance and limitations of reliability.
• Tests are used to make important decisions; therefore, the results from reliable tests contribute to making effective decisions.
• Individuals are grouped into many different categories based upon relatively small individual differences, e.g., in intelligence. Reliable test results thus help teachers cater to individual differences in the classroom.
• Reliable tests provide an actual index of students' achievement.
• To ensure the assessment system is useful and fair to students, the reliability of the results of a test needs to be enhanced.
Limitations
• The content of the test is not the same over a period of time; thus reliability is reduced.
• During parallel tests, the students may get fatigued, so the reliability of the test is reduced.
• Test items for measuring a construct may not match the construct to be measured.
• In the test-retest method, if the interval is short, the students remain familiar with the answers and items.
• If the interval of time is longer, students' maturation and learning will affect the consistency of results between test and retest.
Comparison between validity and reliability
Ø Validity and reliability are both concerned with the interpretation and use of test results, not the test itself.
Ø Validity and reliability are both specific to particular uses and interpretations of test scores.
Ø Validity is the appropriateness of the interpretation of the test scores, while reliability is the consistency between two sets of scores.
Ø Every valid test is reliable, but not every reliable test is valid (Kubiszyne & Borich, 2003).
Practicality/Usability
According to Rehman (2007), usability or practicality is another important characteristic of a good test. It deals with all the practical considerations that go into the decision to use a particular test; these must be taken into account while constructing or selecting a test. Rehman (2007) gives the following five practical considerations.
Ease of administration
The test should be easy to administer so that the tester can administer it without difficulty. For this purpose, it should be simple, contain clear instructions, have a small number of subtests, and require an appropriate (not too long) time for administration.
Time required for administration
Appropriate time should be provided for taking the test; if the time is reduced, the reliability of the test will also be reduced. A safe procedure is to allocate as much time as the test requires to provide reliable and valid results. Between 20 and 60 minutes is a fairly good time for each individual score yielded by a published test.
Ease of interpretation and application
Another important aspect of practicality is the interpretation and application of test results. If results are misinterpreted, they will be harmful to the students; if they are misapplied or not applied at all, the test is useless.
Availability of equivalent forms
Equivalent forms of a test help to verify test scores. Retesting on the same domain of learning with an equivalent form eliminates the factor of memory among the students. The availability of equivalent forms of the test should be kept in mind while constructing or selecting a test.
Cost of testing
A test should be economical in terms of preparation, administration, and scoring.
Importance of the practicality of a test
Teachers, particularly untrained teachers, can easily administer tests that have been constructed with practicality in mind. Parents can be informed of accurate test results, which they will use in decision-making about their children, if practical considerations have been taken care of while constructing the test. Economical tests save unnecessary expenses on stationery, print materials, photocopies, and so on. Correct interpretations of test scores will be used by the students in their own plans and decisions (Linn & Gronlund, 2000).
Limitations
• There are chances of untrained teachers giving wrong directions to students while constructing or administering the tests.
• If the time for taking a test is reduced, the reliability of the test is reduced.
• There are chances of misinterpretation and incorrect scoring in the absence of uniform criteria.
• The cost of testing is sometimes given far more weight than it deserves; tests are relatively inexpensive, and cost may not be a major consideration (Swain et al., 2000).
Objectivity
The degree to which different scorers obtain the same results from a test, without the influence of their biases or beliefs on scoring, is known as objectivity. Most standardized tests of aptitude and achievement are high in objectivity. In essay-type tests requiring judgmental scoring, different persons get different results, and even the same person can get different results at different times (Linn & Gronlund, 2000). For example, a student writes an answer containing all the required information for a particular question, organized under headings and subheadings. Two persons check that response: one likes answers organized under headings and subheadings, and the other prefers answers in essay form without headings. The person who likes the headings and subheadings will assign more marks, while the other will assign fewer; such a test lacks objectivity. The objectivity of a test is determined by carefully studying the administration and scoring procedures to see where judgment is required or bias may occur. Objective-type tests, such as true/false and multiple-choice, were developed to overcome the lack of objectivity. In essay-type tests, objectivity may be increased by careful phrasing of the questions and by a standard set of scoring rules (Swain et al., 2000).
Uses/importance of objectivity
Teachers can judge and improve their own teaching and learning process by finding the real strengths and weaknesses of the learners. Scorers of tests can reach a consensus about the performance or achievement of a student in a particular content area by being objective in scoring. Parents, looking at the true results of tests that have been scored objectively, may arrange further improvement for their children if extra input is needed. In the case of objective-type tests, the administration can use trained clerks and machines to score the tests (Swain et al., 2000).
Factors affecting objectivity
The beliefs and biases of the scorers, and their styles of scoring, influence the scoring process and thus affect the objectivity of a test. Ambiguous directions in tests and the unavailability of sound scoring criteria also affect the scoring, which threatens objectivity. Scoring of the tests by untrained teachers also affects objectivity (Linn & Gronlund, 2000).
Strengths
• Objectivity reduces the biases of a scorer in the test results.
• The reliability of test scores is ensured.
• The scoring of essay-type tests is improved.
• Clear instructions are given on how to score the responses to items, and other related matters are shared during the scoring of the test.
• Scorers are given training to score and interpret the test results (Rehman, 2007).
Limitations
Objectivity is lacking in teacher-made tests, particularly when untrained teachers score them. Whatever measures are taken, essay-type tests remain less objective than objective-type tests, so students with poor writing skills will suffer. If scoring is done by clerks, the professionalism of the teacher is challenged (Swain et al., 2000).
Interpretability
Linn and Gronlund (2000) define interpretability as the degree to which the scores of a test can be assigned meaning based on a criterion or a norm for a particular purpose. The raw score is simply the number of points received on a test when the test has been scored according to the directions. For example, if a student X answered 25 items correctly on an arithmetic test, student X has a raw score of 25. Converting a raw score into a description of the specific tasks that the student can perform is the process of interpretation.
Criterion-referenced and standards-based interpretation
A test directed at a specific kind of domain lends itself to criterion-referenced and standards-based interpretations. Such interpretations of test results are most meaningful when the test has been designed for this purpose. This involves designing a test that measures an achievement domain that is homogeneous, delimited, and clearly specified, with enough items for each interpretation, items that are neither too easy nor too difficult, items not only of the selection type but of other types as well, and items that are directly relevant to the objectives (Linn & Gronlund, 2000).
Norm-referenced interpretation
Swain et al. (2000) assert that norm-referenced interpretation tells us how an individual compares with other persons who have taken the same test. The simplest way of making this comparison in the classroom is a ranking from highest to lowest, showing where an individual's score falls. For more meaningful and well-defined interpretation, raw scores are converted into derived scores, which are numerical reports of test performance on a score scale.
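A minimal sketch of two common derived scores, the z-score and the percentile rank, computed from hypothetical raw scores for a class:

```python
# Convert raw scores to derived scores: z-scores and percentile
# ranks within the class (hypothetical data).
from statistics import mean, pstdev

raw_scores = [25, 31, 18, 40, 27, 35, 22, 29]
m, s = mean(raw_scores), pstdev(raw_scores)

for score in sorted(raw_scores, reverse=True):
    z = (score - m) / s                          # z-score
    below = sum(x < score for x in raw_scores)   # scores below this one
    percentile = 100 * below / len(raw_scores)   # percent scoring below
    print(f"raw {score:2d}  z = {z:+.2f}  percentile rank = {percentile:.0f}")
```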
Uses of interpretability
Teachers keep records of the students, save time, and improve their instruction by interpreting the scores of a test. Students can see their level of performance relative to their classmates by looking at the interpretation of their test scores. Parents can easily understand the actual performance of their children and decide what to do and what not to do. The administration uses the interpretation of test scores to present the position of the school in terms of student learning. Researchers make inferences by interpreting the scores of tests used as their data collection tools (Linn & Gronlund, 2000).
Strengths
Swain et al. (2000) share the following strengths and weaknesses of the interpretability of test scores.
• More information can be presented to the audience using a small number of illustrations.
• Students' achievements can be expressed qualitatively, not only as numerical values.
• Students are measured relative to the average of the group.
• Tables of norms are already given, so interpretation becomes easy by consulting the tables.
Weaknesses
• If the tasks selected are not appropriate to the domain being measured, the scores will be misinterpreted.
• A large number of items is needed to ensure a correct interpretation, and the calculations take time to carry out.
• If item analysis is not done properly, meaning easy items are included, then low achievers will not learn what they can or cannot do.
• Norms are generalized to all students from a pilot test, without taking care of individual differences in overall educational settings.
Conclusion
There are basically two major characteristics of a good test, validity and reliability, but in the absence of the other characteristics, practicality, objectivity, and interpretability, validity and reliability may not be ensured. All five characteristics are interconnected: if one is missing from a test, the others will certainly be affected. Teachers need to know all of these characteristics and incorporate them into their testing programs to ensure that effective teaching and learning take place in the classroom.
References
Kubiszyne, T., & Borich, G. (2003). Educational testing and measurement: Classroom application and practice (7th ed.). New York: John Wiley & Sons.
Linn, R. L., & Gronlund, N. E. (2000). Measurement and assessment in teaching (8th ed.). Delhi: Pearson Education.
Rehman, A. (2007). Development and validation of objective test items analysis in the subject physics for class IX in Rawalpindi city. Retrieved May 12, 2009, from International Islamic University, Department of Education Web site: http://eprints.hec.gov.pk/2518/1/2455.htm
Swain, S. K., Pradhan, C., & Khotoi, S. P. K. (2000). Educational measurement: Statistics and guidance. Ludhiana: Kalyani.
Taiwo, A. A. (1995). Fundamentals of classroom testing. New Delhi: Vikas Publishing House.