

Experimental Design: Definition and Types

By Jim Frost

What is Experimental Design?

An experimental design is a detailed plan for collecting and using data to identify causal relationships. Through careful planning, the design of experiments allows your data collection efforts to have a reasonable chance of detecting effects and testing hypotheses that answer your research questions.

An experiment is a data collection procedure that occurs in controlled conditions to identify and understand causal relationships between variables. Researchers can use many potential designs. The ultimate choice depends on their research question, resources, goals, and constraints. In some fields of study, researchers refer to experimental design as the design of experiments (DOE). Both terms are synonymous.


Ultimately, the design of experiments helps ensure that your procedures and data will evaluate your research question effectively. Without an experimental design, you might waste your efforts in a process that, for many potential reasons, can’t answer your research question. In short, it helps you trust your results.

Learn more about Independent and Dependent Variables.

Design of Experiments: Goals & Settings

Experiments occur in many settings, ranging from psychology, the social sciences, and medicine to physics, engineering, and the industrial and service sectors. Typically, experimental goals are to discover a previously unknown effect, confirm a known effect, or test a hypothesis.

Effects represent causal relationships between variables. For example, in a medical experiment, does the new medicine cause an improvement in health outcomes? If so, the medicine has a causal effect on the outcome.

An experimental design’s focus depends on the subject area and can include the following goals:

  • Understanding the relationships between variables.
  • Identifying the variables that have the largest impact on the outcomes.
  • Finding the input variable settings that produce an optimal result.

For example, psychologists have conducted experiments to understand how conformity affects decision-making. Sociologists have performed experiments to determine whether ethnicity affects the public reaction to staged bike thefts. These experiments map out the causal relationships between variables, and their primary goal is to understand the role of various factors.

Conversely, in a manufacturing environment, the researchers might use an experimental design to find the factors that most effectively improve their product’s strength, identify the optimal manufacturing settings, and do all that while accounting for various constraints. In short, a manufacturer’s goal is often to use experiments to improve their products cost-effectively.

In a medical experiment, the goal might be to quantify the medicine’s effect and find the optimum dosage.

Developing an Experimental Design

Developing an experimental design involves planning that maximizes the potential to collect data that is both trustworthy and able to detect causal relationships. Specifically, these studies aim to see effects when they exist in the population the researchers are studying, preferentially favor causal effects, isolate each factor’s true effect from potential confounders, and produce conclusions that you can generalize to the real world.

To accomplish these goals, experimental designs carefully manage data validity and reliability, and internal and external experimental validity. When your experiment is valid and reliable, you can expect your procedures and data to produce trustworthy results.

An excellent experimental design involves the following:

  • Lots of preplanning.
  • Developing experimental treatments.
  • Determining how to assign subjects to treatment groups.

The remainder of this article focuses on how experimental designs incorporate these essential items to accomplish their research goals.

Learn more about Data Reliability vs. Validity and Internal and External Experimental Validity.

Preplanning, Defining, and Operationalizing for Design of Experiments

A literature review is a crucial early step in the design of experiments.

This phase helps you identify critical variables, determine how to measure them while ensuring reliability and validity, and understand the relationships between them. The review can also help you find ways to reduce sources of variability, which increases your ability to detect treatment effects. Notably, the literature review allows you to learn how similar studies designed their experiments and the challenges they faced.

Operationalizing a study involves taking your research question, using the background information you gathered, and formulating an actionable plan.

This process should produce a specific and testable hypothesis using data that you can reasonably collect given the resources available to the experiment. For example, for a jumping exercise intervention intended to increase bone density:

  • Null hypothesis: The jumping exercise intervention does not affect bone density.
  • Alternative hypothesis: The jumping exercise intervention affects bone density.

To learn more about this early phase, read Five Steps for Conducting Scientific Studies with Statistical Analyses .

Formulating Treatments in Experimental Designs

In an experimental design, treatments are variables that the researchers control. They are the primary independent variables of interest. Researchers administer the treatment to the subjects or items in the experiment and want to know whether it causes changes in the outcome.

As the name implies, a treatment can be medical in nature, such as a new medicine or vaccine. But it’s a general term that applies to other things such as training programs, manufacturing settings, teaching methods, and types of fertilizers. I helped run an experiment where the treatment was a jumping exercise intervention that we hoped would increase bone density. All these treatment examples are things that potentially influence a measurable outcome.

Even when you know your treatment generally, you must carefully consider the amount. How large of a dose? If you’re comparing three different temperatures in a manufacturing process, how far apart are they? For my bone mineral density study, we had to determine how frequently the exercise sessions would occur and how long each lasted.

How you define the treatments in the design of experiments can affect your findings and the generalizability of your results.

Assigning Subjects to Experimental Groups

A crucial decision for all experimental designs is determining how researchers assign subjects to the experimental conditions—the treatment and control groups. The control group is often, but not always, the lack of a treatment. It serves as a basis for comparison by showing outcomes for subjects who don’t receive a treatment. Learn more about Control Groups.

How your experimental design assigns subjects to the groups affects how confident you can be that the findings represent true causal effects rather than mere correlation caused by confounders. Indeed, the assignment method influences how you control for confounding variables. This is the difference between correlation and causation.

Imagine a study finds that vitamin consumption correlates with better health outcomes. As a researcher, you want to be able to say that vitamin consumption causes the improvements. However, with the wrong experimental design, you might only be able to say there is an association. A confounder, and not the vitamins, might actually cause the health benefits.

Let’s explore some of the ways to assign subjects in design of experiments.

Completely Randomized Designs

A completely randomized experimental design randomly assigns all subjects to the treatment and control groups. You simply take each participant and use a random process to determine their group assignment. You can flip coins, roll a die, or use a computer. Randomized experiments must be prospective studies because they need to be able to control group assignment.
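To make this concrete, here’s a minimal sketch in Python (the subject labels are hypothetical):

```python
import random

# Hypothetical subject labels; any identifiers would work.
subjects = [f"subject_{i}" for i in range(1, 21)]

random.seed(42)           # seeded only so the example is reproducible
random.shuffle(subjects)  # the computer stands in for coin flips or dice

treatment_group = subjects[:10]  # first half of the shuffled list
control_group = subjects[10:]    # remaining half

print("Treatment:", treatment_group)
print("Control:", control_group)
```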

Random assignment in the design of experiments helps ensure that the groups are roughly equivalent at the beginning of the study. This equivalence at the start increases your confidence that any differences you see at the end were caused by the treatments. The randomization tends to equalize confounders between the experimental groups and, thereby, cancels out their effects, leaving only the treatment effects.

For example, in a vitamin study, the researchers can randomly assign participants to either the control or vitamin group. Because the groups are approximately equal when the experiment starts, if the health outcomes are different at the end of the study, the researchers can be confident that the vitamins caused those improvements.

Statisticians consider randomized experimental designs to be the best for identifying causal relationships.

If you can’t randomly assign subjects but want to draw causal conclusions about an intervention, consider using a quasi-experimental design.

Learn more about Randomized Controlled Trials and Random Assignment in Experiments.

Randomized Block Designs

Nuisance factors are variables that can affect the outcome, but they are not the researcher’s primary interest. Unfortunately, they can hide or distort the treatment results. When experimenters know about specific nuisance factors, they can use a randomized block design to minimize their impact.

This experimental design takes subjects with a shared “nuisance” characteristic and groups them into blocks. The participants in each block are then randomly assigned to the experimental groups. This process allows the experiment to control for known nuisance factors.

Blocking in the design of experiments reduces the impact of nuisance factors on experimental error. The analysis assesses the effects of the treatment within each block, which removes the variability between blocks. The result is that blocked experimental designs can reduce the impact of nuisance variables, increasing the ability to detect treatment effects accurately.

Suppose you’re testing various teaching methods. Because grade level likely affects educational outcomes, you might use grade level as a blocking factor. To use a randomized block design for this scenario, divide the participants by grade level and then randomly assign the members of each grade level to the experimental groups.
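As a rough sketch, the blocking-then-randomizing procedure might look like this in Python (the student names and grade levels are made up):

```python
import random
from collections import defaultdict

random.seed(1)

# Made-up participants, each with a grade level (the blocking factor).
participants = [(f"student_{i}", random.choice([3, 4, 5])) for i in range(1, 31)]

# Step 1: group participants into blocks by grade level.
blocks = defaultdict(list)
for name, grade in participants:
    blocks[grade].append(name)

# Step 2: within each block, randomly assign members to the teaching methods.
assignment = {}
for grade, members in blocks.items():
    random.shuffle(members)
    half = len(members) // 2
    for name in members[:half]:
        assignment[name] = "method_A"
    for name in members[half:]:
        assignment[name] = "method_B"

print(assignment)
```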

A standard guideline for an experimental design is to “Block what you can, randomize what you cannot.” Use blocking for a few primary nuisance factors. Then use random assignment to distribute the unblocked nuisance factors equally between the experimental conditions.

You can also use covariates to control nuisance factors. Learn about Covariates: Definition and Uses.

Observational Studies

In some experimental designs, randomly assigning subjects to the experimental conditions is impossible or unethical. The researchers simply can’t assign participants to the experimental groups. However, they can observe them in their natural groupings, measure the essential variables, and look for correlations. These observational studies are also known as quasi-experimental designs. Retrospective studies must be observational in nature because they look back at past events.

Imagine you’re studying the effects of depression on an activity. Clearly, you can’t randomly assign participants to the depression and control groups. But you can observe participants with and without depression and see how their task performance differs.

Observational studies let you perform research when you can’t control the treatment. However, quasi-experimental designs increase the problem of confounding variables. For this design of experiments, correlation does not necessarily imply causation. While special procedures can help control confounders in an observational study, you’re ultimately less confident that the results represent causal findings.

Learn more about Observational Studies.

For a good comparison, learn about the differences and tradeoffs between Observational Studies and Randomized Experiments.

Between-Subjects vs. Within-Subjects Experimental Designs

When you think of the design of experiments, you probably picture a treatment and control group. Researchers assign participants to only one of these groups, so each group contains entirely different subjects than the other groups. Analysts compare the groups at the end of the experiment. Statisticians refer to this method as a between-subjects, or independent measures, experimental design.

In a between-subjects design, you can have more than one treatment group, but each subject is exposed to only one condition: the control group or one of the treatment groups.

A potential downside to this approach is that differences between groups at the beginning can affect the results at the end. As you’ve read earlier, random assignment can reduce those differences, but it is imperfect. There will always be some variability between the groups.

In a within-subjects experimental design, also known as repeated measures, subjects experience all treatment conditions and are measured for each. Each subject acts as their own control, which reduces variability and increases the statistical power to detect effects.

In this experimental design, you minimize pre-existing differences between the experimental conditions because they all contain the same subjects. However, the order of treatments can affect the results. Beware of practice and fatigue effects. Learn more about Repeated Measures Designs.

Between-Subjects Design | Within-Subjects Design
Subjects are assigned to one experimental condition | Subjects participate in all experimental conditions
Requires more subjects | Requires fewer subjects
Differences between subjects in the groups can affect the results | Uses the same subjects in all conditions
No order-of-treatment effects | Order of treatments can affect results

Design of Experiments Examples

For example, a bone density study has three experimental groups—a control group, a stretching exercise group, and a jumping exercise group.

In a between-subjects experimental design, scientists randomly assign each participant to one of the three groups.

In a within-subjects design, all subjects experience the three conditions sequentially while the researchers measure bone density repeatedly. The procedure can switch the order of treatments for the participants to help reduce order effects.

Matched Pairs Experimental Design

A matched pairs experimental design is a between-subjects study that uses pairs of similar subjects. Researchers use this approach to reduce pre-existing differences between experimental groups. It’s yet another design of experiments method for reducing sources of variability.

Researchers identify variables likely to affect the outcome, such as demographics. When they pick a subject with a set of characteristics, they try to locate another participant with similar attributes to create a matched pair. Scientists randomly assign one member of a pair to the treatment group and the other to the control group.
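Here is a minimal sketch of the pairing and assignment steps in Python, using a single hypothetical matching variable (age) for simplicity:

```python
import random

random.seed(7)

# Hypothetical subjects with one matching variable (age).
subjects = [("s1", 23), ("s2", 45), ("s3", 24), ("s4", 44), ("s5", 31), ("s6", 30)]

# A simple way to form pairs: sort by the matching variable and pair neighbors.
subjects.sort(key=lambda s: s[1])
pairs = [subjects[i:i + 2] for i in range(0, len(subjects), 2)]

# Randomly assign one member of each pair to treatment, the other to control.
for a, b in pairs:
    treated, control = random.sample([a, b], 2)
    print(f"{treated[0]} -> treatment, {control[0]} -> control")
```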

On the plus side, this process creates two similar groups, and it doesn’t create treatment order effects. While a matched pairs design does not produce the perfectly matched groups of a within-subjects design (which uses the same subjects in all conditions), it aims to reduce variability between groups relative to a between-subjects study.

On the downside, finding matched pairs is very time-consuming. Additionally, if one member of a matched pair drops out, the other subject must leave the study too.

Learn more about Matched Pairs Design: Uses & Examples.

Another consideration is whether you’ll use a cross-sectional design (one point in time) or a longitudinal study to track changes over time.

A case study is a research method that often serves as a precursor to a more rigorous experimental design by identifying research questions, variables, and hypotheses to test. Learn more about What is a Case Study? Definition & Examples.

In conclusion, the design of experiments is extremely sensitive to subject area concerns and the time and resources available to the researchers. Developing a suitable experimental design requires balancing a multitude of considerations. A successful design is necessary to obtain trustworthy answers to your research question and to have a reasonable chance of detecting treatment effects when they exist.


Statistical Analysis of Experimental Data

By James W. Dally. From the Springer Handbook of Experimental Solid Mechanics (Springer Handbooks series).

Statistical methods are extremely important in engineering, because they provide a means for representing large amounts of data in a concise form that is easily interpreted and understood. Usually, the data are represented with a statistical distribution function that can be characterized by a measure of central tendency (the mean x̄) and a measure of dispersion (the standard deviation Sx). A normal or Gaussian probability distribution is by far the most commonly employed; however, in some cases, other distribution functions may have to be employed to adequately represent the data.

The most significant advantage resulting from the use of a probability distribution function in engineering applications is the ability to predict the occurrence of an event based on a relatively small sample. The effects of sampling error are accounted for by placing confidence limits on the predictions and establishing the associated confidence levels. Sampling error can be controlled if the sample size is adequate. Use of Student's t distribution function, which characterizes sampling error, provides a basis for determining sample size consistent with specified levels of confidence. Student's t distribution also permits a comparison to be made of two means to determine whether the observed difference is significant or whether it is due to random variation.
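For instance, the two-mean comparison described above might be carried out as in the following Python sketch (the measurements are illustrative, not taken from any experiment in this entry):

```python
from scipy import stats

# Illustrative measurements from two samples (made-up numbers).
sample_a = [10.1, 9.8, 10.4, 10.0, 9.9, 10.3]
sample_b = [10.6, 10.9, 10.5, 11.0, 10.7, 10.8]

# Two-sample t-test of the null hypothesis that both population means are equal.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```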

Statistical methods can also be employed to condition data and to eliminate an erroneous data point (an outlier) from a series of measurements. This is a useful technique that improves the data set and provides strong evidence when something unanticipated is affecting an experiment.

Regression analysis can be used effectively to interpret data when the behavior of one quantity y depends upon variations in one or more independent quantities x1, x2, ..., xn. Even though the functional relationship between quantities exhibiting variation remains unknown, it can be characterized statistically. Regression analysis provides a method to fit a straight line or a curve through a series of scattered data points on a graph. The adequacy of the regression analysis can be evaluated by determining a correlation coefficient. Methods for extending regression analysis to multivariate functions exist. In principle, these methods are identical to linear regression analysis; however, the analysis becomes much more complex. The increase in complexity is not a concern, because computer subroutines are available that solve the tedious equations and provide the results in a convenient format.
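A minimal linear-regression sketch in Python, with illustrative data, shows both the fitted line and the correlation coefficient used to judge the fit:

```python
from scipy import stats

# Illustrative (x, y) data with scatter about a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Least-squares fit of y = slope * x + intercept.
fit = stats.linregress(x, y)
print(f"slope = {fit.slope:.3f}, intercept = {fit.intercept:.3f}")
print(f"correlation coefficient r = {fit.rvalue:.4f}")
```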

Many probability functions are used in statistical analyses to represent data and predict population properties. Once a probability function has been selected to represent a population, any series of measurements can be subjected to a chi-squared (χ²) test to check the validity of the assumed function. Accurate predictions can be made only if the proper probability function has been selected.
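As a sketch, a chi-squared goodness-of-fit test in Python might look as follows (the counts are made up: 120 rolls of a die tested against a uniform distribution):

```python
from scipy import stats

# Made-up observed counts for the six faces in 120 rolls of a die.
observed = [18, 22, 16, 25, 19, 20]
expected = [20] * 6  # counts expected under the assumed (uniform) distribution

# Chi-squared test of how well the assumed distribution fits the observations.
chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.3f}")
```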

Finally, statistical methods for assessing error propagation are discussed. These methods provide a means for determining error in a quantity of interest y based on measurements of related quantities x1, x2, ..., xn and the functional relationship y = f(x1, x2, ..., xn).
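In the common first-order (Taylor-series) form, and assuming the errors in the measured quantities are independent, the propagated uncertainty described here is

```latex
S_y^2 \approx \sum_{i=1}^{n} \left( \frac{\partial f}{\partial x_i} \right)^2 S_{x_i}^2
```

where the partial derivatives are evaluated at the measured values.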


Experimental Design in Statistics w/ 11 Examples!


A proper experimental design is a critical skill in statistics.


Without proper controls and safeguards, unintended consequences can ruin our study and lead to wrong conclusions.

So let’s dive in to see what this is all about!

What’s the difference between an observational study and an experimental study?

An observational study is one in which investigators merely measure variables of interest without influencing the subjects.

And an experiment is a study in which investigators administer some form of treatment to one or more groups.

In other words, an observation is hands-off, whereas an experiment is hands-on.

So what’s the purpose of an experiment?

To establish causation (i.e., cause and effect).

All this means is that we wish to determine the effect an independent explanatory variable has on a dependent response variable.

The explanatory variable explains a response, similar to a child falling, skinning their knee, and starting to cry. The child is crying in response to falling and skinning their knee. So the explanatory variable is the fall, and the response variable is crying.

Figure: Explanatory vs. Response Variable in Everyday Life

Let’s look at another example. Suppose a medical journal describes two studies in which subjects who had a seizure were randomly assigned to two different treatments:

  • No treatment.
  • A high dose of vitamin C.

The subjects were observed for a year, and the number of seizures for each subject was recorded. Identify the explanatory variable (independent variable), response variable (dependent variable), and include the experimental units.

The explanatory variable is whether the subject received either no treatment or a high dose of vitamin C. The response variable is whether the subject had a seizure during the time of the study. The experimental units in this study are the subjects who recently had a seizure.

Okay, so using the example above, notice that one of the groups did not receive treatment. This group is called a control group and acts as a baseline to see how a new treatment differs from those who don’t receive treatment. Typically, the control group is given something called a placebo, a substance designed to resemble medicine but that does not contain an active drug component. A placebo is a dummy treatment and should not have a physical effect on a person.

Before we talk about the characteristics of a well-designed experiment, we need to discuss some things to look out for:

  • Confounding
  • Lurking variables

Confounding happens when two explanatory variables are both associated with a response variable and also associated with each other, making it impossible for the investigator to identify their separate effects on the response variable.

A lurking variable is usually unobserved at the time of the study but influences the association between the two variables of interest. In essence, a lurking variable is a third variable that is not measured in the study but may change the response variable.

For example, a study reported a relationship between smoking and health: 1,430 women were asked whether they smoked, and ten years later a follow-up survey recorded whether each woman was still alive or deceased. The researchers studied the possible link between whether a woman smoked and whether she survived the 10-year study period. They reported that:

  • 21% of the smokers died
  • 32% of the nonsmokers died

So, is smoking beneficial to your health, or is there something that could explain how this happened?

Older women are less likely to be smokers, and older women are more likely to die. Because age is a variable that influences the explanatory and response variable, it is considered a confounding variable.

But does smoking cause death?

Notice that the lurking variable, age, can also be a contributing factor. While there is a correlation between smoking and mortality, and also a correlation between smoking and age, we cannot be certain that smoking itself caused the difference in mortality rates among these women.

Figure: Lurking – Confounding – Correlation – Causation Diagram

Now, something important to point out is that a lurking variable is one that is not measured in the study that could influence the results. Using the example above, some other possible lurking variables are:

  • Stress Level.

These variables were not measured in the study but could influence smoking habits as well as mortality rates.

What is important to note about the difference between confounding and lurking variables is that a confounding variable is measured in a study, while a lurking variable is not.

Additionally, correlation does not imply causation!

Alright, so now it’s time to talk about blinding: single-blind, double-blind experiments, as well as the placebo effect.

A single-blind experiment is when the subjects are unaware of which treatment they are receiving, but the investigator measuring the responses knows what treatments are going to which subject. In other words, the researcher knows which individual gets the placebo and which ones receive the experimental treatment. One major pitfall for this type of design is that the researcher may consciously or unconsciously influence the subject since they know who is receiving treatment and who isn’t.

A double-blind experiment is when both the subjects and investigator do not know who receives the placebo and who receives the treatment. A double-blind model is considered the best model for clinical trials as it eliminates the possibility of bias on the part of the researcher and the possibility of producing a placebo effect from the subject.

The placebo effect is when a subject has an effect or response to a fake treatment because they “believe” that the result should occur, as noted by Yale. For example, a person struggling with insomnia takes a placebo (sugar pill) but instantly falls asleep because they believe they are receiving a sleep aid like Ambien or Lunesta.

Figure: Placebo Effect – Real Life Example

So, what are the three primary requirements for a well-designed experiment?

  • Control
  • Randomization
  • Replication

In a controlled experiment, the researchers, or investigators, decide which subjects are assigned to a control group and which subjects are assigned to a treatment group. In doing so, we ensure that the control and treatment groups are as similar as possible and limit possible confounding influences such as lurking variables. A replicated experiment that is repeated on many different subjects reduces the effect of chance variation on the results. And randomization means we randomly assign subjects to the control and treatment groups.

When subjects are divided into control groups and treatment groups randomly, we can use probability to predict the differences we expect to observe. If the differences between the two groups are higher than what we would expect to see naturally (by chance), we say that the results are statistically significant.

For example, if it is surmised that a new medicine reduces the duration of an illness from 72 hours to 71 hours, this would not be considered statistically significant. The difference between 72 hours and 71 hours is not substantial enough to support the claim that the observed effect was due to something other than normal random variation.
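To illustrate the idea of comparing an observed difference against what chance alone would produce, here is a small one-sided permutation-test sketch in Python (all numbers are made up):

```python
import random

random.seed(0)

# Made-up recovery times (hours) for two groups of eight subjects.
control = [72, 74, 71, 73, 72, 75, 70, 73]
treated = [70, 71, 69, 72, 70, 71, 68, 70]

observed = sum(control) / len(control) - sum(treated) / len(treated)

# How often does randomly shuffling the pooled values produce a difference
# at least as large as the observed one?
pooled = control + treated
count = 0
n_iter = 10_000
for _ in range(n_iter):
    random.shuffle(pooled)
    diff = sum(pooled[:8]) / 8 - sum(pooled[8:]) / 8
    if diff >= observed:
        count += 1

print(f"observed difference = {observed:.2f} h, p = {count / n_iter:.3f}")
```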

Now there are two major types of designs:

  • Completely-Randomized Design (CRD)
  • Block Design

A completely randomized design is the process of assigning subjects to control and treatment groups using probability, as seen in the flow diagram below.

Figure: Completely Randomized Design Example

A block design is a research method that places subjects into groups of similar experimental units or conditions, like age or gender, and then assigns subjects to control and treatment groups using probability, as shown below.

Figure: Randomized Block Design Example

Additionally, a useful special case of a blocking strategy is something called a matched-pair design. This is when two similar subjects are paired to control for lurking variables.

For example, imagine we want to study if walking daily improved blood pressure. If the blood pressure for five subjects is measured at the beginning of the study and then again after participating in a walking program for one month, then the observations would be considered dependent samples because the same five subjects are used in the before and after observations; thus, a matched-pair design.
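Because the same five subjects appear in the before and after measurements, a dependent-samples (paired) test is appropriate. A sketch in Python, with hypothetical blood pressure values:

```python
from scipy import stats

# Hypothetical systolic blood pressure (mmHg) for five subjects,
# before and after one month of daily walking.
before = [142, 150, 138, 145, 153]
after = [135, 144, 136, 139, 147]

# Paired t-test: each subject serves as their own comparison.
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```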

Please note that our video lesson will not focus on quasi-experiments. A quasi-experimental design lacks random assignment; because subjects are not randomly allocated before the independent variable is manipulated and the dependent variable is measured, confounding may result. For the sake of our lesson, and all future lessons, we will be using research methods where random sampling and experimental designs are used.

Together we will learn how to identify explanatory variables (independent variable) and response variables (dependent variables), understand and define confounding and lurking variables, see the effects of single-blind and double-blind experiments, and design randomized and block experiments.


Statistical Design and Analysis of Biological Experiments

Chapter 1: Principles of Experimental Design

1.1 Introduction

The validity of conclusions drawn from a statistical analysis crucially hinges on the manner in which the data are acquired, and even the most sophisticated analysis will not rescue a flawed experiment. Planning an experiment and thinking about the details of data acquisition is so important for a successful analysis that R. A. Fisher—who single-handedly invented many of the experimental design techniques we are about to discuss—famously wrote

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. (Fisher 1938)

(Statistical) design of experiments provides the principles and methods for planning experiments and tailoring the data acquisition to an intended analysis. Design and analysis of an experiment are best considered as two aspects of the same enterprise: the goals of the analysis strongly inform an appropriate design, and the implemented design determines the possible analyses.

The primary aim of designing experiments is to ensure that valid statistical and scientific conclusions can be drawn that withstand the scrutiny of a determined skeptic. Good experimental design also considers that resources are used efficiently, and that estimates are sufficiently precise and hypothesis tests adequately powered. It protects our conclusions by excluding alternative interpretations or rendering them implausible. Three main pillars of experimental design are randomization, replication, and blocking, and we will flesh out their effects on the subsequent analysis as well as their implementation in an experimental design.

An experimental design is always tailored towards predefined (primary) analyses and an efficient analysis and unambiguous interpretation of the experimental data is often straightforward from a good design. This does not prevent us from doing additional analyses of interesting observations after the data are acquired, but these analyses can be subjected to more severe criticisms and conclusions are more tentative.

In this chapter, we provide the wider context for using experiments in a larger research enterprise and informally introduce the main statistical ideas of experimental design. We use a comparison of two samples as our main example to study how design choices affect an analysis, but postpone a formal quantitative analysis to the next chapters.

1.2 A Cautionary Tale

For illustrating some of the issues arising in the interplay of experimental design and analysis, we consider a simple example. We are interested in comparing the enzyme levels measured in processed blood samples from laboratory mice, when the sample processing is done either with a kit from vendor A or a kit from competitor B. For this, we take 20 mice and randomly select 10 of them for sample preparation with kit A, while the blood samples of the remaining 10 mice are prepared with kit B. The experiment is illustrated in Figure 1.1 A and the resulting data are given in Table 1.1.

Table 1.1: Measured enzyme levels from samples of twenty mice. Samples of ten mice each were processed using a kit of vendor A and B, respectively.
A 8.96 8.95 11.37 12.63 11.38 8.36 6.87 12.35 10.32 11.99
B 12.68 11.37 12.00 9.81 10.35 11.76 9.01 10.83 8.76 9.99

One option for comparing the two kits is to look at the difference in average enzyme levels, and we find an average level of 10.32 for vendor A and 10.66 for vendor B. We would like to interpret their difference of -0.34 as the difference due to the two preparation kits and conclude whether the two kits give equal results or if measurements based on one kit are systematically different from those based on the other kit.
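These averages can be verified directly from Table 1.1, for example with a few lines of Python:

```python
# Enzyme levels from Table 1.1.
kit_a = [8.96, 8.95, 11.37, 12.63, 11.38, 8.36, 6.87, 12.35, 10.32, 11.99]
kit_b = [12.68, 11.37, 12.00, 9.81, 10.35, 11.76, 9.01, 10.83, 8.76, 9.99]

mean_a = sum(kit_a) / len(kit_a)
mean_b = sum(kit_b) / len(kit_b)
print(f"A: {mean_a:.2f}  B: {mean_b:.2f}  difference: {mean_a - mean_b:.2f}")
```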

Such interpretation, however, is only valid if the two groups of mice and their measurements are identical in all aspects except the sample preparation kit. If we use one strain of mice for kit A and another strain for kit B, any difference might also be attributed to inherent differences between the strains. Similarly, if the measurements using kit B were conducted much later than those using kit A, any observed difference might be attributed to changes in, e.g., mice selected, batches of chemicals used, device calibration, or any number of other influences. None of these competing explanations for an observed difference can be excluded from the given data alone, but good experimental design allows us to render them (almost) arbitrarily implausible.

A second aspect for our analysis is the inherent uncertainty in our calculated difference: if we repeat the experiment, the observed difference will change each time, and this will be more pronounced for a smaller number of mice, among others. If we do not use a sufficient number of mice in our experiment, the uncertainty associated with the observed difference might be too large, such that random fluctuations become a plausible explanation for the observed difference. Systematic differences between the two kits, of practically relevant magnitude in either direction, might then be compatible with the data, and we can draw no reliable conclusions from our experiment.

In each case, the statistical analysis—no matter how clever—was doomed before the experiment was even started, while simple ideas from statistical design of experiments would have provided correct and robust results with interpretable conclusions.

1.3 The Language of Experimental Design

By an experiment we understand an investigation where the researcher has full control over selecting and altering the experimental conditions of interest, and we only consider investigations of this type. The selected experimental conditions are called treatments. An experiment is comparative if the responses to several treatments are to be compared or contrasted. The experimental units are the smallest subdivision of the experimental material to which a treatment can be assigned. All experimental units given the same treatment constitute a treatment group. Especially in biology, we often compare treatments to a control group to which some standard experimental conditions are applied; a typical example is using a placebo for the control group, and different drugs for the other treatment groups.

The values observed are called responses and are measured on the response units; these are often identical to the experimental units but need not be. Multiple experimental units are sometimes combined into groupings or blocks, such as mice grouped by litter, or samples grouped by batches of chemicals used for their preparation. More generally, we call any grouping of the experimental material (even with group size one) a unit.

In our example, we selected the mice, used a single sample per mouse, deliberately chose the two specific vendors, and had full control over which kit to assign to which mouse. In other words, the two kits are the treatments and the mice are the experimental units. We took the measured enzyme level of a single sample from a mouse as our response, and samples are therefore the response units. The resulting experiment is comparative, because we contrast the enzyme levels between the two treatment groups.

Figure 1.1: Three designs to determine the difference between two preparation kits A and B based on four mice. A: One sample per mouse. Comparison between averages of samples with same kit. B: Two samples per mouse treated with the same kit. Comparison between averages of mice with same kit requires averaging responses for each mouse first. C: Two samples per mouse each treated with different kit. Comparison between two samples of each mouse, with differences averaged.

In this example, we can coalesce experimental and response units, because we have a single response per mouse and cannot distinguish a sample from a mouse in the analysis, as illustrated in Figure 1.1 A for four mice. Responses from mice with the same kit are averaged, and the kit difference is the difference between these two averages.

By contrast, if we take two samples per mouse and use the same kit for both samples, then the mice are still the experimental units, but each mouse now groups the two response units associated with it. Now, responses from the same mouse are first averaged, and these averages are used to calculate the difference between kits; even though eight measurements are available, this difference is still based on only four mice (Figure 1.1 B).

If we take two samples per mouse, but apply each kit to one of the two samples, then the samples are both the experimental and response units, while the mice are blocks that group the samples. Now, we calculate the difference between kits for each mouse, and then average these differences (Figure 1.1 C).

If we only use one kit and determine the average enzyme level, then this investigation is still an experiment, but is not comparative.

To summarize, the design of an experiment determines the logical structure of the experiment; it consists of (i) a set of treatments (the two kits); (ii) a specification of the experimental units (animals, cell lines, samples) (the mice in Figure 1.1 A,B and the samples in Figure 1.1 C); (iii) a procedure for assigning treatments to units; and (iv) a specification of the response units and the quantity to be measured as a response (the samples and associated enzyme levels).

1.4 Experiment Validity

Before we embark on the more technical aspects of experimental design, we discuss three components for evaluating an experiment’s validity: construct validity, internal validity, and external validity. These criteria are well-established in areas such as educational and psychological research, and have more recently been discussed for animal research (Würbel 2017), where experiments are increasingly scrutinized for their scientific rationale and their design and intended analyses.

1.4.1 Construct Validity

Construct validity concerns the choice of the experimental system for answering our research question. Is the system even capable of providing a relevant answer to the question?

Studying the mechanisms of a particular disease, for example, might require careful choice of an appropriate animal model that shows a disease phenotype and is accessible to experimental interventions. If the animal model is a proxy for drug development for humans, biological mechanisms must be sufficiently similar between animal and human physiologies.

Another important aspect of the construct is the quantity that we intend to measure (the measurand), and its relation to the quantity or property we are interested in. For example, we might measure the concentration of the same chemical compound once in a blood sample and once in a highly purified sample, and these constitute two different measurands, whose values might not be comparable. Often, the quantity of interest (e.g., liver function) is not directly measurable (or even quantifiable) and we measure a biomarker instead. For example, pre-clinical and clinical investigations may use concentrations of proteins or counts of specific cell types from blood samples, such as the CD4+ cell count used as a biomarker for immune system function.

1.4.2 Internal Validity

The internal validity of an experiment concerns the soundness of the scientific rationale, statistical properties such as precision of estimates, and the measures taken against risk of bias. It refers to the validity of claims within the context of the experiment. Statistical design of experiments plays a prominent role in ensuring internal validity, and we briefly discuss the main ideas before providing the technical details and an application to our example in the subsequent sections.

Scientific Rationale and Research Question

The scientific rationale of a study is (usually) not immediately a statistical question. Translating a scientific question into a quantitative comparison amenable to statistical analysis is no small task and often requires careful consideration. It is a substantial, if non-statistical, benefit of using experimental design that we are forced to formulate a precise-enough research question and decide on the main analyses required for answering it before we conduct the experiment. For example, the question: is there a difference between placebo and drug? is insufficiently precise for planning a statistical analysis and determining an adequate experimental design. What exactly is the drug treatment? What should the drug’s concentration be and how is it administered? How do we make sure that the placebo group is comparable to the drug group in all other aspects? What do we measure and what do we mean by “difference”? A shift in average response, a fold-change, a change in response before and after treatment?

The scientific rationale also enters the choice of a potential control group to which we compare responses. The quote

The deep, fundamental question in statistical analysis is ‘Compared to what?’ (Tufte 1997)

highlights the importance of this choice.

There are almost never enough resources to answer all relevant scientific questions. We therefore define a few questions of highest interest, and the main purpose of the experiment is answering these questions in the primary analysis. This intended analysis drives the experimental design to ensure relevant estimates can be calculated and have sufficient precision, and tests are adequately powered. This does not preclude us from conducting additional secondary analyses and exploratory analyses, but we are not willing to enlarge the experiment to ensure that strong conclusions can also be drawn from these analyses.

Risk of Bias

Experimental bias is a systematic difference in response between experimental units in addition to the difference caused by the treatments. The experimental units in the different groups are then not equal in all aspects other than the treatment applied to them. We saw several examples in Section 1.2.

Minimizing the risk of bias is crucial for internal validity and we look at some common measures to eliminate or reduce different types of bias in Section 1.5.

Precision and Effect Size

Another aspect of internal validity is the precision of estimates and the expected effect sizes. Is the experimental setup, in principle, able to detect a difference of relevant magnitude? Experimental design offers several methods for answering this question based on the expected heterogeneity of samples, the measurement error, and other sources of variation: power analysis is a technique for determining the number of samples required to reliably detect a relevant effect size and provide estimates of sufficient precision. More samples yield more precision and more power, but we have to be careful that replication is done at the right level: simply measuring a biological sample multiple times as in Figure 1.1 B yields more measured values, but is pseudo-replication for analyses. Replication should also ensure that the statistical uncertainties of estimates can be gauged from the data of the experiment itself, without additional untestable assumptions. Finally, the technique of blocking, shown in Figure 1.1 C, can remove a substantial proportion of the variation and thereby increase power and precision if we find a way to apply it.
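As an illustration of such a power analysis, the following Python sketch uses the statsmodels library to find the sample size for a two-sample t-test; the effect size, significance level, and power are assumed values chosen only for the example:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect a standardized effect size of 0.8
# (Cohen's d) at a 5% significance level with 80% power.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(f"required sample size per group: {n_per_group:.1f}")
```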

1.4.3 External Validity

The external validity of an experiment concerns its replicability and the generalizability of inferences. An experiment is replicable if its results can be confirmed by an independent new experiment, preferably by a different lab and researcher. Experimental conditions in the replicate experiment usually differ from the original experiment, which provides evidence that the observed effects are robust to such changes. A much weaker condition on an experiment is reproducibility, the property that an independent researcher draws equivalent conclusions based on the data from this particular experiment, using the same analysis techniques. Reproducibility requires publishing the raw data, details on the experimental protocol, and a description of the statistical analyses, preferably with accompanying source code. Many scientific journals subscribe to reporting guidelines to ensure reproducibility and these are also helpful for planning an experiment.

A main threat to replicability and generalizability are too tightly controlled experimental conditions, when inferences only hold for a specific lab under the very specific conditions of the original experiment. Introducing systematic heterogeneity and using multi-center studies effectively broadens the experimental conditions and therefore the inferences for which internal validity is available.

For systematic heterogeneity, experimental conditions are systematically altered in addition to the treatments, and treatment differences estimated for each condition. For example, we might split the experimental material into several batches and use a different day of analysis, sample preparation, batch of buffer, measurement device, and lab technician for each batch. A more general inference is then possible if effect size, effect direction, and precision are comparable between the batches, indicating that the treatment differences are stable over the different conditions.

In multi-center experiments, the same experiment is conducted in several different labs and the results compared and merged. Multi-center approaches are very common in clinical trials and often necessary to reach the required number of patient enrollments.

Generalizability of randomized controlled trials in medicine and animal studies can suffer from overly restrictive eligibility criteria. In clinical trials, patients are often included or excluded based on co-medications and co-morbidities, and the resulting sample of eligible patients might no longer be representative of the patient population. For example, Travers et al. (2007) used the eligibility criteria of 17 randomized controlled trials of asthma treatments and found that out of 749 patients, only a median of 6% (45 patients) would be eligible for an asthma-related randomized controlled trial. This puts a question mark on the relevance of the trials’ findings for asthma patients in general.

1.5 Reducing the Risk of Bias

1.5.1 Randomization of Treatment Allocation

If systematic differences other than the treatment exist between our treatment groups, then the effect of the treatment is confounded with these other differences and our estimates of treatment effects might be biased.

We remove such unwanted systematic differences from our treatment comparisons by randomizing the allocation of treatments to experimental units. In a completely randomized design, each experimental unit has the same chance of being subjected to any of the treatments, and any differences between the experimental units other than the treatments are distributed over the treatment groups. Importantly, randomization is the only method that also protects our experiment against unknown sources of bias: we do not need to know all or even any of the potential differences and yet their impact is eliminated from the treatment comparisons by random treatment allocation.

Randomization has two effects: (i) differences unrelated to treatment become part of the ‘statistical noise’ rendering the treatment groups more similar; and (ii) the systematic differences are thereby eliminated as sources of bias from the treatment comparison.

Randomization transforms systematic variation into random variation.

In our example, a proper randomization would select 10 out of our 20 mice fully at random, such that the probability of any one mouse being picked is 1/20. These ten mice are then assigned to kit A, and the remaining mice to kit B. This allocation is entirely independent of the treatments and of any properties of the mice.

To ensure random treatment allocation, some kind of random process needs to be employed. This can be as simple as shuffling a pack of 10 red and 10 black cards or using a software-based random number generator. Randomization is slightly more difficult if the number of experimental units is not known at the start of the experiment, such as when patients are recruited for an ongoing clinical trial (sometimes called rolling recruitment), and we want to have reasonable balance between the treatment groups at each stage of the trial.
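A software-based randomization for the mouse example could be as simple as the following Python sketch:

```python
import random

mice = [f"mouse_{i}" for i in range(1, 21)]  # labels for the 20 mice

random.seed(2024)  # seeded only so the example is reproducible
kit_a = random.sample(mice, 10)              # 10 mice drawn fully at random
kit_b = [m for m in mice if m not in kit_a]  # the remaining 10 mice
```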

Seemingly random assignments “by hand” are usually no less complicated than fully random assignments, but are always inferior. If surprising results ensue from the experiment, such assignments are subject to unanswerable criticism and suspicion of unwanted bias. Even worse are systematic allocations; they can only remove bias from known causes, and immediately raise red flags under the slightest scrutiny.

The Problem of Undesired Assignments

Even with a fully random treatment allocation procedure, we might end up with an undesirable allocation. For our example, the treatment group of kit A might—just by chance—contain mice that are all bigger or more active than those in the other treatment group. Statistical orthodoxy recommends using the design nevertheless, because only full randomization guarantees valid estimates of residual variance and unbiased estimates of effects. This argument, however, concerns the long-run properties of the procedure and seems of little help in this specific situation. Why should we care if the randomization yields correct estimates under replication of the experiment, if the particular experiment is jeopardized?

Another solution is to create a list of all possible allocations that we would accept and randomly choose one of these allocations for our experiment. The analysis should then reflect this restriction in the possible randomizations, which often renders this approach difficult to implement.

The most pragmatic method is to reject highly undesirable designs and compute a new randomization (Cox 1958). Undesirable allocations are unlikely to arise for large sample sizes, and we might accept a small bias in estimation for small sample sizes, when uncertainty in the estimated treatment effect is already high. In this approach, whenever we reject a particular outcome, we must also be willing to reject the outcome if we permute the treatment level labels. If we reject eight big and two small mice for kit A, then we must also reject two big and eight small mice. We must also be transparent and report a rejected allocation, so that critics may come to their own conclusions about potential biases and their remedies.
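
A minimal sketch of this reject-and-redraw procedure follows; the body weights and the acceptance threshold max_diff are invented for illustration. Note that the criterion (absolute difference in group mean weights) is symmetric under swapping the kit labels, as required above.

```python
import random
import statistics

# Hypothetical body weights in grams; in practice these would be measured.
weights = {f"mouse_{i}": 18 + (i - 1) % 10 for i in range(1, 21)}

def randomize_until_acceptable(weights, max_diff=1.5, seed=1):
    """Redraw a completely random 10/10 split until the two group mean
    weights differ by at most max_diff grams (a symmetric criterion)."""
    rng = random.Random(seed)
    mice = list(weights)
    while True:
        rng.shuffle(mice)
        group_a, group_b = mice[:10], mice[10:]
        diff = abs(statistics.mean(weights[m] for m in group_a)
                   - statistics.mean(weights[m] for m in group_b))
        if diff <= max_diff:   # accept this allocation; otherwise redraw
            return group_a, group_b

kit_A, kit_B = randomize_until_acceptable(weights)
```

In line with the transparency requirement, the rejection criterion and any rejected draws would be reported alongside the final allocation.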

1.5.2 Blinding

Bias in treatment comparisons is also introduced if treatment allocation is random, but responses cannot be measured entirely objectively, or if knowledge of the assigned treatment affects the response. In clinical trials, for example, patients might react differently when they know they are on a placebo treatment, an effect known as cognitive bias. In animal experiments, caretakers might report more abnormal behavior for animals on a more severe treatment. Cognitive bias can be eliminated by concealing the treatment allocation from technicians or participants of a clinical trial, a technique called single-blinding.

If response measures are partially based on professional judgement (such as a clinical scale), the patient or physician might unconsciously report lower scores for a placebo treatment, a phenomenon known as observer bias. Its removal requires double blinding, where treatment allocations are additionally concealed from the experimentalist.

Blinding requires randomized treatment allocation to begin with and substantial effort might be needed to implement it. Drug companies, for example, have to go to great lengths to ensure that a placebo looks, tastes, and feels similar enough to the actual drug. Additionally, blinding is often done by coding the treatment conditions and samples, and effect sizes and statistical significance are calculated before the code is revealed.

In clinical trials, double-blinding creates a conflict of interest. The attending physicians do not know which patient received which treatment, and thus an accumulation of side-effects cannot be linked to any treatment. For this reason, clinical trials have a data monitoring committee not involved in the final analysis that performs interim analyses of efficacy and safety at predefined intervals. If severe problems are detected, the committee might recommend altering or aborting the trial. The same might happen if one treatment already shows overwhelming evidence of superiority, such that it becomes unethical to withhold this treatment from the other patients.

1.5.3 Analysis Plan and Registration

An often overlooked source of bias has been termed the researcher degrees of freedom or garden of forking paths in the data analysis. For any set of data, there are many different options for its analysis: some results might be considered outliers and discarded, assumptions are made on error distributions and appropriate test statistics, and different covariates might be included in a regression model. Often, multiple hypotheses are investigated and tested, and analyses are done separately on various (overlapping) subgroups. Hypotheses formed after looking at the data require additional care in their interpretation; almost never will \(p\)-values for these ad hoc or post hoc hypotheses be statistically justifiable. Many different measured response variables invite fishing expeditions, where patterns in the data are sought without an underlying hypothesis. Only reporting those sub-analyses that gave ‘interesting’ findings invariably leads to biased conclusions and is called cherry-picking or \(p\)-hacking (or much less flattering names).

The statistical analysis is always part of a larger scientific argument and we should consider the necessary computations in relation to building our scientific argument about the interpretation of the data. In addition to the statistical calculations, this interpretation requires substantial subject-matter knowledge and includes (many) non-statistical arguments. Two quotes highlight that experiment and analysis are a means to an end and not the end in itself.

There is a boundary in data interpretation beyond which formulas and quantitative decision procedures do not go, where judgment and style enter. (Abelson 1995)
Often, perfectly reasonable people come to perfectly reasonable decisions or conclusions based on nonstatistical evidence. Statistical analysis is a tool with which we support reasoning. It is not a goal in itself. (Bailar III 1981)

There is often a grey area between exploiting researcher degrees of freedom to arrive at a desired conclusion, and creative yet informed analyses of data. One way to navigate this area is to distinguish between exploratory studies and confirmatory studies. The former have no clearly stated scientific question, but are used to generate interesting hypotheses by identifying potential associations or effects that are then further investigated. Conclusions from these studies are very tentative and must be reported honestly as such. In contrast, standards are much higher for confirmatory studies, which investigate a specific predefined scientific question. Analysis plans and pre-registration of an experiment are accepted means for demonstrating lack of bias due to researcher degrees of freedom, and separating primary from secondary analyses allows emphasizing the main goals of the study.

Analysis Plan

The analysis plan is written before conducting the experiment and details the measurands and estimands, the hypotheses to be tested together with a power and sample size calculation, a discussion of relevant effect sizes, the detection and handling of outliers and missing data, and the steps for data normalization such as transformations and baseline corrections. If a regression model is required, its factors and covariates are outlined. Particularly in biology, handling measurements below the limit of quantification and saturation effects requires careful consideration.
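
The power and sample size component of such a plan is easily scripted in advance. The following sketch uses the TTestIndPower class from statsmodels; the effect size, significance level, and power target are illustrative assumptions rather than values prescribed by the text.

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size of a two-sample t-test, given an
# assumed standardized effect size (Cohen's d) and conventional targets.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.8,          # assumed effect size (illustrative)
    alpha=0.05,               # significance level fixed in the plan
    power=0.80,               # desired probability of detecting the effect
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:.1f}")  # about 26
```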

In the context of clinical trials, the problem of estimands has become a recent focus of attention. An estimand is the target of a statistical estimation procedure, for example the true average difference in enzyme levels between the two preparation kits. A main problem in many studies is post-randomization events that can change the estimand, even if the estimation procedure remains the same. For example, if kit B fails to produce usable samples for measurement in five out of ten cases because the enzyme level was too low, while kit A could handle these enzyme levels perfectly fine, then this might severely exaggerate the observed difference between the two kits. Similar problems arise in drug trials, when some patients stop taking one of the drugs due to side-effects or other complications.

Registration

Registration of experiments is an even stricter measure used in conjunction with an analysis plan and is becoming standard in clinical trials. Here, information about the trial, including the analysis plan, the procedure to recruit patients, and stopping criteria, is registered in a public database. Publications based on the trial then refer to this registration, so that reviewers and readers can compare what the researchers intended to do and what they actually did. Similar portals for pre-clinical and translational research are also available.

1.6 Notes and Summary

The problem of measurements and measurands is further discussed for statistics in Hand (1996) and specifically for biological experiments in Coxon, Longstaff, and Burns (2019). A general review of methods for handling missing data is Dong and Peng (2013). The different roles of randomization are emphasized in Cox (2009).

Two well-known reporting guidelines are the ARRIVE guidelines for animal research (Kilkenny et al. 2010) and the CONSORT guidelines for clinical trials (Moher et al. 2010). Guidelines describing the minimal information required for reproducing experimental results have been developed for many types of experimental techniques, including microarray (MIAME), RNA sequencing (MINSEQE), metabolomics (MSI) and proteomics (MIAPE) experiments; the FAIRsharing initiative provides a more comprehensive collection (Sansone et al. 2019).

The problems of experimental design in animal experiments and particularly translational research are discussed in Couzin-Frankel (2013). Multi-center studies are now considered for these investigations, and using a second laboratory already increases reproducibility substantially (Richter et al. 2010; Richter 2017; Voelkl et al. 2018; Karp 2018) and allows standardizing the treatment effects (Kafkafi et al. 2017). First attempts at using designs similar to clinical trials have been reported (Llovera and Liesz 2016). Exploratory-confirmatory research and external validity for animal studies are discussed in Kimmelman, Mogil, and Dirnagl (2014) and Pound and Ritskes-Hoitinga (2018). Further information on pilot studies is found in Moore et al. (2011), Sim (2019), and Thabane et al. (2010).

The deliberate use of statistical analyses and their interpretation for supporting a larger argument was called statistics as principled argument (Abelson 1995). Employing useless statistical analysis without reference to the actual scientific question is surrogate science (Gigerenzer and Marewski 2014), and adaptive thinking is integral to meaningful statistical analysis (Gigerenzer 2002).

In an experiment, the investigator has full control over the experimental conditions applied to the experiment material. The experimental design gives the logical structure of an experiment: the units describing the organization of the experimental material, the treatments and their allocation to units, and the response. Statistical design of experiments includes techniques to ensure internal validity of an experiment, and methods to make inference from experimental data efficient.


Introduction to Research Statistical Analysis: An Overview of the Basics

Christian Vandever

HCA Healthcare Graduate Medical Education

Description

This article covers many statistical ideas essential to research statistical analysis. Sample size is explained through the concepts of statistical significance level and power. Variable types and definitions are included to clarify necessities for how the analysis will be interpreted. Categorical and quantitative variable types are defined, as well as response and predictor variables. Statistical tests described include t-tests, ANOVA and chi-square tests. Multiple regression is also explored for both logistic and linear regression. Finally, the most common statistics produced by these methods are explored.

Introduction

Statistical analysis is necessary for any research project seeking to make quantitative conclusions. The following is a primer for research-based statistical analysis. It is intended to be a high-level overview of appropriate statistical testing, while not diving too deep into any specific methodology. Some of the information is more applicable to retrospective projects, where analysis is performed on data that has already been collected, but most of it will be suitable to any type of research. This primer is meant to help the reader interpret research results in coordination with a statistician, not to perform the actual analysis. Analysis is commonly performed using statistical programming software such as R, SAS or SPSS. These allow for analysis to be replicated while minimizing the risk of error. Resources are listed later for those working on analysis without a statistician.

After coming up with a hypothesis for a study, including any variables to be used, one of the first steps is to think about the patient population to which the question applies. Results are only relevant to the population that the underlying data represents. Since it is impractical to include everyone with a certain condition, a subset of the population of interest should be taken. This subset should be large enough to have power, which means there is enough data to deliver significant results and accurately reflect the study’s population.

The first statistics of interest are related to significance level and power: alpha and beta. Alpha (α) is the significance level and probability of a type I error, the rejection of the null hypothesis when it is true. The null hypothesis is generally that there is no difference between the groups compared. A type I error is also known as a false positive. An example would be an analysis that finds one medication statistically better than another, when in reality there is no difference in efficacy between the two. Beta (β) is the probability of a type II error, the failure to reject the null hypothesis when it is actually false. A type II error is also known as a false negative. This occurs when the analysis finds there is no difference in two medications when in reality one works better than the other. Power is defined as 1 − β and should be calculated prior to running any sort of statistical testing. Ideally, alpha should be as small as possible while power should be as large as possible. Power generally increases with a larger sample size, but so do cost and the effect of any bias in the study design. Additionally, as the sample size gets bigger, the chance of a statistically significant result goes up even though these results can be small differences that do not matter practically. Power calculators therefore incorporate the magnitude of the effect, so that the required sample size targets differences large enough to have an actual impact. The calculators take inputs like the mean, effect size and desired power, and output the required minimum sample size for analysis. Effect size is calculated using statistical information on the variables of interest. If that information is not available, most tests have commonly used values for small, medium or large effect sizes.
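
Alpha and power are long-run properties of the testing procedure, and a small simulation can make them concrete. The sketch below is illustrative: two-sample t-tests are run on simulated data, first with no true difference between groups (estimating the type I error rate) and then with a true difference of half a standard deviation (estimating power).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, n_sims = 30, 0.05, 10_000

# Type I error: both groups come from the same distribution.
false_positives = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
    for _ in range(n_sims)
)

# Power: group two is shifted by 0.5 standard deviations.
true_positives = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0.5, 1, n)).pvalue < alpha
    for _ in range(n_sims)
)

print(f"Estimated type I error rate: {false_positives / n_sims:.3f}")  # near alpha
print(f"Estimated power (1 - beta): {true_positives / n_sims:.3f}")    # rises with n
```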

When the desired patient population is decided, the next step is to define the variables previously chosen to be included. Variables come in different types that determine which statistical methods are appropriate and useful. One way variables can be split is into categorical and quantitative variables. (Table 1) Categorical variables place patients into groups, such as gender, race and smoking status. Quantitative variables measure or count some quantity of interest. Common quantitative variables in research include age and weight. An important note is that there can often be a choice of whether to treat a variable as quantitative or categorical. For example, in a study looking at body mass index (BMI), BMI could be defined as a quantitative variable or as a categorical variable, with each patient’s BMI listed as a category (underweight, normal, overweight and obese) rather than as the discrete value. The decision whether a variable is quantitative or categorical will affect what conclusions can be made when interpreting results from statistical tests. Keep in mind that since quantitative variables are treated on a continuous scale, it would be inappropriate to transform a variable like which medication was given into a quantitative variable with values 1, 2 and 3.

Categorical vs. Quantitative Variables

Categorical Variables | Quantitative Variables
Categorize patients into discrete groups | Continuous values that measure a variable
Patient categories are mutually exclusive | For time-based studies, there would be a new variable for each measurement at each time
Examples: race, smoking status, demographic group | Examples: age, weight, heart rate, white blood cell count

Both of these types of variables can also be split into response and predictor variables. (Table 2) Predictor variables are explanatory, or independent, variables that help explain changes in a response variable. Conversely, response variables are outcome, or dependent, variables whose changes can be partially explained by the predictor variables.

Response vs. Predictor Variables

Response Variables | Predictor Variables
Outcome variables | Explanatory variables
Should be the result of the predictor variables | Should help explain changes in the response variables
One variable per statistical test | Can be multiple variables that may have an impact on the response variable
Can be categorical or quantitative | Can be categorical or quantitative

Choosing the correct statistical test depends on the types of variables defined and the question being answered. Some common statistical tests include t-tests, ANOVA and chi-square tests.

T-tests compare whether there are differences in a quantitative variable between two values of a categorical variable. For example, a t-test could be useful to compare the length of stay for knee replacement surgery patients between those that took apixaban and those that took rivaroxaban. A t-test could examine whether there is a statistically significant difference in the length of stay between the two groups. The t-test will output a p-value, a number between zero and one, which represents the probability that the two groups could be as different as they are in the data, if they were actually the same. A value closer to zero suggests that the difference, in this case for length of stay, is more statistically significant than a number closer to one. Prior to collecting the data, set a significance level, the previously defined alpha. Alpha is typically set at 0.05, but is commonly reduced in order to limit the chance of a type I error, or false positive. Going back to the example above, if alpha is set at 0.05 and the analysis gives a p-value of 0.039, then a statistically significant difference in length of stay is observed between apixaban and rivaroxaban patients. If the analysis gives a p-value of 0.91, then there was no statistical evidence of a difference in length of stay between the two medications. Other statistical summaries or methods examine how big of a difference that might be. These other summaries are known as post-hoc analysis since they are performed after the original test to provide additional context to the results.
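
A hedged sketch of this comparison with scipy, using made-up lengths of stay in days (the drug names come from the example above; the numbers are purely illustrative):

```python
from scipy import stats

apixaban    = [2.1, 3.0, 2.5, 3.4, 2.8, 2.2, 3.1, 2.6]  # length of stay, days
rivaroxaban = [2.9, 3.5, 3.2, 4.0, 3.3, 2.7, 3.8, 3.1]

result = stats.ttest_ind(apixaban, rivaroxaban)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
# Compare the p-value against the preset alpha (for example 0.05) to decide
# whether the difference in mean length of stay is statistically significant.
```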

Analysis of variance, or ANOVA, tests can observe mean differences in a quantitative variable between values of a categorical variable, typically with three or more values to distinguish from a t-test. ANOVA could add patients given dabigatran to the previous population and evaluate whether the length of stay was significantly different across the three medications. If the p-value is lower than the designated significance level then the hypothesis that length of stay was the same across the three medications is rejected. Summaries and post-hoc tests also could be performed to look at the differences between length of stay and which individual medications may have observed statistically significant differences in length of stay from the other medications. A chi-square test examines the association between two categorical variables. An example would be to consider whether the rate of having a post-operative bleed is the same across patients provided with apixaban, rivaroxaban and dabigatran. A chi-square test can compute a p-value determining whether the bleeding rates were significantly different or not. Post-hoc tests could then give the bleeding rate for each medication, as well as a breakdown as to which specific medications may have a significantly different bleeding rate from each other.
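
Both tests follow the same pattern in code. The sketch below reuses the illustrative length-of-stay data and adds invented bleed counts; only the input shapes differ between the two tests.

```python
from scipy import stats

# One-way ANOVA: length of stay (days) across three medications.
apixaban    = [2.1, 3.0, 2.5, 3.4, 2.8, 2.2, 3.1, 2.6]
rivaroxaban = [2.9, 3.5, 3.2, 4.0, 3.3, 2.7, 3.8, 3.1]
dabigatran  = [2.4, 3.1, 2.9, 3.6, 2.6, 2.8, 3.3, 3.0]
f_stat, p_anova = stats.f_oneway(apixaban, rivaroxaban, dabigatran)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")

# Chi-square: association between medication and post-operative bleeding.
# Rows are medications; columns are counts of [bleed, no bleed].
table = [[5, 95],
         [8, 92],
         [12, 88]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
print(f"Chi-square: chi2 = {chi2:.2f}, p = {p_chi2:.3f}")
```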

A slightly more advanced way of examining a question can come through multiple regression. Regression allows more predictor variables to be analyzed and can act as a control when looking at associations between variables. Common control variables are age, sex and any comorbidities likely to affect the outcome variable that are not closely related to the other explanatory variables. Control variables can be especially important in reducing the effect of bias in a retrospective population. Since retrospective data was not built with the research question in mind, it is important to eliminate threats to the validity of the analysis. Testing that controls for confounding variables, such as regression, is often more valuable with retrospective data because it can ease these concerns. The two main types of regression are linear and logistic. Linear regression is used to predict differences in a quantitative, continuous response variable, such as length of stay. Logistic regression predicts differences in a dichotomous, categorical response variable, such as 90-day readmission. So whether the outcome variable is categorical or quantitative, regression can be appropriate. An example for each of these types could be found in two similar cases. For both examples define the predictor variables as age, gender and anticoagulant usage. In the first, use the predictor variables in a linear regression to evaluate their individual effects on length of stay, a quantitative variable. For the second, use the same predictor variables in a logistic regression to evaluate their individual effects on whether the patient had a 90-day readmission, a dichotomous categorical variable. Analysis can compute a p-value for each included predictor variable to determine whether they are significantly associated. The statistical tests in this article generate an associated test statistic which determines the probability the results could be acquired given that there is no association between the compared variables. These results often come with coefficients which can give the degree of the association and the degree to which one variable changes with another. Most tests, including all listed in this article, also have confidence intervals, which give a range for the correlation with a specified level of confidence. Even if these tests do not give statistically significant results, the results are still important. Not reporting statistically insignificant findings creates a bias in research. Ideas can be repeated enough times that eventually statistically significant results are reached, even though there is no true significance. In some cases with very large sample sizes, p-values will almost always be significant. In this case the effect size is critical as even the smallest, meaningless differences can be found to be statistically significant.
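
The two regressions described here differ only in the response variable and the model function. A sketch with simulated patient data using statsmodels (all variable names and effects are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "age": rng.integers(40, 90, n),
    "male": rng.integers(0, 2, n),
    "anticoagulant": rng.choice(["apixaban", "rivaroxaban"], n),
})
# Simulated outcomes: one quantitative and one dichotomous response.
df["length_of_stay"] = 1.5 + 0.02 * df["age"] + rng.normal(0, 0.5, n)
df["readmit_90d"] = rng.integers(0, 2, n)

# Linear regression for the quantitative response.
linear = smf.ols("length_of_stay ~ age + male + anticoagulant", data=df).fit()

# Logistic regression for the dichotomous response.
logistic = smf.logit("readmit_90d ~ age + male + anticoagulant", data=df).fit()

# Each summary reports a coefficient, p-value and confidence interval for
# every predictor, the quantities discussed above.
print(linear.summary())
print(logistic.summary())
```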

These variables and tests are just some things to keep in mind before, during and after the analysis process in order to make sure that the statistical reports are supporting the questions being answered. The patient population, types of variables and statistical tests are all important things to consider in the process of statistical analysis. Any results are only as useful as the process used to obtain them. This primer can be used as a reference to help ensure appropriate statistical analysis.

Definitions

Alpha (α): the significance level and probability of a type I error, the probability of a false positive
Analysis of variance/ANOVA: test observing mean differences in a quantitative variable between values of a categorical variable, typically with three or more values to distinguish from a t-test
Beta (β): the probability of a type II error, the probability of a false negative
Categorical variable: places patients into groups, such as gender, race or smoking status
Chi-square test: examines the association between two categorical variables
Confidence interval: a range for the correlation with a specified level of confidence, 95% for example
Control variables: variables likely to affect the outcome variable that are not closely related to the other explanatory variables
Hypothesis: the idea being tested by statistical analysis
Linear regression: regression used to predict differences in a quantitative, continuous response variable, such as length of stay
Logistic regression: regression used to predict differences in a dichotomous, categorical response variable, such as 90-day readmission
Multiple regression: regression utilizing more than one predictor variable
Null hypothesis: the hypothesis that there are no significant differences for the variable(s) being tested
Patient population: the population the data is collected to represent
Post-hoc analysis: analysis performed after the original test to provide additional context to the results
Power: 1 − beta, the probability of avoiding a type II error (a false negative)
Predictor variable: explanatory, or independent, variable that helps explain changes in a response variable
p-value: a value between zero and one representing the probability of obtaining results at least as extreme as those observed if the null hypothesis were true, usually compared against a significance level to judge statistical significance
Quantitative variable: variable measuring or counting some quantity of interest
Response variable: outcome, or dependent, variable whose changes can be partially explained by the predictor variables
Retrospective study: a study using previously existing data that was not originally collected for the purposes of the study
Sample size: the number of patients or observations used for the study
Significance level: alpha, the probability of a type I error, usually compared to a p-value to determine statistical significance
Statistical analysis: analysis of data using statistical testing to examine a research hypothesis
Statistical testing: testing used to examine the validity of a hypothesis using statistical calculations
Statistical significance: the determination of whether to reject the null hypothesis, based on whether the p-value falls below a predetermined significance level
T-test: test comparing whether there are differences in a quantitative variable between two values of a categorical variable

Funding Statement

This research was supported (in whole or in part) by HCA Healthcare and/or an HCA Healthcare affiliated entity.

Conflicts of Interest

The author declares he has no conflicts of interest.

Christian Vandever is an employee of HCA Healthcare Graduate Medical Education, an organization affiliated with the journal’s publisher.

The views expressed in this publication represent those of the author(s) and do not necessarily represent the official views of HCA Healthcare or any of its affiliated entities.


Experimental Design for ANOVA

There is a close relationship between experimental design and statistical analysis. The way that an experiment is designed determines the types of analyses that can be appropriately conducted.

In this lesson, we review aspects of experimental design that a researcher must understand in order to properly interpret experimental data with analysis of variance.

What Is an Experiment?

An experiment is a procedure carried out to investigate cause-and-effect relationships. For example, the experimenter may manipulate one or more variables (independent variables) to assess the effect on another variable (the dependent variable).

Conclusions are reached on the basis of data. If the dependent variable is unaffected by changes in independent variables, we conclude that there is no causal relationship between the dependent variable and the independent variables. On the other hand, if the dependent variable is affected, we conclude that a causal relationship exists.

Experimenter Control

One of the features that distinguish a true experiment from other types of studies is experimenter control of the independent variable(s).

In a true experiment, an experimenter controls the level of the independent variable administered to each subject. For example, dosage level could be an independent variable in a true experiment, because an experimenter can manipulate the dosage administered to any subject.

What Is a Quasi-Experiment?

A quasi-experiment is a study that lacks a critical feature of a true experiment. Quasi-experiments can provide insights into cause-and-effect relationships; but evidence from a quasi-experiment is not as persuasive as evidence from a true experiment. True experiments are the gold standard for causal analysis.

A study that used gender or IQ as an independent variable would be an example of a quasi-experiment, because the study lacks experimenter control over the independent variable; that is, an experimenter cannot manipulate the gender or IQ of a subject.

As we discuss experimental design in the context of a tutorial on analysis of variance, it is important to point out that experimenter control is a requirement for a true experiment; but it is not a requirement for analysis of variance. Analysis of variance can be used with true experiments and with quasi-experiments that lack only experimenter control over the independent variable.

Note: Henceforth in this tutorial, when we refer to an experiment, we will be referring to a true experiment or to a quasi-experiment that is almost a true experiment, in the sense that it lacks only experimenter control over the independent variable.

What Is Experimental Design?

The term experimental design refers to a plan for conducting an experiment in such a way that research results will be valid and easy to interpret. This plan includes three interrelated activities:

  • Write statistical hypotheses.
  • Collect data.
  • Analyze data.

Let's look in a little more detail at these three activities.

Statistical Hypotheses

A statistical hypothesis is an assumption about the value of a population parameter. There are two types of statistical hypotheses:

  • Null hypothesis. H0: μi = μj. Here, μi is the population mean for group i, and μj is the population mean for group j. This hypothesis assumes that the population means in groups i and j are equal.

  • Alternative hypothesis. H1: μi ≠ μj. This hypothesis assumes that the population means in groups i and j are not equal.

The null hypothesis and the alternative hypothesis are written to be mutually exclusive. If one is true, the other is not.

Experiments rely on sample data to test the null hypothesis. If experimental results, based on sample statistics, are consistent with the null hypothesis, the null hypothesis cannot be rejected; otherwise, the null hypothesis is rejected in favor of the alternative hypothesis.

Data Collection

The data collection phase of experimental design is all about methodology - how to run the experiment to produce valid, relevant statistics that can be used to test a null hypothesis.

Identify Variables

Every experiment exists to examine a cause-and-effect relationship. With respect to the relationship under investigation, an experimental design needs to account for three types of variables:

  • Dependent variable. The dependent variable is the outcome being measured, the effect in a cause-and-effect relationship.
  • Independent variables. An independent variable is a variable that is thought to be a possible cause in a cause-and-effect relationship.
  • Extraneous variables. An extraneous variable is any other variable that could affect the dependent variable, but is not explicitly included in the experiment.

Note: The independent variables that are explicitly included in an experiment are also called factors.

Define Treatment Groups

In an experiment, treatment groups are built around factors, each group defined by a unique combination of factor levels.

For example, suppose that a drug company wants to test a new cholesterol medication. The dependent variable is total cholesterol level. One independent variable is dosage. And, since some drugs affect men and women differently, the researchers include a second independent variable - gender.

This experiment has two factors - dosage and gender. The dosage factor has three levels (0 mg, 50 mg, and 100 mg), and the gender factor has two levels (male and female). Given this combination of factors and levels, we can define six unique treatment groups, as shown below:

Gender | 0 mg | 50 mg | 100 mg
Male | Group 1 | Group 2 | Group 3
Female | Group 4 | Group 5 | Group 6

Note: The experiment described above is an example of a quasi-experiment, because the gender factor cannot be manipulated by the experimenter.
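
The six treatment groups can also be enumerated programmatically as the Cartesian product of the factor levels; a small sketch:

```python
from itertools import product

gender = ["Male", "Female"]           # two levels
dosage = ["0 mg", "50 mg", "100 mg"]  # three levels

# Each unique combination of factor levels defines one treatment group.
for number, (g, d) in enumerate(product(gender, dosage), start=1):
    print(f"Group {number}: {g}, {d}")
# 2 levels x 3 levels = 6 treatment groups, matching the table above.
```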

Select Factor Levels

A factor in an experiment can be described by the way in which factor levels are chosen for inclusion in the experiment:

  • Fixed factor. The experiment includes all factor levels about which inferences are to be made.
  • Random factor. The experiment includes a random sample of levels from a much bigger population of factor levels.

Experiments can be described by the presence or absence of fixed or random factors:

  • Fixed-effects model. All of the factors in the experiment are fixed.
  • Random-effects model. All of the factors in the experiment are random.
  • Mixed model. At least one factor in the experiment is fixed, and at least one factor is random.

The use of fixed factors versus random factors has implications for how experimental results are interpreted. With a fixed factor, results apply only to factor levels that are explicitly included in the experiment. With a random factor, results apply to every factor level from the population.

For example, consider the cholesterol experiment described above. Suppose the experimenter only wanted to test the effect of three particular dosage levels - 0 mg, 50 mg, and 100 mg. He would include those dosage levels in the experiment, and any research conclusions would apply to only those particular dosage levels. This would be an example of a fixed-effects model.

On the other hand, suppose the experimenter wanted to test the effect of any dosage level. Since it is not practical to test every dosage level, the experimenter might choose three dosage levels at random from the population of possible dosage levels. Any research conclusions would apply not only to the selected dosage levels, but also to other dosage levels that were not included explicitly in the experiment. This would be an example of a random-effects model.

Select Experimental Units

The experimental unit is the entity that provides values for the dependent variable. Depending on the needs of the study, an experimental unit may be a person, animal, plant, product - anything. For example, in the cholesterol study described above, researchers measured cholesterol level (the dependent variable) of people; so the experimental units were people.

Note: When the experimental units are people, they are often referred to as subjects. Some researchers prefer the term participant, because subject has a connotation that the person is subservient.

If time and money were no object, you would include the entire population of experimental units in your experiment. In the real world, where there is never enough time or money, you will usually select a sample of experimental units from the population.

Ultimately, you want to use sample data to make inferences about population parameters. With that in mind, it is best practice to draw a random sample of experimental units from the population. This provides a defensible, statistical basis for generalizing from sample findings to the larger population.

Finally, it is important to consider sample size. The larger the sample, the greater the statistical power, and the more confidence you can have in your results.

Assign Experimental Units to Treatments

Having selected a sample of experimental units, we need to assign each unit to one or more treatment groups. Here are two ways that you might assign experimental units to groups:

  • Independent groups design. Each experimental unit is randomly assigned to one, and only one, treatment group. This is also known as a between-subjects design.
  • Repeated measures design. Experimental units are assigned to more than one treatment group. This is also known as a within-subjects design.

Control for Extraneous Variables

Extraneous variables can mask effects of independent variables. Therefore, a good experimental design controls potential effects of extraneous variables. Here are a few strategies for controlling extraneous variables:

  • Randomization. Assign subjects randomly to treatment groups. This tends to distribute effects of extraneous variables evenly across groups.
  • Repeated measures design. To control for individual differences between subjects (age, attitude, religion, etc.), assign each subject to multiple treatments. This strategy is called using subjects as their own control.
  • Counterbalancing. In repeated measures designs, randomize or reverse the order of treatments among subjects to control for order effects (e.g., fatigue, practice).

As we describe specific experimental designs in upcoming lessons, we will point out the strategies that are used with each design to control the confounding effects of extraneous variables.

Data Analysis

Researchers follow a formal process to determine whether to reject a null hypothesis, based on sample data. This process, called hypothesis testing, consists of five steps:

  • Formulate hypotheses. This involves stating the null and alternative hypotheses. Because the hypotheses are mutually exclusive, if one is true, the other must be false.
  • Choose the test statistic. This involves specifying the statistic that will be used to assess the validity of the null hypothesis. Typically, in analysis of variance studies, researchers compute an F ratio to test hypotheses.
  • Compute a P-value, based on sample data. Suppose the observed test statistic is equal to S. The P-value is the probability that the experiment would yield a test statistic as extreme as S, assuming the null hypothesis is true.
  • Choose a significance level. The significance level, denoted by α, is the probability of rejecting the null hypothesis when it is really true. Researchers often choose a significance level of 0.05 or 0.01.
  • Test the null hypothesis. If the P-value is smaller than the significance level, we reject the null hypothesis; if it is larger, we fail to reject.

A good experimental design includes a precise plan for data analysis. Before the first data point is collected, a researcher should know how experimental data will be processed to accept or reject the null hypotheses.
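
For the F ratio used in analysis of variance, the P-value is the upper-tail probability of the F distribution at the observed statistic. A small sketch with illustrative numbers:

```python
from scipy import stats

# Suppose an ANOVA yields F = 4.2 with 2 and 27 degrees of freedom
# (both numbers invented for illustration).
f_observed, df_between, df_within = 4.2, 2, 27
p_value = stats.f.sf(f_observed, df_between, df_within)  # upper-tail probability

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"P = {p_value:.4f}: {decision}")
```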

Test Your Understanding

In a well-designed experiment, which of the following statements is true?

I. The null hypothesis and the alternative hypothesis are mutually exclusive.
II. The null hypothesis is subjected to statistical test.
III. The alternative hypothesis is subjected to statistical test.

(A) I only (B) II only (C) III only (D) I and II (E) I and III

The correct answer is (D). The null hypothesis and the alternative hypothesis are mutually exclusive; if one is true, the other must be false. Only the null hypothesis is subjected to statistical test. When the null hypothesis is accepted, the alternative hypothesis is rejected. The alternative hypothesis is not tested explicitly.

In a true experiment, each subject is assigned to only one treatment group. What type of design is this?

(A) Independent groups design (B) Repeated measures design (C) Within-subjects design (D) None of the above (E) All of the above

The correct answer is (A). In an independent groups design, each experimental unit is assigned to one treatment group. In the other two designs, each experimental unit is assigned to more than one treatment group.

In a true experiment, which of the following does the experimenter control?

(A) How to manipulate independent variables. (B) How to assign subjects to treatment conditions. (C) How to control for extraneous variables. (D) None of the above (E) All of the above

The correct answer is (E). The experimenter chooses factors and factor levels for the experiment, assigns experimental units to treatment groups (often through a random process), and implements strategies (randomization, counterbalancing, etc.) to control the influence of extraneous variables.


Experimental design


Data for statistical studies are obtained by conducting either experiments or surveys. Experimental design is the branch of statistics that deals with the design and analysis of experiments. The methods of experimental design are widely used in the fields of agriculture, medicine, biology, marketing research, and industrial production.


In an experimental study, variables of interest are identified. One or more of these variables, referred to as the factors of the study, are controlled so that data may be obtained about how the factors influence another variable referred to as the response variable, or simply the response. As a case in point, consider an experiment designed to determine the effect of three different exercise programs on the cholesterol level of patients with elevated cholesterol. Each patient is referred to as an experimental unit, the response variable is the cholesterol level of the patient at the completion of the program, and the exercise program is the factor whose effect on cholesterol level is being investigated. Each of the three exercise programs is referred to as a treatment.

Three of the more widely used experimental designs are the completely randomized design, the randomized block design, and the factorial design. In a completely randomized experimental design, the treatments are randomly assigned to the experimental units. For instance, applying this design method to the cholesterol-level study, the three types of exercise program (treatment) would be randomly assigned to the experimental units (patients).

The use of a completely randomized design will yield less precise results when factors not accounted for by the experimenter affect the response variable. Consider, for example, an experiment designed to study the effect of two different gasoline additives on the fuel efficiency, measured in miles per gallon (mpg), of full-size automobiles produced by three manufacturers. Suppose that 30 automobiles, 10 from each manufacturer, were available for the experiment. In a completely randomized design the two gasoline additives (treatments) would be randomly assigned to the 30 automobiles, with each additive being assigned to 15 different cars. Suppose that manufacturer 1 has developed an engine that gives its full-size cars a higher fuel efficiency than those produced by manufacturers 2 and 3. A completely randomized design could, by chance, assign gasoline additive 1 to a larger proportion of cars from manufacturer 1. In such a case, gasoline additive 1 might be judged to be more fuel efficient when in fact the difference observed is actually due to the better engine design of automobiles produced by manufacturer 1. To prevent this from occurring, a statistician could design an experiment in which both gasoline additives are tested using five cars produced by each manufacturer; in this way, any effects due to the manufacturer would not affect the test for significant differences due to gasoline additive. In this revised experiment, each of the manufacturers is referred to as a block, and the experiment is called a randomized block design. In general, blocking is used in order to enable comparisons among the treatments to be made within blocks of homogeneous experimental units.
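
A minimal sketch of this blocked assignment, assuming the cars are simply numbered within each manufacturer: the randomization happens separately inside every block, so each additive is tested on exactly five cars per manufacturer.

```python
import random

rng = random.Random(7)
assignment = {}
for manufacturer in ("manufacturer_1", "manufacturer_2", "manufacturer_3"):
    cars = [f"{manufacturer}_car_{i}" for i in range(1, 11)]
    rng.shuffle(cars)  # randomize within the block only
    assignment[manufacturer] = {
        "additive_1": cars[:5],  # five cars per additive in every block
        "additive_2": cars[5:],
    }
```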

Factorial experiments are designed to draw conclusions about more than one factor, or variable. The term factorial is used to indicate that all possible combinations of the factors are considered. For instance, if there are two factors with a levels for factor 1 and b levels for factor 2, the experiment will involve collecting data on a × b treatment combinations. The factorial design can be extended to experiments involving more than two factors and experiments involving partial factorial designs.

A computational procedure frequently used to analyze the data from an experimental study employs a statistical procedure known as the analysis of variance. For a single-factor experiment, this procedure uses a hypothesis test concerning equality of treatment means to determine if the factor has a statistically significant effect on the response variable. For experimental designs involving multiple factors, a test for the significance of each individual factor as well as interaction effects caused by one or more factors acting jointly can be made. Further discussion of the analysis of variance procedure is contained in the subsequent section.

Regression and correlation analysis

Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables. A model of the relationship is hypothesized, and estimates of the parameter values are used to develop an estimated regression equation. Various tests are then employed to determine if the model is satisfactory. If the model is deemed satisfactory, the estimated regression equation can be used to predict the value of the dependent variable given values for the independent variables.

In simple linear regression, the model used to describe the relationship between a single dependent variable y and a single independent variable x is y = β0 + β1x + ε. β0 and β1 are referred to as the model parameters, and ε is a probabilistic error term that accounts for the variability in y that cannot be explained by the linear relationship with x. If the error term were not present, the model would be deterministic; in that case, knowledge of the value of x would be sufficient to determine the value of y.

In multiple regression analysis, the model for simple linear regression is extended to account for the relationship between the dependent variable y and p independent variables x1, x2, . . ., xp. The general form of the multiple regression model is y = β0 + β1x1 + β2x2 + . . . + βpxp + ε. The parameters of the model are β0, β1, . . ., βp, and ε is the error term.

Either a simple or multiple regression model is initially posed as a hypothesis concerning the relationship among the dependent and independent variables. The least squares method is the most widely used procedure for developing estimates of the model parameters. For simple linear regression, the least squares estimates of the model parameters β0 and β1 are denoted b0 and b1. Using these estimates, an estimated regression equation is constructed: ŷ = b0 + b1x. The graph of the estimated regression equation for simple linear regression is a straight line approximation to the relationship between y and x.

[Figure 4: scatter diagram with estimated regression equation]

As an illustration of regression analysis and the least squares method, suppose a university medical centre is investigating the relationship between stress and blood pressure. Assume that both a stress test score and a blood pressure reading have been recorded for a sample of 20 patients. The data are shown graphically in Figure 4, called a scatter diagram. Values of the independent variable, stress test score, are given on the horizontal axis, and values of the dependent variable, blood pressure, are shown on the vertical axis. The line passing through the data points is the graph of the estimated regression equation: ŷ = 42.3 + 0.49x. The parameter estimates, b0 = 42.3 and b1 = 0.49, were obtained using the least squares method.

A primary use of the estimated regression equation is to predict the value of the dependent variable when values for the independent variables are given. For instance, given a patient with a stress test score of 60, the predicted blood pressure is 42.3 + 0.49(60) = 71.7. The values predicted by the estimated regression equation are the points on the line in Figure 4, and the actual blood pressure readings are represented by the points scattered about the line. The difference between the observed value of y and the value of y predicted by the estimated regression equation is called a residual. The least squares method chooses the parameter estimates such that the sum of the squared residuals is minimized.

A commonly used measure of the goodness of fit provided by the estimated regression equation is the coefficient of determination. Computation of this coefficient is based on the analysis of variance procedure that partitions the total variation in the dependent variable, denoted SST, into two parts: the part explained by the estimated regression equation, denoted SSR, and the part that remains unexplained, denoted SSE.

The measure of total variation, SST, is the sum of the squared deviations of the dependent variable about its mean: Σ(y − ȳ)². This quantity is known as the total sum of squares. The measure of unexplained variation, SSE, is referred to as the residual sum of squares. For the data in Figure 4, SSE is the sum of the squared distances from each point in the scatter diagram to the estimated regression line: Σ(y − ŷ)². SSE is also commonly referred to as the error sum of squares. A key result in the analysis of variance is that SSR + SSE = SST.

The ratio r² = SSR/SST is called the coefficient of determination. If the data points are clustered closely about the estimated regression line, the value of SSE will be small and SSR/SST will be close to 1. Using r², whose values lie between 0 and 1, provides a measure of goodness of fit; values closer to 1 imply a better fit. A value of r² = 0 implies that there is no linear relationship between the dependent and independent variables.

When expressed as a percentage, the coefficient of determination can be interpreted as the percentage of the total sum of squares that can be explained using the estimated regression equation. For the stress-level research study, the value of r² is 0.583; thus, 58.3% of the total sum of squares can be explained by the estimated regression equation ŷ = 42.3 + 0.49x. For typical data found in the social sciences, values of r² as low as 0.25 are often considered useful. For data in the physical sciences, r² values of 0.60 or greater are frequently found.
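
The full computation, from the least squares fit to the sum-of-squares decomposition and r², fits in a few lines of numpy. The data below are invented for illustration; they are not the 20 patient measurements behind Figure 4.

```python
import numpy as np

# Illustrative stress test scores (x) and blood pressure readings (y).
x = np.array([40, 45, 50, 55, 60, 65, 70, 75, 80, 85], dtype=float)
y = np.array([60, 66, 65, 70, 72, 74, 76, 75, 83, 84], dtype=float)

b1, b0 = np.polyfit(x, y, deg=1)   # least squares slope b1 and intercept b0
y_hat = b0 + b1 * x                # points on the estimated regression line

sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
sse = np.sum((y - y_hat) ** 2)     # residual (error) sum of squares
ssr = sst - sse                    # explained sum of squares: SSR + SSE = SST
r_squared = ssr / sst

print(f"y-hat = {b0:.1f} + {b1:.2f}x, r^2 = {r_squared:.3f}")
```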

In a regression study, hypothesis tests are usually conducted to assess the statistical significance of the overall relationship represented by the regression model and to test for the statistical significance of the individual parameters. The statistical tests used are based on the following assumptions concerning the error term: (1) ε is a random variable with an expected value of 0, (2) the variance of ε is the same for all values of x, (3) the values of ε are independent, and (4) ε is a normally distributed random variable.

The mean square due to regression, denoted MSR, is computed by dividing SSR by a number referred to as its degrees of freedom; in a similar manner, the mean square due to error, MSE, is computed by dividing SSE by its degrees of freedom. An F-test based on the ratio MSR/MSE can be used to test the statistical significance of the overall relationship between the dependent variable and the set of independent variables. In general, large values of F = MSR/MSE support the conclusion that the overall relationship is statistically significant. If the overall model is deemed statistically significant, statisticians will usually conduct hypothesis tests on the individual parameters to determine if each independent variable makes a significant contribution to the model.


The Design and Statistical Analysis of Animal Experiments


Simon T. Bate (GlaxoSmithKline) and Robin A. Clark (Huntingdon Life Sciences)

Book description

Written for animal researchers, this book provides a comprehensive guide to the design and statistical analysis of animal experiments. It has long been recognised that the proper implementation of these techniques helps reduce the number of animals needed. By using real-life examples to make them more accessible, this book explains the statistical tools employed by practitioners. A wide range of design types are considered, including block, factorial, nested, cross-over, dose-escalation and repeated measures designs, and techniques are introduced to analyse the experimental data generated. Each analysis technique is described in non-mathematical terms, helping readers without a statistical background to understand key techniques such as t-tests, ANOVA, repeated measures, analysis of covariance, multiple comparison tests, non-parametric and survival analysis. This is also the first text to describe technical aspects of InVivoStat, a powerful open-source software package developed by the authors to enable animal researchers to analyse their data and obtain informative results.

'At last, a readable statistics book focusing solely on preclinical experimental designs, data and its analysis that should form part of an in-vivo scientist’s personal library. The authors’ unique insight into the statistical needs of preclinical scientists has allowed them to compile a non-technical guide that can facilitate sound experimental design, meaningful data analysis and appropriate scientific conclusions. I would also encourage all readers to download and explore 'InVivoStat', a powerful software package that both my group and I use on a daily basis.'

Darrel J. Pemberton - Janssen Research and Development

'This book provides an indispensable reference for any in-vivo scientist. It addresses common pitfalls in animal experiments and provides tangible advice to address sources of bias, thus increasing the robustness of the data. … The text links experimental design and statistical analysis in a practical way, easily accessible without any prior statistical knowledge. The statistical concepts are described in plain English, avoiding overuse of mathematical formulas and illustrated with numerous examples relevant to biomedical scientists. … This book will help scientists improve the design of animal experiments and give them the confidence to use more complex designs, enabling more efficient use of animals and reducing the number of experimental animals needed overall.'

Nathalie Percie du Sert - National Centre for the Replacement, Refinement and Reduction of Animals in Research

'This book will transform the way biomedical scientists plan their work and interpret their results. Although the subject matter covers complex points, it is easy to read and packed with relevant examples. There are two particularly striking features. First, at no point do the authors resort to mathematical equations as a substitute for explaining the concepts. Secondly, they explain why the choice of experimental design is so important, why the design affects the statistical analysis and how to ensure the choice of the most appropriate statistical test. The final section describes how to use InvivoStat (a software package, assembled by the authors), which enables researchers to put into practice all the points covered in this book. This is an invaluable combination of resources that should be within easy reach of anyone carrying out experiments in the biomedical sciences, especially if their work involves using live animals.'

Clare Stanford - University College London


Contents

  • Preface (pp xiii-xiv)
  • Acknowledgments (pp xv-xvi)
  • 1 - Introduction (pp 1-17)
  • 2 - Statistical Concepts (pp 18-29)
  • 3 - Experimental Design (pp 30-121)
  • 4 - Randomisation (pp 122-131)
  • 5 - Statistical Analysis (pp 132-237)
  • 6 - Analysis Using InVivoStat (pp 238-292)
  • 7 - Conclusion (pp 293-294)
  • Glossary (pp 295-296)
  • References (pp 297-302)
  • Index (pp 303-310)


CRAN Task View: Design of Experiments (DoE) & Analysis of Experimental Data

Ulrike Groemping, Tyler Morgan-Wall
ulrike.groemping at bht-berlin.de
2023-04-05
Suggestions and improvements for this task view are very welcome and can be made through issues or pull requests on GitHub or via e-mail to the maintainer address. For further details see the Contributing guide.
Ulrike Groemping, Tyler Morgan-Wall (2023). CRAN Task View: Design of Experiments (DoE) & Analysis of Experimental Data. Version 2023-04-05. URL https://CRAN.R-project.org/view=ExperimentalDesign.
The packages from this task view can be installed automatically using the ctv package. For example, ctv::install.views("ExperimentalDesign", coreOnly = TRUE) installs all the core packages, and ctv::update.views("ExperimentalDesign") installs all packages that are not yet installed and up-to-date. See the CRAN Task View Initiative for more details.

This task view collects information on R packages for experimental design and analysis of data from experiments. Packages that focus on analysis only and do not make relevant contributions for design creation are not considered in the scope of this task view. Please feel free to suggest enhancements, and please send information on new packages or major package updates if you think they belong here, either via e-mail to the maintainers or by submitting an issue or pull request in the GitHub repository linked above.

Experimental design is applied in many areas, and methods have been tailored to the needs of various fields. This task view starts out with a section on the historically earliest application area, agricultural experimentation. Subsequently, it covers the most general packages, continues with specific sections on industrial experimentation, computer experiments, and experimentation in the clinical trials contexts (this section is going to be removed eventually; experimental design packages for clinical trials will be integrated into the clinical trials task view), and closes with a section on various special experimental design packages that have been developed for other specific purposes. Of course, the division into fields is not always clear-cut, and some packages from the more specialized sections can also be applied in general contexts.

You may also notice that the maintainers’ experience is mainly from industrial experimentation (in a broad sense), which may explain a somewhat biased view on things. Volunteers for co-maintaining are welcome.

Experimental designs for agricultural and plant breeding experiments

Package agricolae is by far the most-used package from this task view (status: October 2017). It offers extensive functionality on experimental design especially for agricultural and plant breeding experiments, which can also be useful for other purposes. It supports planning of lattice designs, factorial designs, randomized complete block designs, completely randomized designs, (Graeco-)Latin square designs, balanced incomplete block designs and alpha designs. There are also various analysis facilities for experimental data, e.g. treatment comparison procedures and several non-parametric tests, but also some quite specialized possibilities for specific types of experiments. Package desplot is made for plotting the layout of agricultural experiments. Package agridat offers a large repository of useful agricultural data sets.
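As a brief illustration, a randomized complete block design can be generated with agricolae along the following lines (the treatment names and seed are invented for the example):

```r
library(agricolae)
# Five hypothetical treatments in four blocks; the seed fixes the
# randomization so the layout is reproducible.
trt <- c("V1", "V2", "V3", "V4", "V5")
rcbd <- design.rcbd(trt, r = 4, seed = 42)
head(rcbd$book)   # field book: plot, block, treatment
```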

Experimental designs for general purposes

There are a few packages for creating and analyzing experimental designs for general purposes: First of all, the standard (generalized) linear model functions in the base package stats are of course very important for analyzing data from designed experiments (especially functions lm() , aov() and the methods and functions for the resulting linear model objects). These are concisely explained in Kuhnert and Venables (2005, p. 109 ff.); Vikneswaran (2005) points out specific usages for experimental design (using function contrasts() , multiple comparison functions and some convenience functions like model.tables() , replications() and plot.design() ). Lawson (2014) is a good introductory textbook on experimental design in R, which gives many example applications. Lalanne (2012) provides an R companion to the well-known book by Montgomery (2005); he so far covers approximately the first ten chapters; he does not include R’s design generation facilities, but mainly discusses the analysis of existing designs. Package GAD handles general balanced analysis of variance models with fixed and/or random effects and also nested effects (the latter can only be random); they quote Underwood (1997) for this work. The package is quite valuable, as many users have difficulties with using the R packages for handling random or mixed effects. Package ez aims at supporting intuitive analysis and visualization of factorial experiments based on package “ggplot2”.
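As a minimal sketch of this base-stats workflow, with simulated (not real) data:

```r
# Simulated balanced two-factor experiment analysed with base stats.
set.seed(1)
d <- expand.grid(A = factor(1:2), B = factor(1:3), rep = 1:4)
d$y <- rnorm(nrow(d), mean = as.numeric(d$A) + as.numeric(d$B))
fit <- aov(y ~ A * B, data = d)
summary(fit)                        # ANOVA table
model.tables(fit, type = "means")   # cell and marginal means
replications(y ~ A * B, data = d)   # confirm the design is balanced
```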

  • Package AlgDesign creates full factorial designs with or without additional quantitative variables, creates mixture designs (i.e., designs where the levels of factors sum to 1 = 100%; only lattice designs are created) and creates D-, A-, or I-optimal designs exactly or approximately, possibly with blocking, using the Federov (1972) algorithm (see the sketch after this list).
  • Package skpr (Morgan-Wall and Khoury, 2021) also provides optimal designs (D, I, A, Alias, G, T, or E optimal); a selection of the optimality criteria can also be used for the stepwise creation of split-plot designs. The package can also assess the power of designs and display diagnostic plots.
  • Package OptimalDesign likewise calculates unblocked D-, A-, or I-optimal designs (they use “IV-optimal” instead of “I-optimal”) exactly or approximately, treating quantitative variables only, including mixture designs; this package uses different algorithms (e.g. Atkinson, Donev and Tobias 2007, Harman and Filova 2014), some of which rely on the availability of the gurobi software ( http://www.gurobi.com/ , free for academics and academic institutions) and its accompanying R package “gurobi” (not on CRAN).
  • Package ICAOD implements the “Imperialist Competitive Algorithm for Optimal Designs” for nonlinear models according to Masoudi, Holling and Wong (2016). Package PopED provides optimal designs for nonlinear mixed effect models.
  • There are various further packages that deal with optimal designs of different types: Package rodd provides T-optimal designs, also called optimal discriminating designs (Dette, Melas and Shpilev 2013, Dette, Melas and Guchenko 2014), Package acebayes calculates optimal Bayesian designs using an approximate coordinate exchange algorithm, package OBsMD provides “Objective Bayesian Model Discrimination in Follow-Up Designs” according to Consonni and Deldossi (2015). Further optimal design packages for very specific purposes are listed at the end of this view.
  • Package conf.design allows to create a design with certain interaction effects confounded with blocks (function conf.design() ) and allows to combine existing designs in several ways (e.g., useful for Taguchi’s inner and outer array designs in industrial experimentation).
  • The archived package “planor” allows the generation of regular fractional factorial designs with fixed and mixed levels and quite flexible randomization structures. The package’s flexibility comes at the price of a certain complexity and - for larger designs - high computing time. It is listed here in spite of being archived on CRAN, because it still works and can create some designs that cannot be created by any other package.
  • Package ibd creates and analyses incomplete block designs. Packages PGM2 , RPPairwiseDesign and CombinS all produce designs related to (resolvable) (partially) balanced incomplete block designs. Package PBIBD also provides experts with some series of partially balanced incomplete block designs.
  • Package crossdes creates and analyses cross-over designs of various types (including latin squares, mutually orthogonal latin squares and Youden squares) that can for example be used in sensometrics. Package Crossover also provides crossover designs; it offers designs from the literature and algorithmic designs, makes use of the functionality in crossdes and in addition provides a GUI.
  • Package DoE.base provides full factorial designs with or without blocking (function fac.design) and orthogonal arrays (function oa.design) for main effects experiments (those listed by Kuhfeld 2009 up to 144 runs, plus a few additional ones). There is also some functionality for assessing the quality of orthogonal arrays, related to Groemping and Xu (2014) and Groemping (2017), and some analysis functionality with half-normal effects plots in quite general form (Groemping 2015). Package DoE.base also forms the basis of a suite of related packages: together with FrF2 (cf. below) and DoE.wrapper, it provides the workhorse of the GUI package RcmdrPlugin.DoE (beta version; tutorial available in Groemping 2011), which integrates design of experiments functionality into the R-Commander (package “Rcmdr”, Fox 2005) for the benefit of those R users who cannot or do not want to do command line programming. The role of package DoE.wrapper in that suite is to wrap functionality from other packages into the input and output structure of the package suite (so far for response surface designs with package rsm (cf. also below), design of computer experiments with packages lhs and DiceDesign (cf. also below), and D-optimal designs with package AlgDesign (cf. also above)).
  • Package DoE.MIParray creates optimized orthogonal arrays (or even supersaturated arrays) for factorial experiments. Arrays created with this package can be used as input to function oa.design of package DoE.base . Note, however, that the package is only useful in combination with at least one of the commercial optimizers Gurobi (R-package gurobi delivered with the software) or Mosek (R-package Rmosek downloadable from the vendor (an outdated version is on CRAN)).
  • Package dae provides various utility functions around experimental design and manipulating R factors, e.g. a routine for randomizing (according to Bailey 1981) most crossed and nested structures, a function that can produce, for any design, a skeleton-ANOVA table that displays the confounding and aliasing inherent in the design, and functions for plotting designs using R package “ggplot2”. Furthermore, the package provides post-processing of objects returned by the aov() function.
  • Package daewr accompanies the book Design and Analysis of Experiments with R by Lawson (2014) and does not only provide data sets from the book but also some standalone functionality that is not available elsewhere in R, e.g. definitive screening designs.
  • Package OPDOE accompanies the book Optimal Experimental Design with R by Rasch et al. (2011). It has some interesting sample size estimation functionality, but is almost unusable without the book (the first edition of which I would not recommend buying).
  • Package blockTools assigns units to blocks in order to end up with homogeneous sets of blocks in case of too small block sizes and offers further functionality for randomization and reporting; package blocksdesign permits the creation of nested block structures.
  • There are several packages for determining sample sizes in experimental contexts, some of them quite general, others very specialized. All of these are mentioned here: packages powerbydesign and easypower deal with estimating the power, sample size and/or effect size for factorial experiments. Package JMdesign deals with the power for the special situation of jointly modeling longitudinal and survival data, package PwrGSD with the power for group sequential designs, package powerGWASinteraction with the power for interactions in genome wide association studies, package ssizeRNA with sample size for RNA sequencing experiments, and package ssize.fdr for sample sizes in microarray experiments (requesting a certain power while limiting the false discovery rate).
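As promised in the first bullet above, here is a minimal sketch of a D-optimal design with AlgDesign; the candidate set, model formula and run size are arbitrary choices for illustration:

```r
library(AlgDesign)
# Candidate set: 3x3 full factorial in two quantitative variables.
cand <- gen.factorial(levels = 3, nVars = 2, varNames = c("A", "B"))
# D-optimal 8-run design for a full quadratic model in A and B.
des <- optFederov(~ quad(A, B), data = cand, nTrials = 8, criterion = "D")
des$design
```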

Experimental designs for industrial experiments

Some further packages especially handle designs for industrial experiments that are often highly fractionated, intentionally confounded and have few extra degrees of freedom for error.

Fractional factorial 2-level designs are particularly important in industrial experimentation.

  • Package FrF2 (Groemping 2014) is the most comprehensive R package for their creation. It generates regular fractional factorial designs for factors with 2 levels as well as Plackett-Burman type screening designs. Regular fractional factorials default to maximum resolution minimum aberration designs and can be customized in various ways, supported by an incorporated catalogue of designs (including the designs catalogued by Chen, Sun and Wu 1993, and further larger designs catalogued in Block and Mee 2005 and Xu 2009; the additional package FrF2.catlg128 provides a very large complete catalogue for resolution IV 128 run designs with up to 23 factors for special purposes). Analysis-wise, FrF2 provides simple graphical analysis tools (normal and half-normal effects plots (modified from BsMD, cf. below), main effects plots and interaction plot matrices similar to those in Minitab software, and a cube plot for the combinations of three factors). It can also show the alias structure for regular fractional factorials of 2-level factors, regardless of whether they have been created with the package or not. Fractional factorial 2-level plans can also be created by other R packages, namely BHH2, or with a little bit more complication by packages conf.design or AlgDesign. Package ALTopt provides optimal designs for accelerated life testing. A short FrF2 usage sketch follows this list.
  • Package BHH2 accompanies the 2nd edition of the book by Box, Hunter and Hunter and provides various of its data sets. It can generate full and fractional factorial two-level-designs from a number of factors and a list of defining relations (function ffDesMatrix() , less comfortable than package FrF2). It also provides several functions for analyzing data from 2-level factorial experiments: The function anovaPlot assesses effect sizes relative to residuals, and the function lambdaPlot() assesses the effect of Box-Cox transformations on statistical significance of effects.
  • BsMD provides Bayesian charts as proposed by Box and Meyer (1986) as well as effects plots (normal, half-normal and Lenth) for assessing which effects are active in a fractional factorial experiment with 2-level factors.
  • Package unrepx provides a battery of methods for the assessment of effect estimates from unreplicated factorial experiments, including many of the effects plots also present in other packages, but also further possibilities.
  • The small package FMC provides factorial designs with minimal number of level changes; the package does not take any measures to account for the statistical implications this may imply. Thus, using this package must be considered very risky for many experimental situations, because in many experiments some variability is caused by level changes. For such situations (and they are the rule rather than the exception), minimizing the level changes without taking precautions in the analysis will yield misleading results.
  • Package pid accompanies an online book by Dunn (2010-2016) and also makes heavy use of the Box, Hunter and Hunter book; it provides various data sets, which are mostly from fractional factorial 2-level designs.
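The FrF2 usage sketch referenced above could look as follows; the run size, number of factors and factor names are invented for illustration:

```r
library(FrF2)
# Regular 2^(5-2) fractional factorial in 8 runs (resolution III).
plan <- FrF2(nruns = 8, nfactors = 5,
             factor.names = c("Temp", "Press", "Time", "Conc", "Stir"))
plan
design.info(plan)$aliased   # alias structure of the chosen fraction
```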

Apart from tools for planning and analysing factorial designs, R also offers support for response surface optimization for quantitative factors (cf. e.g. Myers and Montgomery 1995):

  • Package rsm supports sequential optimization with first order and second order response surface models (central composite or Box-Behnken designs), offering optimization approaches like steepest ascent and visualization of the response function for linear model objects. Coding for response surface investigations is also facilitated. A brief rsm sketch follows this list.
  • Package DoE.wrapper enhances design creation from package rsm with the possibilities of automatically choosing the cube portion of central composite designs and of augmenting an existing (fractional) factorial 2-level design with a star portion.
  • The small package rsurface provides rotatable central composite designs for which the user specifies the minimum and maximum of the experimental variables instead of the corner points of the cube.
  • The small package minimalRSD provides central composite and Box-Behnken designs with minimal number of level changes; the package does not take any measures to account for the statistical implications this may imply. Thus, using this package must be considered very risky for many experimental situations, because in many experiments some variability is caused by level changes. For such situations (and they are the rule rather than the exception), minimizing the level changes without taking precautions in the analysis will yield misleading results.
  • Package OptimaRegion provides functionality for inspecting the optimal region of a response surface for quadratic polynomials and thin-plate spline models and can compute a confidence interval for the distance between two optima.
  • Package vdg creates variance dispersion graphs (Vining 1993) using Monte Carlo sampling.
  • Package EngrExpt provides a collection of data sets from the book Introductory Statistics for Engineering Experimentation by Nelson, Coffin and Copeland (2003).
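The rsm sketch referenced above: a small central composite design with a simulated response and a second-order fit (all settings are illustrative assumptions, not recommendations):

```r
library(rsm)
set.seed(7)
# Central composite design in two coded factors, run as a single block.
des <- ccd(2, n0 = 3, alpha = "rotatable", oneblock = TRUE)
# Simulated response with a maximum inside the experimental region.
des$y <- with(des, 5 + 2*x1 - x2 - x1^2 - 0.5*x2^2 +
                   rnorm(nrow(des), sd = 0.2))
fit <- rsm(y ~ SO(x1, x2), data = des)
summary(fit)   # second-order fit, lack of fit, stationary point analysis
```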

In some industries, mixtures of ingredients are important; these require special designs, because the quantitative factors have a fixed total. Mixture designs are handled by packages AlgDesign (function gen.mixture, lattice designs only) and mixexp (several small functions for simplex centroid, simplex lattice and extreme vertices designs as well as for plotting).

Occasionally, supersaturated designs can be useful. The two small packages mkssd and mxkssd provide fixed level and mixed level k-circulant supersaturated designs. The aforementioned package DoE.MIParray can also provide (small!) supersaturated arrays (by choosing resolution II), but requires the presence of at least one of the commercial optimizers Gurobi or Mosek .

Experimental designs for computer experiments

Computer experiments with quantitative factors require special types of experimental designs: it is often possible to include many different levels of the factors, and replication will usually not be beneficial. Also, the experimental region is often too large to assume that a linear or quadratic model adequately represents the phenomenon under investigation. Consequently, it is desirable to fill the experimental space with points as well as possible (space-filling designs) in such a way that each run provides additional information even if some factors turn out to be irrelevant. The lhs package provides latin hypercube designs for this purpose. Furthermore, the package provides ways to analyse such computer experiments with emphasis on what follow-up experiments to conduct. Another package with similar orientation is the DiceDesign package, which adds further ways to construct space-filling designs and some measures to assess the quality of designs for computer experiments. The package DiceKriging provides the kriging methodology which is often used for creating meta models from computer experiments, the package DiceEval creates and evaluates meta models (among others Kriging ones), and the package DiceView provides facilities for viewing sections of multidimensional meta models.
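For instance, a maximin Latin hypercube for a hypothetical computer experiment with three inputs scaled to [0, 1] can be generated with lhs:

```r
library(lhs)
set.seed(123)
X <- maximinLHS(n = 20, k = 3)   # 20 runs, 3 inputs, maximin criterion
colnames(X) <- c("x1", "x2", "x3")
pairs(X)   # quick visual check of the space-filling property
```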

Package MaxPro provides maximum projection designs as introduced by Joseph, Gul and Ba (2015). Package SLHD provides optimal sliced latin hypercube designs according to Ba et al. (2015), and package sFFLHD provides sliced full factorial-based latin hypercube designs according to Duan et al. (2017). Package simrel allows creation of designs for computer experiments according to the multi-level binary replacement (MBR) strategy by Martens et al. (2010). Package minimaxdesign (archived) provides minimax designs and minimax projection designs according to Mak and Joseph (2016). Package SOAs provides stratum (aka strong) orthogonal arrays by various authors, as described in Grömping (2021) and references therein.

Package tgp is another package dedicated to planning and analysing computer experiments. Here, emphasis is on Bayesian methods. The package can for example be used with various kinds of (surrogate) models for sequential optimization, e.g. with an expected improvement criterion for optimizing a noisy blackbox target function. Packages plgp and dynaTree enhance the functionality offered by tgp with particle learning facilities and learning for dynamic regression trees.

Package BatchExperiments is also designed for computer experiments, in this case specifically for experiments with algorithms to be run under different scenarios. The package is described in a technical report by Bischl et al. (2012).

Experimental designs for clinical trials

This task view only covers specific design of experiments packages (which will eventually also be removed here); there may be some grey areas. Please, also consult the ClinicalTrials task view.

  • Package experiment contains tools for clinical experiments, e.g., a randomization tool, and it provides a few special analysis options for clinical trials.
  • Package ThreeArmedTrials (archived) provides design and analysis tools for three-armed superiority or non-inferiority trials. Beside the standard functionality, the package includes the negative Binomial response situation discussed in Muetze et al. (2016).
  • Package gsDesign implements group sequential designs, package GroupSeq gives a GUI for probability spending in such designs, and package OptGS provides near-optimal balanced group sequential designs. Package seqDesign handles group sequential two-stage treatment efficacy trials with time-to-event endpoints.
  • Package binseqtest handles sequential single arm binary response trials.
  • Package asd implements adaptive seamless designs (see e.g. Parsons et al. 2012).
  • Packages bcrm and crmPack offer Bayesian CRM designs.
  • Package MAMS offers designs for multi-arm multi stage studies.
  • Package BOIN provides Bayesian optimal interval designs, which are used in phase I clinical trials for finding the maximum tolerated dose.
  • The DoseFinding package provides functions for the design and analysis of dose-finding experiments (for example pharmaceutical Phase II clinical trials); it combines the facilities of the “MCPMod” package (maintenance discontinued; described in Bornkamp, Pinheiro and Bretz 2009) with a special type of optimal designs for dose finding situations (MED-optimal designs, or D-optimal designs, or a mixture of both; cf., Dette et al. 2008).
  • Package TEQR provides toxicity equivalence range designs (Blanchard and Longmate 2010) for phase I clinical trials, package pipe.design so-called product of independent beta probabilities dose escalation (PIPE) designs for phase I. Package dfcrm provides designs for classical or TITE continual reassessment trials in phase I.
  • Packages dfcomb and dfmta provide phase I/II adaptive dose-finding designs for combination studies or single-agent molecularly targeted agent, respectively.
  • Packages ph2bayes and ph2bye are concerned with Bayesian single arm phase II trials.
  • Package sp23design claims to offer seamless integration of phase II to III.

Experimental designs for special purposes

Various further packages handle special situations in experimental design:

  • Package desirability provides ways to combine several target criteria into a desirability function in order to simplify multi-criteria analysis.
  • Package osDesign supports the design of studies nested in observational studies; package designmatch can also be useful for this purpose.
  • Packages optbdmaeAT, optrcdmaeAT and soptdmaeA provide optimal block designs, optimal row-column designs, and sequential optimal or near-optimal block or row-column designs for two-colour cDNA microarray experiments, with optimality according to an A-, MV-, D- or E-criterion.
  • Package docopulae implements optimal designs for copula models according to Perrone and Mueller (2016).
  • Package MBHdesign provides spatially balanced designs, allowing the inclusion of prespecified (legacy) sites. The more elaborate package geospt allows to optimize spatial networks of sampling points (see e.g. Santacruz, Rubiano and Melo 2014).
  • Package SensoMineR contains special designs for sensometric studies, e.g., for the triangle test.
  • Package choiceDes creates choice designs with emphasis on discrete choice models and MaxDiff functionality; it is based on optimal designs. Package idefix provides D-efficient designs for discrete choice experiments based on the multinomial logit model, and individually adapted designs for the mixed multinomial logit model (Crabbe et al. 2014). Package support.CEs provides tools for creating stated choice designs for market research investigations, based on orthogonal arrays.
  • Package odr creates optimal designs for cluster randomized trials under condition- and unit-specific cost structures.

Key references for packages in this task view

  • Atkinson, A.C. and Donev, A.N. (1992). Optimum Experimental Designs. Oxford: Clarendon Press.
  • Atkinson, A.C., Donev, A.N. and Tobias, R.D. (2007). Optimum Experimental Designs, with SAS. Oxford University Press, Oxford.
  • Ba, S., Brenneman, W.A. and Myers, W.R. (2015). Optimal Sliced Latin Hypercube Designs. Technometrics 57 479-487.
  • Bailey, R.A. (1981). A unified approach to design of experiments. Journal of the Royal Statistical Society, Series A 144 214-223.
  • Ball, R.D. (2005). Experimental Designs for Reliable Detection of Linkage Disequilibrium in Unstructured Random Population Association Studies. Genetics 170 859-873.
  • Bischl, B., Lang, M., Mersmann, O., Rahnenfuehrer, J. and Weihs, C. (2012). Computing on high performance clusters with R: Packages BatchJobs and BatchExperiments . Technical Report 1/2012 , TU Dortmund, Germany.
  • Blanchard, M.S. and Longmate, J.A. (2010). Toxicity equivalence range design (TEQR): A practical Phase I design. Contemporary Clinical Trials doi:10.1016/j.cct.2010.09.011.
  • Block, R. and Mee, R. (2005). Resolution IV Designs with 128 Runs. Journal of Quality Technology 37 282-293.
  • Bornkamp B., Pinheiro J. C., and Bretz, F. (2009). MCPMod: An R Package for the Design and Analysis of Dose-Finding Studies . Journal of Statistical Software 29 (7) 1-23.
  • Box, G. E. P., Hunter, W. G. and Hunter, J. S. (2005). Statistics for Experimenters (2nd edition). New York: Wiley.
  • Box, G. E. P. and Meyer, R. D. (1986). An Analysis for Unreplicated Fractional Factorials. Technometrics 28 11-18.
  • Box, G. E. P. and Meyer, R. D. (1993). Finding the Active Factors in Fractionated Screening Experiments. Journal of Quality Technology 25 94-105.
  • Chasalow, S., Brand, R. (1995). Generation of Simplex Lattice Points. Journal of the Royal Statistical Society, Series C 44 534-545.
  • Chen, J., Sun, D.X. and Wu, C.F.J. (1993). A catalogue of 2-level and 3-level orthogonal arrays. International Statistical Review 61 131-145.
  • Consonni, G. and Deldossi, L. (2015). Objective Bayesian model discrimination in follow-up experimental designs. TEST. DOI: 10.1007/s11749-015-0461-3.
  • Collings, B. J. (1989). Quick Confounding. Technometrics 31 107-110.
  • Cornell, J. (2002). Experiments with Mixtures . Third Edition. Wiley.
  • Crabbe, M., Akinc, D. and Vandebroek, M. (2014). Fast algorithms to generate individualized designs for the mixed logit choice model. Transportation Research Part B: Methodological 60 , 1-15.
  • Daniel, C. (1959). Use of Half Normal Plots in Interpreting Two Level Experiments. Technometrics 1 311-340.
  • Derringer, G. and Suich, R. (1980). Simultaneous Optimization of Several Response Variables. Journal of Quality Technology 12 214-219.
  • Dette, H., Bretz, F., Pepelyshev, A. and Pinheiro, J. C. (2008). Optimal Designs for Dose Finding Studies. Journal of the American Statisical Association 103 1225-1237.
  • Dette, H., Melas, V.B. and Shpilev, P. (2013). Robust T-optimal discriminating designs. The Annals of Statistics 41 1693-1715.
  • Dette H., Melas V.B. and Guchenko R. (2014). Bayesian T-optimal discriminating designs. ArXiv link .
  • Duan, W., Ankenman, B.E. Sanchez, S.M. and Sanchez, P.J. (2017). Sliced Full Factorial-Based Latin Hypercube Designs as a Framework for a Batch Sequential Design Algorithm. Technometrics 59 , 11-22.
  • Dunn, K. (2010-2016). Process Improvement Using Data . Online book.
  • Federov, V.V. (1972). Theory of Optimal Experiments. Academic Press, New York.
  • Fox, J. (2005). The R Commander: A Basic-Statistics Graphical User Interface to R . Journal of Statistical Software 14 (9) 1-42.
  • Gramacy, R.B. (2007). tgp: An R Package for Bayesian Nonstationary, Semiparametric Nonlinear Regression and Design by Treed Gaussian Process Models . Journal of Statistical Software 19 (9) 1-46.
  • Groemping, U. (2011). Tutorial for designing experiments using the R package RcmdrPlugin.DoE . Reports in Mathematics, Physics and Chemistry , Department II, Beuth University of Applied Sciences Berlin.
  • Groemping, U. (2014). R Package FrF2 for Creating and Analysing Fractional Factorial 2-Level Designs. Journal of Statistical Software 56 (1) 1-56.
  • Groemping, U. (2015). Augmented Half Normal Effects Plots in the Presence of a Few Error Degrees of Freedom. Quality and Reliability Engineering International 31 , 1185-1196. DOI: 10.1002/qre.1842.
  • Groemping, U. (2017). Frequency Tables for the Coding Invariant Quality Assessment of Factorial Designs. IISE Transactions 49 , 505-517.
  • Groemping, U. and Xu, H. (2014). Generalized resolution for orthogonal arrays. The Annals of Statistics 42 918-939.
  • Groemping, U. (2021). A unified implementation of stratum (aka strong) orthogonal arrays. Report 01/2021 , Department II, BHT Berlin.
  • Harman, R. and Filova, L. (2014). Computing efficient exact designs of experiments using integer quadratic programming. Computational Statistics and Data Analysis 71 1159-1167.
  • Hoaglin D., Mosteller F. and Tukey J. (eds., 1991). Fundamentals of Exploratory Analysis of Variance . Wiley, New York.
  • Jones, B. and Kenward, M.G. (1989). Design and Analysis of Cross-Over Trials . Chapman and Hall, London.
  • Johnson, M.E., Moore L.M. and Ylvisaker D. (1990). Minimax and maximin distance designs. Journal of Statistical Planning and Inference 26 131-148.
  • Joseph, V. R., Gul, E., and Ba, S. (2015). Maximum Projection Designs for Computer Experiments. Biometrika 102 371-380.
  • Kuhfeld, W. (2009). Orthogonal arrays. Website courtesy of SAS Institute Inc., accessed August 4th 2010. URL http://support.sas.com/techsup/technote/ts723.html .
  • Kuhnert, P. and Venables, B. (2005) An Introduction to R: Software for Statistical Modelling & Computing . URL http://CRAN.R-project.org/doc/contrib/Kuhnert+Venables-R_Course_Notes.zip . (PDF document (about 360 pages) of lecture notes in combination with the data sets and R scripts)
  • Kunert, J. (1998). Sensory Experiments as Crossover Studies. Food Quality and Preference 9 243-253.
  • Lalanne, C. (2012). R Companion to Montgomery's Design and Analysis of Experiments. Manuscript, downloadable at URL http://www.aliquote.org/articles/tech/dae/dae.pdf . (The file accompanies the book by Montgomery 2005 (cf. below).)
  • Lawson, J. (2014). Design and Analysis of Experiments with R. Chapman and Hall/CRC, Boca Raton.
  • Lenth, R.V. (1989). Quick and Easy Analysis of Unreplicated Factorials. Technometrics 31 469-473.
  • Lenth, R.V. (2009). Response-Surface Methods in R, Using rsm . Journal of Statistical Software 32 (7) 1-17.
  • Mak, S., and Joseph, V.R. (2016). Minimax designs using clustering. Journal of Computational and Graphical Statistics . In revision.
  • Martens, H., Mage, I., Tondel, K., Isaeva, J., Hoy, M. and Saebo, S. (2010). Multi-level binary replacement (MBR) design for computer experiments in high-dimensional nonlinear systems, J. Chemom. 24 748-756.
  • Masoudi, E., Holling, H. and Wong, W.-K. (2016). Application of imperialist competitive algorithm to find minimax and standardized maximin optimal designs. Computational Statistics and Data Analysis , in press. DOI: 10.1016/j.csda.2016.06.014
  • Mee, R. (2009). A Comprehensive Guide to Factorial Two-Level Experimentation. Springer, New York.
  • Montgomery, D. C. (2005, 6th ed.). Design and Analysis of Experiments. Wiley, New York.
  • Morgan-Wall T, Khoury G (2021). Optimal Design Generation and Power Evaluation in R: The skpr Package. Journal of Statistical Software , 99 (1), 1-36. doi: 10.18637/jss.v099.i01.
  • Muetze,T., Munk, A. and Friede, T. (2016). Design and analysis of three-arm trials with negative binomially distributed endpoints. Statistics in Medicine 35 (4) 505-521.
  • Myers, R. H. and Montgomery, D. C. (1995). Response Surface Methodology: Process and Product Optimization Using Designed Experiments. Wiley, New York.
  • Nelson, P.R., Coffin, M. and Copeland, K.A.F. (2003). Introductory Statistics for Engineering Experimentation. Academic Press, San Diego.
  • Parsons, N., Friede, T., Todd, S., Valdes Marquez, E., Chataway, J., Nicholas, R. and Stallard, N. (2012). An R package for implementing simulations for seamless phase II/III clinical trials using early outcomes for treatment selection. Computational Statistics and Data Analysis 56, 1150-1160.
  • Perrone, E. and Mueller, W.G. (2016) Optimal designs for copula models, Statistics 50 (4), 917-929. DOI: 10.1080/02331888.2015.1111892
  • Plackett, R.L. and Burman, J.P. (1946). The design of optimum multifactorial experiments. Biometrika 33 305-325.
  • Rasch, D., Pilz, J., Verdooren, L.R. and Gebhardt, A. (2011). Optimal Experimental Design with R. Chapman and Hall/CRC. (caution, does not live up to its title!)
  • Rosenbaum, P. (1989). Exploratory Plots for Paired Data. The American Statistician 43 108-109.
  • Sacks, J., Welch, W.J., Mitchell, T.J. and Wynn, H.P. (1989). Design and analysis of computer experiments. Statistical Science 4 409-435.
  • Santacruz, A., Rubiano, Y., Melo, C., 2014. Evolutionary optimization of spatial sampling networks designed for the monitoring of soil carbon. In: Hartemink, A., McSweeney, K. (Eds.). Soil Carbon. Series: Progress in Soil Science. (pp. 77-84). Springer, New York.
  • Santner T.J., Williams B.J. and Notz W.I. (2003). The Design and Analysis of Computer Experiments. Springer, New York.
  • Sen S, Satagopan JM and Churchill GA (2005). Quantitative Trait Locus Study Design from an Information Perspective. Genetics 170 447-464.
  • Stein, M. (1987). Large Sample Properties of Simulations Using Latin Hypercube Sampling. Technometrics 29 143-151.
  • Stocki, R. (2005). A Method to Improve Design Reliability Using Optimal Latin Hypercube Sampling. Computer Assisted Mechanics and Engineering Sciences 12 87-105.
  • Underwood, A.J. (1997). Experiments in Ecology: Their Logical Design and Interpretation Using Analysis of Variance. Cambridge University Press, Cambridge.
  • Vikneswaran (2005). An R companion to “Experimental Design”. URL http://CRAN.R-project.org/doc/contrib/Vikneswaran-ED_companion.pdf . (The file accompanies the book “Experimental Design with Applications in Management, Engineering and the Sciences” by Berger and Maurer, 2002.)
  • Vining, G. (1993). A Computer Program for Generating Variance Dispersion Graphs. Journal of Quality Technology 25 45-58. Corrigendum in the same volume, pp. 333-335.
  • Xu, H. (2009). Algorithmic Construction of Efficient Fractional Factorial Designs With Large Run Sizes. Technometrics 51 262-277.
  • Yin, J., Qin, R., Ezzalfani, M., Sargent, D. J., and Mandrekar, S. J. (2017). A Bayesian dose-finding design incorporating toxicity data from multiple treatment cycles. Statistics in Medicine 36 , 67-80. doi: 10.1002/sim.7134.


Related links

  • Dunn, K. (2010-2016). Process Improvement Using Data.
  • Kuhnert, P. and Venables, B. (2005) An Introduction to R: Software for Statistical Modelling & Computing . (~4MB)
  • Vikneswaran (2005). An R companion to “Experimental Design”.


3.3 - Experimental Design Terminology

In experimental design terminology, the "experimental unit" is randomized to the treatment regimen and receives the treatment directly. The "observational unit" has measurements taken on it. In most clinical trials, the experimental units and the observational units are one and the same, namely, the individual patient.

One exception to this is a community intervention trial in which communities, e.g., geographic regions, are randomized to treatments. For example, communities (experimental units) might be randomized to receive different formulations of a vaccine, whereas the effects are measured directly on the subjects (observational units) within the communities. The advantages here are strictly logistical - it is simply easier to implement in this fashion. Another example occurs in reproductive toxicology experiments in which female rodents are exposed to a treatment (experimental units) but measurements are taken on the pups (observational units).

In experimental design terminology, factors are variables that are controlled and varied during the course of the experiment. For example, treatment is a factor in a clinical trial with experimental units randomized to treatment. Another example is pressure and temperature as factors in a chemical experiment.

Most clinical trials are structured as one-way designs , i.e., only one factor, treatment, with a few levels.

Temperature and pressure in the chemical experiment are two factors that comprise a two-way design in which it is of interest to examine various combinations of temperature and pressure. Some clinical trials may have a two-way factorial design , such as in oncology where various combinations of doses of two chemotherapeutic agents comprise the treatments. An incomplete factorial design may be useful if it is inappropriate to assign subjects to some of the possible treatment combinations, such as no treatment (double placebo). We will study factorial designs in a later lesson.

A parallel design refers to a study in which patients are randomized to a treatment and remain on that treatment throughout the course of the trial. This is a typical design. In contrast, with a crossover design patients are randomized to a sequence of treatments and they cross over from one treatment to another during the course of the trial. Each treatment occurs in a time period with a washout period in between. Crossover designs are of interest since, with each patient serving as their own control, there is potential for reduced variability. However, there are potential problems with this type of design. There should be investigation into possible carry-over effects, i.e., the residual effects of the previous treatment affecting the subject's response in the later treatment period. In addition, only conditions that are likely to be similar in both treatment periods are amenable to crossover designs. Acute health problems that do not recur are not well-suited for a crossover study. We will study crossover design in a later lesson.

Randomization is used to remove systematic error (bias) and to justify Type I error probabilities in experiments. Randomization is recognized as an essential feature of clinical trials for removing selection bias.

Selection bias occurs when a physician decides treatment assignment and systematically selects a certain type of patient for a particular treatment. Suppose the trial consists of an experimental therapy and a placebo. If the physician assigns healthier patients to the experimental therapy and the less healthy patients to the placebo, the study could result in an invalid conclusion that the experimental therapy is very effective.

Blocking and stratification are used to control unwanted variation. For example, suppose a clinical trial is structured to compare treatments A and B in patients between the ages of 18 and 65. Suppose that the younger patients tend to be healthier. It would be prudent to account for this in the design by stratifying with respect to age. One way to achieve this is to construct age groups of 18-30, 31-50, and 51-65 and to randomize patients to treatment within each age group.

Age group    Treatment A    Treatment B
18-30        12             13
31-50        23             23
51-65        6              7

It is not necessary to have the same number of patients within each age stratum. We do, however, want to have a balance in the number on each treatment within each age group. This is accomplished by blocking, in this case, within the age strata. Blocking is a restriction of the randomization process that results in a balance of numbers of patients on each treatment after a prescribed number of randomizations. For example, blocks of 4 within these age strata would mean that after 4, 8, 12, etc. patients in a particular age group had entered the study, the numbers assigned to each treatment within that stratum would be equal.
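A minimal sketch of permuted-block randomization within strata, using base R only (block size 4 as in the example above; the stratum labels and list lengths are assumptions for illustration):

```r
set.seed(2024)
# Each block of 4 contains exactly two A's and two B's in random order.
permuted_blocks <- function(n_blocks, block_size = 4) {
  unlist(lapply(seq_len(n_blocks), function(i)
    sample(rep(c("A", "B"), each = block_size / 2))))
}
# One independent randomization list per age stratum.
strata <- c("18-30", "31-50", "51-65")
lists <- setNames(lapply(strata, function(s) permuted_blocks(10)), strata)
table(lists[["18-30"]][1:4])   # first block: exactly two on each treatment
```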

If the numbers are large enough within a stratum, a planned subgroup analysis may be performed. In this example, the smaller numbers of patients in the upper and lower age groups would require care in the analyses of those subgroups specifically. However, since the primary question concerns the effect of treatment regardless of age, the main analysis would use the pooled data, in which each subgroup is represented in a balanced fashion.

Even ineffective treatments can appear beneficial in some patients. This may be due to random fluctuations, or variability in the disease. If, however, the improvement is due to the patient’s expectation of a positive response, this is called a " placebo effect ". This is especially problematic when the outcome is subjective, such as pain or symptom assessment. The placebo effect is widely recognized and must be removed in any clinical trial. For example, rather than constructing a nonrandomized trial in which all patients receive an experimental therapy, it is better to randomize patients to receive either the experimental therapy or a placebo. A true placebo is an inert or inactive treatment that mimics the route of administration of the real treatment, e.g., a sugar pill.

Placebos are not acceptable ethically in many situations, e.g., in surgical trials (although there have been instances where 'sham' surgical procedures took place as the 'placebo' control). When an accepted treatment already exists for a serious illness such as cancer, the control must be an active treatment. In other situations, a true placebo is not physically possible to attain. For example, a few trials investigating dimethyl sulfoxide (DMSO) for providing muscle pain relief were conducted in the 1970s and 1980s. DMSO is rubbed onto the area of muscle pain but leaves a garlicky taste in the mouth, so it was difficult to develop a placebo.

Treatment masking or blinding is an effective way to ensure objectivity of the person measuring the outcome variables. Masking is especially important when the measurements are subjective or based on self-assessment. Double-masked trials refer to studies in which both investigators and patients are masked to the treatment. Single-masked trials refer to the situation when only patients are masked. In some studies, statisticians are masked to treatment assignment when performing the initial statistical analyses, i.e., not knowing which group received the treatment and which is the control until analyses have been completed. Even a safety-monitoring committee may be masked to the identity of treatment A or B, until there is an observed trend or difference that should evoke a response from the monitors. In executing a masked trial, great care must be taken to keep the treatment allocation schedule securely hidden from all except those with a need to know which medications are active and which are placebo. This could be limited to the producers of the study medications, and possibly the safety monitoring board before study completion. There is always a provision for breaking the blind for a particular patient in an emergency situation.

As with placebos, masking, although highly desirable, is not always possible. For example, one could not mask a surgeon to the procedure he is to perform. Even so, some have gone to great lengths to achieve masking. For example, a few trials with cardiac pacemakers have consisted of every eligible patient undergoing a surgical procedure to be implanted with the device. The device was "turned on" in patients randomized to the treatment group and "turned off" in patients randomized to the control group. The surgeon was not aware of which devices would be activated.

Investigators often underestimate the importance of masking as a design feature. This is because they believe that biases are small in relation to the magnitude of the treatment effects (when the converse usually is true), or that they can compensate for their prejudice and subjectivity.

Confounding is the effect of other relevant factors on the outcome that may be incorrectly attributed to the difference between study groups.

Here is an example: An investigator plans to assign 10 patients to treatment and 10 patients to control. There will be a one-week follow-up on each patient. The first 10 patients will be assigned treatment on March 01 and the next 10 patients will be assigned control on March 15. The investigator may observe a significant difference between treatment and control, but is it due to different environmental conditions between early March and mid-March? The obvious way to correct this would be to randomize 5 patients to treatment and 5 patients to control on March 01, followed by another 5 patients to treatment and 5 patients to control on March 15.

Validity

A trial is said to possess internal validity if the observed difference in outcome between the study groups is real and not due to bias, chance, or confounding. Randomized, placebo-controlled, double-blinded clinical trials have high levels of internal validity.

External validity in a human trial refers to how well study results can be generalized to a broader population. External validity is irrelevant if internal validity is low. External validity in randomized clinical trials is enhanced by using broad eligibility criteria when recruiting patients.

Large simple and pragmatic trials emphasize external validity. A large simple trial attempts to discover small advantages of a treatment that is expected to be used in a large population. Large numbers of subjects are enrolled in a study with simplified design and management. There is an implicit assumption that the treatment effect is similar for all subjects with the simplified data collection. In a similar vein, a pragmatic trial emphasizes the effect of a treatment in practices outside academic medical centers and involves a broad range of clinical practices.

Studies of equivalency and noninferiority have different objectives than the usual trial which is designed to demonstrate superiority of a new treatment to a control. A study to demonstrate non-inferiority aims to show that a new treatment is not worse than an accepted treatment in terms of the primary response variable by more than a pre-specified margin. A study to demonstrate equivalence has the objective of demonstrating the response to the new treatment is within a prespecified margin in both directions. We will learn more about these studies when we explore sample size calculations.
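As a rough sketch of the non-inferiority logic, assuming binary outcomes and a hypothetical 5-percentage-point margin (all numbers below are invented):

```r
margin <- 0.05                    # assumed non-inferiority margin
x <- c(new = 86, control = 88)    # responders per arm (hypothetical)
n <- c(new = 100, control = 100)  # patients randomized per arm
ci <- prop.test(x, n)$conf.int    # CI for p_new - p_control
# Non-inferiority is concluded if the lower CI bound exceeds -margin.
ci[1] > -margin
```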

The Statistical Analysis of Experimental Data (Dover Books on Mathematics), by John Mandel

The increasing importance in laboratory situations of minutely precise measurements presents the chemist and physicist with numerous problems in data analysis. National Bureau of Standards statistics consultant John Mandel here draws a clear and fascinating blueprint for a systematic science of statistical analysis — geared to the particular needs of the physical scientist, with approach and examples aimed specifically at the statistical problems he is likely to confront. The first third of The Statistical Analysis of Experimental Data comprises a thorough grounding in the fundamental mathematical definitions, concepts, and facts underlying modern statistical theory — math knowledge beyond basic algebra, calculus, and analytic geometry is not required. Remaining chapters deal with statistics as an interpretative tool that can enable the laboratory researcher to determine his most effective methodology. You'll find lucid, concise coverage of over 130 topics, including elements of measurement; nature of statistical analysis; design/analysis of experiments; statistics as diagnostic tool; precision and accuracy; testing statistical models; between-within classifications; two-way classifications; sampling (principles, objectives, methods); fitting of non-linear models; measurement of processes; components of variance; nested designs; the sensitivity ratio, and much more. Also included are many examples, each worked in step-by-step fashion; nearly 200 helpful figures and tables; and concluding chapter summaries followed by references for further study. Mandel argues that, when backed by an understanding of its theoretic framework, statistics offers researchers "not only a powerful tool for the interpretation of experiments but also a task of real intellectual gratification." The Statistical Analysis of Experimental Data provides the physical scientist with the explanations and models he requires to impress this invaluable tool into service.


Product details

  • Publisher: Dover Publications; Later Printing edition (September 1, 1984)
  • Language: English
  • Paperback: 432 pages
  • ISBN-10: 0486646661
  • ISBN-13: 978-0486646664
  • Item weight: 1 pound
  • Dimensions: 5.5 x 1 x 8.5 inches




Choosing the Right Statistical Test | Types & Examples

Published on January 28, 2020 by Rebecca Bevans. Revised on June 22, 2023.

Statistical tests are used in hypothesis testing . They can be used to:

  • determine whether a predictor variable has a statistically significant relationship with an outcome variable.
  • estimate the difference between two or more groups.

Statistical tests assume a null hypothesis of no relationship or no difference between groups. Then they determine whether the observed data fall outside of the range of values predicted by the null hypothesis.

If you already know what types of variables you’re dealing with, you can use the flowchart to choose the right statistical test for your data.

[Image: statistical tests flowchart]

Table of contents

  • What does a statistical test do?
  • When to perform a statistical test
  • Choosing a parametric test: regression, comparison, or correlation
  • Choosing a nonparametric test
  • Flowchart: choosing a statistical test
  • Other interesting articles
  • Frequently asked questions about statistical tests

What does a statistical test do?

Statistical tests work by calculating a test statistic – a number that describes how much the relationship between variables in your test differs from the null hypothesis of no relationship.

It then calculates a p value (probability value). The p value estimates how likely it is that you would see the difference described by the test statistic if the null hypothesis of no relationship were true.

If the value of the test statistic is more extreme than the statistic calculated from the null hypothesis, then you can infer a statistically significant relationship between the predictor and outcome variables.

If the value of the test statistic is less extreme than the one calculated from the null hypothesis, then you can infer no statistically significant relationship between the predictor and outcome variables.
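To make these two outputs concrete, here is a minimal Python sketch using SciPy; the groups, values, and the 0.05 threshold are invented purely for illustration, not taken from any real study.

```python
# Minimal sketch: an independent-samples t test produces a test
# statistic and a p value (invented example data).
from scipy import stats

group_a = [4.1, 5.0, 4.8, 5.3, 4.6, 5.1]
group_b = [5.6, 5.9, 5.4, 6.2, 5.8, 6.0]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p value means data this extreme would rarely occur if the
# null hypothesis of no difference were true.
if p_value < 0.05:
    print("Statistically significant difference between the groups")
```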

When to perform a statistical test

You can perform statistical tests on data that have been collected in a statistically valid manner – either through an experiment , or through observations made using probability sampling methods .

For a statistical test to be valid , your sample size needs to be large enough to approximate the true distribution of the population being studied.

To determine which statistical test to use, you need to know:

  • whether your data meets certain assumptions.
  • the types of variables that you’re dealing with.

Statistical assumptions

Statistical tests make some common assumptions about the data they are testing:

  • Independence of observations (a.k.a. no autocorrelation): The observations/variables you include in your test are not related (for example, multiple measurements of a single test subject are not independent, while measurements of multiple different test subjects are independent).
  • Homogeneity of variance : the variance within each group being compared is similar among all groups. If one group has much more variation than others, it will limit the test’s effectiveness.
  • Normality of data : the data follows a normal distribution (a.k.a. a bell curve). This assumption applies only to quantitative data .

If your data do not meet the assumptions of normality or homogeneity of variance, you may be able to perform a nonparametric statistical test , which allows you to make comparisons without any assumptions about the data distribution.

If your data do not meet the assumption of independence of observations, you may be able to use a test that accounts for structure in your data (repeated-measures tests or tests that include blocking variables).
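If you want to check these assumptions before choosing a test, the sketch below shows one common way to do it in Python with SciPy; the sample values are invented, and the Shapiro-Wilk and Levene tests used here are standard choices rather than the only options.

```python
# Minimal sketch: checking normality and homogeneity of variance
# before running a parametric test (invented example data).
from scipy import stats

group_a = [4.1, 5.0, 4.8, 5.3, 4.6, 5.1, 4.9, 5.2]
group_b = [5.6, 5.9, 5.4, 6.2, 5.8, 6.0, 5.7, 6.1]

# Shapiro-Wilk: the null hypothesis is that the sample is normal.
for name, sample in (("A", group_a), ("B", group_b)):
    _, p = stats.shapiro(sample)
    print(f"Group {name}: Shapiro-Wilk p = {p:.3f}")

# Levene: the null hypothesis is that the groups have equal variances.
_, p = stats.levene(group_a, group_b)
print(f"Levene p = {p:.3f}")

# Small p values are evidence against an assumption, suggesting that a
# nonparametric test or a correction may be the safer choice.
```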

Types of variables

The types of variables you have usually determine what type of statistical test you can use.

Quantitative variables represent amounts of things (e.g. the number of trees in a forest). Types of quantitative variables include:

  • Continuous (aka ratio variables): represent measures and can usually be divided into units smaller than one (e.g. 0.75 grams).
  • Discrete (aka integer variables): represent counts and usually can’t be divided into units smaller than one (e.g. 1 tree).

Categorical variables represent groupings of things (e.g. the different tree species in a forest). Types of categorical variables include:

  • Ordinal : represent data with an order (e.g. rankings).
  • Nominal : represent group names (e.g. brands or species names).
  • Binary : represent data with a yes/no or 1/0 outcome (e.g. win or lose).

Choose the test that fits the types of predictor and outcome variables you have collected (if you are doing an experiment , these are the independent and dependent variables ). Consult the tables below to see which test best matches your variables.

Choosing a parametric test: regression, comparison, or correlation

Parametric tests usually have stricter requirements than nonparametric tests, and are able to make stronger inferences from the data. They can only be conducted with data that adheres to the common assumptions of statistical tests.

The most common types of parametric test include regression tests, comparison tests, and correlation tests.

Regression tests

Regression tests look for cause-and-effect relationships . They can be used to estimate the effect of one or more continuous variables on another variable.

  • Simple linear regression. Predictor: one continuous variable. Outcome: one continuous variable. Example: What is the effect of income on longevity?
  • Multiple linear regression. Predictors: two or more continuous variables. Outcome: one continuous variable. Example: What is the effect of income and minutes of exercise per day on longevity?
  • Logistic regression. Predictor: continuous. Outcome: binary. Example: What is the effect of drug dosage on the survival of a test subject?
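As a rough illustration of the first row, the sketch below fits a simple linear regression with SciPy; the income and longevity values are invented for demonstration.

```python
# Minimal sketch: simple linear regression of longevity on income
# (invented example data).
from scipy import stats

income = [20, 35, 50, 65, 80, 95]       # thousands of dollars
longevity = [72, 74, 75, 78, 79, 82]    # years

result = stats.linregress(income, longevity)
print(f"slope     = {result.slope:.3f} years per $1,000")
print(f"p value   = {result.pvalue:.4f}")   # significance of the slope
print(f"r squared = {result.rvalue ** 2:.3f}")
```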

Comparison tests

Comparison tests look for differences among group means . They can be used to test the effect of a categorical variable on the mean value of some other characteristic.

T-tests are used when comparing the means of precisely two groups (e.g., the average heights of men and women). ANOVA and MANOVA tests are used when comparing the means of more than two groups (e.g., the average heights of children, teenagers, and adults).

  • Paired t-test. Predictor: categorical (two paired groups). Outcome: quantitative. Example: What is the effect of two different test prep programs on the average exam scores for students from the same class?
  • Independent t-test. Predictor: categorical (two independent groups). Outcome: quantitative. Example: What is the difference in average exam scores for students from two different schools?
  • ANOVA. Predictor: categorical (three or more groups). Outcome: quantitative. Example: What is the difference in average pain levels among post-surgical patients given three different painkillers?
  • MANOVA. Predictor: categorical. Outcomes: two or more quantitative variables. Example: What is the effect of flower species on petal length, petal width, and stem length?
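The sketch below runs the two t-test variants and a one-way ANOVA from this table in Python with SciPy; all of the scores are invented for demonstration.

```python
# Minimal sketch: comparison tests on invented example data.
from scipy import stats

school_1 = [78, 85, 69, 90, 74, 88]
school_2 = [82, 91, 77, 95, 84, 89]

# Independent t-test: two unrelated groups.
print(stats.ttest_ind(school_1, school_2))

# Paired t-test: two measurements on the same subjects.
before = [60, 72, 65, 80, 70, 75]
after = [68, 75, 70, 85, 74, 80]
print(stats.ttest_rel(before, after))

# One-way ANOVA: three or more groups.
drug_a, drug_b, drug_c = [3, 4, 2, 5], [6, 5, 7, 6], [4, 4, 5, 3]
print(stats.f_oneway(drug_a, drug_b, drug_c))
```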

Correlation tests

Correlation tests check whether variables are related without hypothesizing a cause-and-effect relationship.

These can be used to test whether two variables you want to use in (for example) a multiple regression test are autocorrelated.

  • Pearson’s r. Variables: two quantitative variables. Example: How are latitude and temperature related?
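For instance, a Pearson correlation can be computed in one call with SciPy, as in the sketch below; the latitude and temperature values are invented for demonstration.

```python
# Minimal sketch: Pearson's r between two quantitative variables
# (invented example data).
from scipy import stats

latitude = [10, 20, 30, 40, 50, 60]      # degrees
temperature = [27, 24, 20, 15, 10, 4]    # degrees Celsius

r, p = stats.pearsonr(latitude, temperature)
print(f"r = {r:.2f}, p = {p:.4f}")  # a strong negative correlation here
```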

Choosing a nonparametric test

Non-parametric tests don’t make as many assumptions about the data, and are useful when one or more of the common statistical assumptions are violated. However, the inferences they make aren’t as strong as with parametric tests.

  • Spearman’s r: use in place of Pearson’s r.
  • Sign test: use in place of a one-sample t-test.
  • Kruskal–Wallis H: use in place of ANOVA.
  • ANOSIM: use in place of MANOVA.
  • Wilcoxon rank-sum test: use in place of the independent t-test.
  • Wilcoxon signed-rank test: use in place of the paired t-test.
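The nonparametric substitutes in this table are also available in SciPy, as the sketch below shows on invented data.

```python
# Minimal sketch: common nonparametric tests (invented example data).
from scipy import stats

group_a = [3, 5, 4, 6, 2, 7]
group_b = [8, 6, 9, 7, 10, 8]
group_c = [5, 6, 4, 7, 5, 6]

# Mann-Whitney U (Wilcoxon rank-sum): replaces the independent t-test.
print(stats.mannwhitneyu(group_a, group_b))

# Wilcoxon signed-rank: replaces the paired t-test.
before = [12, 15, 11, 14, 13, 16]
after = [14, 16, 13, 15, 15, 18]
print(stats.wilcoxon(before, after))

# Kruskal-Wallis H: replaces the one-way ANOVA.
print(stats.kruskal(group_a, group_b, group_c))
```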

Flowchart: choosing a statistical test

This flowchart helps you choose among parametric tests. For nonparametric alternatives, check the table above.

[Image: flowchart for choosing the right statistical test]

Other interesting articles

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

Statistics

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient
  • Null hypothesis

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Frequently asked questions about statistical tests

Statistical tests commonly assume that:

  • the data are normally distributed
  • the groups that are being compared have similar variance
  • the data are independent

If your data does not meet these assumptions, you might still be able to use a nonparametric statistical test, which has fewer requirements but also makes weaker inferences.

A test statistic is a number calculated by a statistical test. It describes how far your observed data is from the null hypothesis of no relationship between variables or no difference among sample groups.

The test statistic tells you how different two or more groups are from the overall population mean , or how different a linear slope is from the slope predicted by a null hypothesis . Different test statistics are used in different statistical tests.

Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test . Significance is usually denoted by a p -value , or probability value.

Statistical significance is arbitrary: it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that data as extreme as yours would occur less than 5% of the time if the null hypothesis were true.

When the p -value falls below the chosen alpha value, then we say the result of the test is statistically significant.

Quantitative variables are any variables where the data represent amounts (e.g. height, weight, or age).

Categorical variables are any variables where the data represent groups. This includes rankings (e.g. finishing places in a race), classifications (e.g. brands of cereal), and binary outcomes (e.g. coin flips).

You need to know what type of variables you are working with to choose the right statistical test for your data and interpret your results .

Discrete and continuous variables are two types of quantitative variables :

  • Discrete variables represent counts (e.g. the number of objects in a collection).
  • Continuous variables represent measurable amounts (e.g. water volume or weight).



Title: Statistical Mechanical Analysis of Gaussian Processes

Abstract: In this paper, we analyze Gaussian processes using statistical mechanics. Although the input is originally multidimensional, we simplify our model by considering the input as one-dimensional for statistical mechanical analysis. Furthermore, we employ periodic boundary conditions as an additional modeling approach. By using periodic boundary conditions, we can diagonalize the covariance matrix. The diagonalized covariance matrix is then applied to Gaussian processes. This allows for a statistical mechanical analysis of Gaussian processes using the derived diagonalized matrix. We indicate that the analytical solutions obtained in this method closely match the results from simulations.
Comments: 12 pages, 3 figures
Subjects: Statistical Mechanics (cond-mat.stat-mech); Data Analysis, Statistics and Probability (physics.data-an)



The Beginner's Guide to Statistical Analysis | 5 Steps & Examples

Statistical analysis means investigating trends, patterns, and relationships using quantitative data . It is an important research tool used by scientists, governments, businesses, and other organisations.

To draw valid conclusions, statistical analysis requires careful planning from the very start of the research process . You need to specify your hypotheses and make decisions about your research design, sample size, and sampling procedure.

After collecting data from your sample, you can organise and summarise the data using descriptive statistics . Then, you can use inferential statistics to formally test hypotheses and make estimates about the population. Finally, you can interpret and generalise your findings.

This article is a practical introduction to statistical analysis for students and researchers. We’ll walk you through the steps using two research examples. The first investigates a potential cause-and-effect relationship, while the second investigates a potential correlation between variables.

Table of contents

  • Step 1: Write your hypotheses and plan your research design
  • Step 2: Collect data from a sample
  • Step 3: Summarise your data with descriptive statistics
  • Step 4: Test hypotheses or make estimates with inferential statistics
  • Step 5: Interpret your results
  • Frequently asked questions about statistics

Step 1: Write your hypotheses and plan your research design

To collect valid data for statistical analysis, you first need to specify your hypotheses and plan out your research design.

Writing statistical hypotheses

The goal of research is often to investigate a relationship between variables within a population . You start with a prediction, and use statistical analysis to test that prediction.

A statistical hypothesis is a formal way of writing a prediction about a population. Every research prediction is rephrased into null and alternative hypotheses that can be tested using sample data.

While the null hypothesis always predicts no effect or no relationship between variables, the alternative hypothesis states your research prediction of an effect or relationship.

  • Null hypothesis: A 5-minute meditation exercise will have no effect on math test scores in teenagers.
  • Alternative hypothesis: A 5-minute meditation exercise will improve math test scores in teenagers.
  • Null hypothesis: Parental income and GPA have no relationship with each other in college students.
  • Alternative hypothesis: Parental income and GPA are positively correlated in college students.

Planning your research design

A research design is your overall strategy for data collection and analysis. It determines the statistical tests you can use to test your hypothesis later on.

First, decide whether your research will use a descriptive, correlational, or experimental design. Experiments directly influence variables, whereas descriptive and correlational studies only measure variables.

  • In an experimental design , you can assess a cause-and-effect relationship (e.g., the effect of meditation on test scores) using statistical tests of comparison or regression.
  • In a correlational design , you can explore relationships between variables (e.g., parental income and GPA) without any assumption of causality using correlation coefficients and significance tests.
  • In a descriptive design , you can study the characteristics of a population or phenomenon (e.g., the prevalence of anxiety in U.S. college students) using statistical tests to draw inferences from sample data.

Your research design also concerns whether you’ll compare participants at the group level or individual level, or both.

  • In a between-subjects design , you compare the group-level outcomes of participants who have been exposed to different treatments (e.g., those who performed a meditation exercise vs those who didn’t).
  • In a within-subjects design , you compare repeated measures from participants who have participated in all treatments of a study (e.g., scores from before and after performing a meditation exercise).
  • In a mixed (factorial) design , one variable is altered between subjects and another is altered within subjects (e.g., pretest and posttest scores from participants who either did or didn’t do a meditation exercise).
Example: Experimental research design
First, you’ll take baseline test scores from participants. Then, your participants will undergo a 5-minute meditation exercise. Finally, you’ll record participants’ scores from a second math test. In this experiment, the independent variable is the 5-minute meditation exercise, and the dependent variable is the math test score from before and after the intervention.

Example: Correlational research design
In a correlational study, you test whether there is a relationship between parental income and GPA in graduating college students. To collect your data, you will ask participants to fill in a survey and self-report their parents’ incomes and their own GPA.

Measuring variables

When planning a research design, you should operationalise your variables and decide exactly how you will measure them.

For statistical analysis, it’s important to consider the level of measurement of your variables, which tells you what kind of data they contain:

  • Categorical data represents groupings. These may be nominal (e.g., gender) or ordinal (e.g. level of language ability).
  • Quantitative data represents amounts. These may be on an interval scale (e.g. test score) or a ratio scale (e.g. age).

Many variables can be measured at different levels of precision. For example, age data can be quantitative (8 years old) or categorical (young). If a variable is coded numerically (e.g., level of agreement from 1–5), it doesn’t automatically mean that it’s quantitative instead of categorical.

Identifying the measurement level is important for choosing appropriate statistics and hypothesis tests. For example, you can calculate a mean score with quantitative data, but not with categorical data.

In a research study, along with measures of your variables of interest, you’ll often collect data on relevant participant characteristics.

Variable Type of data
Age Quantitative (ratio)
Gender Categorical (nominal)
Race or ethnicity Categorical (nominal)
Baseline test scores Quantitative (interval)
Final test scores Quantitative (interval)
Parental income Quantitative (ratio)
GPA Quantitative (interval)

Step 2: Collect data from a sample

Population vs sample

In most cases, it’s too difficult or expensive to collect data from every member of the population you’re interested in studying. Instead, you’ll collect data from a sample.

Statistical analysis allows you to apply your findings beyond your own sample as long as you use appropriate sampling procedures . You should aim for a sample that is representative of the population.

Sampling for statistical analysis

There are two main approaches to selecting a sample.

  • Probability sampling: every member of the population has a chance of being selected for the study through random selection.
  • Non-probability sampling: some members of the population are more likely than others to be selected for the study because of criteria such as convenience or voluntary self-selection.

In theory, for highly generalisable findings, you should use a probability sampling method. Random selection reduces sampling bias and ensures that data from your sample is actually typical of the population. Parametric tests can be used to make strong statistical inferences when data are collected using probability sampling.

But in practice, it’s rarely possible to gather the ideal sample. While non-probability samples are more likely to be biased, they are much easier to recruit and collect data from. Non-parametric tests are more appropriate for non-probability samples, but they result in weaker inferences about the population.

If you want to use parametric tests for non-probability samples, you have to make the case that:

  • your sample is representative of the population you’re generalising your findings to.
  • your sample lacks systematic bias.

Keep in mind that external validity means that you can only generalise your conclusions to others who share the characteristics of your sample. For instance, results from Western, Educated, Industrialised, Rich and Democratic samples (e.g., college students in the US) aren’t automatically applicable to all non-WEIRD populations.

If you apply parametric tests to data from non-probability samples, be sure to elaborate on the limitations of how far your results can be generalised in your discussion section .

Create an appropriate sampling procedure

Based on the resources available for your research, decide on how you’ll recruit participants.

  • Will you have resources to advertise your study widely, including outside of your university setting?
  • Will you have the means to recruit a diverse sample that represents a broad population?
  • Do you have time to contact and follow up with members of hard-to-reach groups?

Example: Sampling (experimental study)
Your participants are self-selected by their schools. Although you’re using a non-probability sample, you aim for a diverse and representative sample.

Example: Sampling (correlational study)
Your main population of interest is male college students in the US. Using social media advertising, you recruit senior-year male college students from a smaller subpopulation: seven universities in the Boston area.

Calculate sufficient sample size

Before recruiting participants, decide on your sample size either by looking at other studies in your field or using statistics. A sample that’s too small may be unrepresentative of the population, while a sample that’s too large will be more costly than necessary.

There are many sample size calculators online. Different formulas are used depending on whether you have subgroups or how rigorous your study should be (e.g., in clinical research). As a rule of thumb, a minimum of 30 units per subgroup is usually necessary.

To use these calculators, you have to understand and input these key components:

  • Significance level (alpha): the risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
  • Statistical power : the probability of your study detecting an effect of a certain size if there is one, usually 80% or higher.
  • Expected effect size : a standardised indication of how large the expected result of your study will be, usually based on other similar studies.
  • Population standard deviation: an estimate of the population parameter based on a previous study or a pilot study of your own.
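If you prefer to compute the sample size directly, the sketch below uses the power module in statsmodels with the inputs listed above; the medium effect size of 0.5 is an assumed value for illustration, and because the effect size is standardised, the population standard deviation is already folded into it.

```python
# Minimal sketch: a-priori sample size for a two-group comparison
# (assumed inputs for illustration).
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,  # assumed standardised (Cohen's d) effect size
    alpha=0.05,       # significance level
    power=0.8,        # desired statistical power
)
print(f"Required sample size per group: {n_per_group:.1f}")  # ~64
```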

Step 3: Summarise your data with descriptive statistics

Once you’ve collected all of your data, you can inspect them and calculate descriptive statistics that summarise them.

Inspect your data

There are various ways to inspect your data, including the following:

  • Organising data from each variable in frequency distribution tables .
  • Displaying data from a key variable in a bar chart to view the distribution of responses.
  • Visualising the relationship between two variables using a scatter plot .

By visualising your data in tables and graphs, you can assess whether your data follow a skewed or normal distribution and whether there are any outliers or missing data.

A normal distribution means that your data are symmetrically distributed around a center where most values lie, with the values tapering off at the tail ends.

[Figure: mean, median, mode, and standard deviation in a normal distribution]

In contrast, a skewed distribution is asymmetric and has more values on one end than the other. The shape of the distribution is important to keep in mind because only some descriptive statistics should be used with skewed distributions.

Extreme outliers can also produce misleading statistics, so you may need a systematic approach to dealing with these values.

Calculate measures of central tendency

Measures of central tendency describe where most of the values in a data set lie. Three main measures of central tendency are often reported:

  • Mode : the most popular response or value in the data set.
  • Median : the value in the exact middle of the data set when ordered from low to high.
  • Mean : the sum of all values divided by the number of values.

However, depending on the shape of the distribution and level of measurement, only one or two of these measures may be appropriate. For example, many demographic characteristics can only be described using the mode or proportions, while a variable like reaction time may not have a mode at all.

Calculate measures of variability

Measures of variability tell you how spread out the values in a data set are. Four main measures of variability are often reported:

  • Range : the highest value minus the lowest value of the data set.
  • Interquartile range : the range of the middle half of the data set.
  • Standard deviation : the average distance between each value in your data set and the mean.
  • Variance : the square of the standard deviation.

Once again, the shape of the distribution and level of measurement should guide your choice of variability statistics. The interquartile range is the best measure for skewed distributions, while standard deviation and variance provide the best information for normal distributions.
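All of these descriptive measures are one-liners in Python, as the sketch below shows on invented scores (the mode is usually read off a frequency table instead).

```python
# Minimal sketch: central tendency and variability (invented data).
import numpy as np

scores = np.array([55, 62, 68, 70, 71, 74, 75, 78, 82, 90])

print("mean:  ", scores.mean())
print("median:", np.median(scores))
print("range: ", scores.max() - scores.min())

q1, q3 = np.percentile(scores, [25, 75])
print("interquartile range:", q3 - q1)

print("standard deviation:", scores.std(ddof=1))  # sample SD
print("variance:          ", scores.var(ddof=1))  # sample variance
```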

Using your table, you should check whether the units of the descriptive statistics are comparable for pretest and posttest scores. For example, are the variance levels similar across the groups? Are there any extreme values? If there are, you may need to identify and remove extreme outliers in your data set or transform your data before performing a statistical test.

Statistic | Pretest scores | Posttest scores
Mean | 68.44 | 75.25
Standard deviation | 9.43 | 9.88
Variance | 88.96 | 97.96
Range | 36.25 | 45.12
Sample size (n) | 30

From this table, we can see that the mean score increased after the meditation exercise, and the variances of the two scores are comparable. Next, we can perform a statistical test to find out if this improvement in test scores is statistically significant in the population.

Example: Descriptive statistics (correlational study)
After collecting data from 653 students, you tabulate descriptive statistics for annual parental income and GPA.

It’s important to check whether you have a broad range of data points. If you don’t, your data may be skewed towards some groups more than others (e.g., high academic achievers), and only limited inferences can be made about a relationship.

Statistic | Parental income (USD) | GPA
Mean | 62,100 | 3.12
Standard deviation | 15,000 | 0.45
Variance | 225,000,000 | 0.16
Range | 8,000–378,000 | 2.64–4.00
Sample size (n) | 653

Step 4: Test hypotheses or make estimates with inferential statistics

A number that describes a sample is called a statistic, while a number describing a population is called a parameter. Using inferential statistics, you can make conclusions about population parameters based on sample statistics.

Researchers often use two main methods (simultaneously) to make inferences in statistics.

  • Estimation: calculating population parameters based on sample statistics.
  • Hypothesis testing: a formal process for testing research predictions about the population using samples.

Estimation

You can make two types of estimates of population parameters from sample statistics:

  • A point estimate : a value that represents your best guess of the exact parameter.
  • An interval estimate : a range of values that represent your best guess of where the parameter lies.

If your aim is to infer and report population characteristics from sample data, it’s best to use both point and interval estimates in your paper.

You can consider a sample statistic a point estimate for the population parameter when you have a representative sample (e.g., in a wide public opinion poll, the proportion of a sample that supports the current government is taken as the population proportion of government supporters).

There’s always error involved in estimation, so you should also provide a confidence interval as an interval estimate to show the variability around a point estimate.

A confidence interval uses the standard error and the z score from the standard normal distribution to convey where you’d generally expect to find the population parameter most of the time.
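As a rough sketch of that calculation, the code below builds a 95% confidence interval around a sample mean from the standard error and the z score of 1.96; the sample values are invented.

```python
# Minimal sketch: 95% confidence interval for a mean (invented data).
import numpy as np

sample = np.array([72, 75, 69, 80, 77, 74, 71, 78, 76, 73])
mean = sample.mean()
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))

z = 1.96  # z score for a 95% confidence level
lower, upper = mean - z * standard_error, mean + z * standard_error
print(f"mean = {mean:.1f}, 95% CI = [{lower:.1f}, {upper:.1f}]")
```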

Hypothesis testing

Using data from a sample, you can test hypotheses about relationships between variables in the population. Hypothesis testing starts with the assumption that the null hypothesis is true in the population, and you use statistical tests to assess whether the null hypothesis can be rejected or not.

Statistical tests determine where your sample data would lie on an expected distribution of sample data if the null hypothesis were true. These tests give two main outputs:

  • A test statistic tells you how much your data differs from the null hypothesis of the test.
  • A p value tells you the likelihood of obtaining your results if the null hypothesis is actually true in the population.

Statistical tests come in three main varieties:

  • Comparison tests assess group differences in outcomes.
  • Regression tests assess cause-and-effect relationships between variables.
  • Correlation tests assess relationships between variables without assuming causation.

Your choice of statistical test depends on your research questions, research design, sampling method, and data characteristics.

Parametric tests

Parametric tests make powerful inferences about the population based on sample data. But to use them, some assumptions must be met, and only some types of variables can be used. If your data violate these assumptions, you can perform appropriate data transformations or use alternative non-parametric tests instead.

A regression models the extent to which changes in a predictor variable result in changes in the outcome variable(s).

  • A simple linear regression includes one predictor variable and one outcome variable.
  • A multiple linear regression includes two or more predictor variables and one outcome variable.

Comparison tests usually compare the means of groups. These may be the means of different groups within a sample (e.g., a treatment and control group), the means of one sample group taken at different times (e.g., pretest and posttest scores), or a sample mean and a population mean.

  • A t test is for exactly 1 or 2 groups when the sample is small (30 or fewer).
  • A z test is for exactly 1 or 2 groups when the sample is large.
  • An ANOVA is for 3 or more groups.

The z and t tests have subtypes based on the number and types of samples and the hypotheses:

  • If you have only one sample that you want to compare to a population mean, use a one-sample test .
  • If you have paired measurements (within-subjects design), use a dependent (paired) samples test .
  • If you have completely separate measurements from two unmatched groups (between-subjects design), use an independent (unpaired) samples test .
  • If you expect a difference between groups in a specific direction, use a one-tailed test .
  • If you don’t have any expectations for the direction of a difference between groups, use a two-tailed test .

The only parametric correlation test is Pearson’s r . The correlation coefficient ( r ) tells you the strength of a linear relationship between two quantitative variables.

However, to test whether the correlation in the sample is strong enough to be important in the population, you also need to perform a significance test of the correlation coefficient, usually a t test, to obtain a p value. This test uses your sample size to calculate how much the correlation coefficient differs from zero in the population.
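The conversion from a correlation coefficient to a t statistic is short enough to write out, as in the sketch below; the r and n values are invented, and the test uses n − 2 degrees of freedom.

```python
# Minimal sketch: significance test of a correlation coefficient
# (invented r and sample size).
import math
from scipy import stats

r, n = 0.35, 80
t = r * math.sqrt((n - 2) / (1 - r ** 2))
p_one_tailed = stats.t.sf(t, df=n - 2)  # one-tailed p value
print(f"t = {t:.2f}, one-tailed p = {p_one_tailed:.4f}")
```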

Example: Hypothesis testing (experimental study)
You use a dependent-samples, one-tailed t test to assess whether the meditation exercise significantly improved math test scores. The test gives you:

  • a t value (test statistic) of 3.00
  • a p value of 0.0028

Although Pearson’s r is a test statistic, it doesn’t tell you anything about how significant the correlation is in the population. You also need to test whether this sample correlation coefficient is large enough to demonstrate a correlation in the population.

A t test can also determine how significantly a correlation coefficient differs from zero based on sample size. Since you expect a positive correlation between parental income and GPA, you use a one-sample, one-tailed t test. The t test gives you:

  • a t value of 3.08
  • a p value of 0.001

Step 5: Interpret your results

The final step of statistical analysis is interpreting your results.

Statistical significance

In hypothesis testing, statistical significance is the main criterion for forming conclusions. You compare your p value to a set significance level (usually 0.05) to decide whether your results are statistically significant or non-significant.

Statistically significant results are considered unlikely to have arisen solely due to chance. There is only a very low chance of such a result occurring if the null hypothesis is true in the population.

Example: Interpret your results (experimental study)
You compare your p value of 0.0028 to your significance threshold of 0.05. Since the p value is below the threshold, you can reject the null hypothesis. This means that you believe the meditation intervention, rather than random factors, directly caused the increase in test scores.

Example: Interpret your results (correlational study)
You compare your p value of 0.001 to your significance threshold of 0.05. With a p value under this threshold, you can reject the null hypothesis. This indicates a statistically significant correlation between parental income and GPA in male college students.

Note that correlation doesn’t always mean causation, because there are often many underlying factors contributing to a complex variable like GPA. Even if one variable is related to another, this may be because of a third variable influencing both of them, or indirect links between the two variables.

Effect size

A statistically significant result doesn’t necessarily mean that there are important real life applications or clinical outcomes for a finding.

In contrast, the effect size indicates the practical significance of your results. It’s important to report effect sizes along with your inferential statistics for a complete picture of your results. You should also report interval estimates of effect sizes if you’re writing an APA style paper .

Example: Effect size (experimental study)
With a Cohen’s d of 0.72, there’s medium to high practical significance to your finding that the meditation exercise improved test scores.

Example: Effect size (correlational study)
To determine the effect size of the correlation coefficient, you compare your Pearson’s r value to Cohen’s effect size criteria.
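For a two-group comparison, Cohen's d can be computed directly from the group means and a pooled standard deviation; the sketch below reuses the pretest/posttest summary statistics from the running example, so treat it as an illustration of the formula rather than the article's exact calculation.

```python
# Minimal sketch: Cohen's d from summary statistics (values taken
# from the descriptive-statistics table in the running example).
import math

mean_pre, sd_pre = 68.44, 9.43
mean_post, sd_post = 75.25, 9.88

pooled_sd = math.sqrt((sd_pre ** 2 + sd_post ** 2) / 2)
d = (mean_post - mean_pre) / pooled_sd
print(f"Cohen's d = {d:.2f}")  # about 0.7: a medium-to-large effect
```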

Decision errors

Type I and Type II errors are mistakes made in research conclusions. A Type I error means rejecting the null hypothesis when it’s actually true, while a Type II error means failing to reject the null hypothesis when it’s false.

You can aim to minimise the risk of these errors by selecting an optimal significance level and ensuring high power . However, there’s a trade-off between the two errors, so a fine balance is necessary.

Frequentist versus Bayesian statistics

Traditionally, frequentist statistics emphasises null hypothesis significance testing and always starts with the assumption of a true null hypothesis.

However, Bayesian statistics has grown in popularity as an alternative approach in the last few decades. In this approach, you use previous research to continually update your hypotheses based on your expectations and observations.

Bayes factor compares the relative strength of evidence for the null versus the alternative hypothesis rather than making a conclusion about rejecting the null hypothesis or not.

Frequently asked questions about statistics

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses, by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

The research methods you use depend on the type of data you need to answer your research question .

  • If you want to measure something or test a hypothesis , use quantitative methods . If you want to explore ideas, thoughts, and meanings, use qualitative methods .
  • If you want to analyse a large amount of readily available data, use secondary data. If you want data specific to your purposes with control over how they are generated, collect primary data.
  • If you want to establish cause-and-effect relationships between variables , use experimental methods. If you want to understand the characteristics of a research subject, use descriptive methods.

Statistical analysis is the main method for analyzing quantitative research data . It uses probabilities and models to test predictions about a population from sample data.

More interesting articles

  • Central Limit Theorem | Formula, Definition & Examples
  • Central Tendency | Understanding the Mean, Median & Mode
  • Correlation Coefficient | Types, Formulas & Examples
  • Descriptive Statistics | Definitions, Types, Examples
  • How to Calculate Standard Deviation (Guide) | Calculator & Examples
  • How to Calculate Variance | Calculator, Analysis & Examples
  • How to Find Degrees of Freedom | Definition & Formula
  • How to Find Interquartile Range (IQR) | Calculator & Examples
  • How to Find Outliers | Meaning, Formula & Examples
  • How to Find the Geometric Mean | Calculator & Formula
  • How to Find the Mean | Definition, Examples & Calculator
  • How to Find the Median | Definition, Examples & Calculator
  • How to Find the Range of a Data Set | Calculator & Formula
  • Inferential Statistics | An Easy Introduction & Examples
  • Levels of measurement: Nominal, ordinal, interval, ratio
  • Missing Data | Types, Explanation, & Imputation
  • Normal Distribution | Examples, Formulas, & Uses
  • Null and Alternative Hypotheses | Definitions & Examples
  • Poisson Distributions | Definition, Formula & Examples
  • Skewness | Definition, Examples & Formula
  • T-Distribution | What It Is and How To Use It (With Examples)
  • The Standard Normal Distribution | Calculator, Examples & Uses
  • Type I & Type II Errors | Differences, Examples, Visualizations
  • Understanding Confidence Intervals | Easy Examples & Formulas
  • Variability | Calculating Range, IQR, Variance, Standard Deviation
  • What is Effect Size and Why Does It Matter? (Examples)
  • What Is Interval Data? | Examples & Definition
  • What Is Nominal Data? | Examples & Definition
  • What Is Ordinal Data? | Examples & Definition
  • What Is Ratio Data? | Examples & Definition
  • What Is the Mode in Statistics? | Definition, Examples & Calculator
Open access | Published: 19 September 2024

Examining gaze behavior in undergraduate students and educators during the evaluation of tooth preparation: an eye-tracking study

Frédéric Silvestri, Nabil Odisho, Abhishek Kumar & Anastasios Grigoriadis

BMC Medical Education, volume 24, Article number: 1030 (2024)


Background

Gaze behavior can serve as an objective tool in undergraduate pre-clinical dental education, helping to identify key areas of interest and common pitfalls in the routine evaluation of tooth preparations. Therefore, this study aimed to investigate the gaze behavior of undergraduate dental students and dental educators while evaluating a single crown tooth preparation.

Methods

Thirty-five participants volunteered to participate in the study and were divided into a novice group (dental students, n = 18) and an expert group (dental educators, n = 17). Each participant wore a binocular eye-tracking device, and the total duration of fixation was evaluated as a metric to study gaze behavior. Sixty photographs of twenty different tooth preparations in three different views (buccal, lingual, and occlusal) were prepared and displayed during the experimental session. The participants were asked to rate the tooth preparations on a 100 mm visual analog rating scale and were also asked to determine whether each tooth preparation was ready to make an impression. Each view was divided into different areas of interest. Statistical analysis was performed with a three-way analysis of variance model with repeated measures.

Results

Based on the participants’ mean ratings, the “best” and the “worst” tooth preparations were selected for analysis. The results showed a significantly longer time to decision in the novices compared to the experts (P = 0.003) and a significantly longer time to decision for both groups in the best tooth preparation compared to the worst tooth preparation (P = 0.002). Statistical analysis also showed a significantly longer total duration of fixations in the margin compared to all other conditions for both the buccal (P < 0.012) and lingual (P < 0.001) views.

Conclusions

The current study showed distinct differences in gaze behavior between the novices and the experts during the evaluation of single crown tooth preparation. Understanding differences in gaze behavior between undergraduate dental students and dental educators could help improve tooth preparation skills and provide constructive customized feedback.

Background

The purpose of dental education programs is to provide undergraduate students with theoretical knowledge and to develop the motor and fine motor skills needed for effective management of dental procedures in different branches of dentistry. In most dental curricula, students receive extensive pre-clinical and theoretical teaching to gain the ability to practice before taking on clinical cases. This allows students to master the motor and fine motor skills necessary for effective management of dental procedures [ 1 ]. One of the main disciplines in dentistry is prosthodontics, defined by the Glossary of Prosthodontic Terms as “the dental specialty about the diagnosis, treatment planning, rehabilitation, and maintenance of the oral function, comfort, appearance, and health of patients with clinical conditions associated with missing or deficient teeth and/or maxillofacial tissues by using biocompatible substitutes” [ 2 ]. Although most dental students can easily acquire and validate theoretical knowledge before graduation, transforming this knowledge into practical motor skills remains complex for students and challenging for teachers to evaluate.

Preclinical courses provide an opportunity to assess undergraduate students’ abilities before they manage real clinical cases with patients [ 3 ]. However, dentists, like other health care practitioners, need self-assessment skills and performance feedback to provide quality patient care. Self-assessment is described as an active process used by a student or a practitioner to objectively evaluate their knowledge, skills, and shortcomings in order to adapt and improve [ 4 , 5 ]. In the prosthodontic curriculum, theoretical knowledge allows undergraduates to identify areas of interest (e.g., finishing line, mesial-distal taper) when assessing a tooth preparation to make an objective self-assessment. Typically, a “feedback conversation” after a pre-clinical session allows dental faculty to evaluate the student’s understanding by comparing the student’s assessment with that of the educator [ 6 ]. Nevertheless, it has been shown that undergraduates tend to underrate or overrate their work, and therefore often do not improve significantly in their ability to self-evaluate [ 7 ]. Digital technologies such as intraoral scanners, software for evaluation of tooth preparation, and virtual reality have offered newer tools to enhance the learning and motor skills of undergraduate students [ 8 , 9 ]. Although undergraduate students self-assess their preclinical tooth preparations, it is difficult to identify common pitfalls in their evaluation method.

Recently, in other branches of medicine, eye-tracking technologies have been used to analyze and compare the gaze behavior of healthcare practitioners in different specialties [ 10 , 11 , 12 ]. Eye-tracking devices could also make it possible to objectively assess the areas of interest that undergraduates consider during self-assessment. An eye-tracking device is a sensor technology based on corneal reflection and stereo geometry. It allows a line-of-sight analysis by measuring different parameters of gaze behavior such as pupil diameter, number of fixations, duration of fixation, gaze path, and gaze location [ 12 ]. Moreover, this analysis can also provide information about unconscious behavior which cannot be obtained with a feedback conversation or another subjective tool such as a questionnaire [ 13 ]. In dentistry, few studies have utilized eye-tracking devices, with most focusing on analyzing visual perception [ 13 , 14 , 15 , 16 ] or interpreting radiographs [ 17 , 18 ]. However, no study has yet examined gaze behavior during undergraduate students’ evaluation of tooth preparations. Therefore, this study aimed to investigate the gaze behavior of undergraduate dental students and dental educators while evaluating a single crown tooth preparation. It was hypothesized that there would be differences in gaze behavior, specifically reflected in a shorter total duration of fixation, between undergraduate dental students and dental educators when evaluating a single crown tooth preparation.

Methods

The participants of the study were students and staff of the Department of Dental Medicine, Karolinska Institutet, Sweden. Written informed consent was obtained from all participants before participation in the study, in accordance with the Declaration of Helsinki. The project was approved by the Ethics Review Authority, Stockholm (Dnr 2023–04136-01).

Study participants

Thirty-five participants volunteered to participate in the current observational study. Participants were divided into a novice group (n = 18, mean age = 22.9 ± 1.5; age range: 22–28) consisting of undergraduate dental students in their seventh semester and an expert group consisting of dental educators (n = 17, mean age = 44.3 ± 13.0; age range: 30–74). Experts were dental educators with an average time since graduation of 19.0 ± 12.7 years and an average time in routine clinical practice of 16.7 ± 12.3 years. A power calculation was performed a priori using G*Power for an ANOVA with repeated measures and a within-between interaction, assuming a medium effect size (f) of 0.3, an α error probability of 0.05, and a desired power of 0.90; the results indicated a required total sample size of 32 participants to achieve an actual power of approximately 0.91.

Study setting

The experiment was designed following the Reporting Eye-tracking Studies In Dentistry (RESIDE) recommendations [ 19 ]. Each participant was invited to a single experimental session of about 30 min. The participants (both groups) were asked to sit comfortably on an office chair in a well-lit quiet room illuminated with regular artificial light (3000 K). A screen was placed on a desk in front of a white wall (FlexScan® EV2416W, 24.1 inches, 1920 × 1080 pixels, 50–60 Hz; Eizo Corporation, Japan). The height of the chair on which the participants were seated was adjustable, and the chair was about 0.75 to 1.0 m from the screen. The participants were asked to adjust the chair so that they could look horizontally at the screen. Each participant wore a binocular eye-tracking device (Tobii Pro Glasses 3®, Danderyd, Stockholm, Sweden). The participants were assisted by the examiner to carefully secure the wearable eye tracker like a pair of spectacles. This eye-tracking system uses a one-point calibration procedure and has a gaze position accuracy of 0.6°. Participants were also asked to wear earplugs during the experiment to ensure maximum silence. Video recordings were carried out using dedicated software (Glasses 3 controller®, Danderyd, Stockholm, Sweden: Tobii AB) (Fig. 1). During the experiment, the examiner remained inside the room, out of the direct vision of the participants, and observed the conduct of the entire experimental process as discreetly as possible.

Figure 1. Experimental setup and timeline of the experimental session.

Selection and display of images

The examiner (FS) prepared twenty samples of acrylic right first maxillary molars (Frasaco®, Franz Sachs GmbH & Co, Germany) for monolithic zirconia crowns. After satisfactory preparation, these samples were scanned using an intra-oral scanner (Cerec® Omnicam, Dentsply Sirona, Charlotte, United States). Subsequently, a software-supported evaluation of the tooth preparations was conducted using Prepcheck® (Dentsply Sirona, Charlotte, United States), and all reports were obtained. From the scan files, three high-resolution images (1920 × 1080 pixels) were selected for each of the twenty tooth preparations, showcasing the buccal, lingual, and occlusal views. This resulted in a total of twenty sets of images, each set containing three views of a single tooth preparation, amounting to a total of sixty images. These sixty images, representing twenty different tooth preparations in three views each, were displayed during the experimental session.

Experimental protocol

Participants received verbal and written instructions explaining how the experimental session would be conducted. The participants were also briefly informed about the main objectives of the study. The participants were then asked to rate all twenty sets of tooth preparations (60 pictures in total) on a 100 mm visual analog rating scale (VAS) without landmarks from “very bad” to “very good”. Additionally, the participants were asked to respond to the question “Is this tooth preparation ready for making an impression for a monolithic zirconia crown?” by choosing the answers as “yes” or “no.”

All participants first performed a "test trial" to ensure they understood the instructions correctly. Then, the twenty sets of different tooth preparations, each with buccal, lingual, and occlusal views, were presented to the participants one by one on the computer screen (Fig. 2). Although all twenty sets of images were randomly arranged, the order in which they were assessed was the same for all participants. For each set, the buccal image was displayed first, followed by the lingual and then the occlusal view. Each image (buccal, lingual, or occlusal view) was displayed for 15 s. At the end of the three views (one set), participants had 15 s to rate the preparation on a 100 mm analog rating scale. They also responded to whether they thought the tooth preparation was ready for making an impression for a monolithic zirconia crown on a subject-based feedback form. Note that participants could move on to the next slide/picture if they had made their decision with the three views of a tooth preparation before the end of their allocated fifteen seconds. The participants were also given a break of one minute after every five sets of tooth preparations.

Figure 2. Examples of tooth preparation images from the buccal, lingual, and occlusal views, showcasing the best (A, B, C) and worst (D, E, F) preparations.

Data analysis

For each participant, answers were recorded (yes = 1, no = 0) and scores for all tooth preparations were collected by measuring the mark on the line (0 to 100). All the collected video files were then analyzed using dedicated software (Tobii Pro Lab®, v 1.217; Danderyd, Stockholm, Sweden: Tobii AB). Both buccal and lingual views were divided into four areas of interest (AOI): the margin, mesial taper, distal taper, and occlusal shape. Occlusal views were divided into two AOI: the margin and the occlusal area (Fig. 3). Each AOI was outlined in the software and the fixation threshold was set at 200 ms. The data were first mapped automatically in the software, and each gaze fixation was then manually checked by the examiner. For each AOI, the values of the total duration of fixation were analyzed as a metric of gaze behavior.

Figure 3. Areas of interest drawn on the buccal (A), lingual (B), and occlusal (C) views.

Statistical analysis

The data were analyzed with SPSS (Statistical Package for the Social Sciences), version 27 (IBM Inc.). The data were checked for the assumption of normal distribution with the Shapiro–Wilk test, histograms, and QQ plots. The scores of the acceptable and unacceptable tooth preparations were compared between the groups with the Wilcoxon–Mann–Whitney U test. Further, the dichotomous (yes/no) responses to the question "Is this tooth preparation ready for making an impression for a monolithic zirconia crown?" were compared between the two groups with the chi-square test.

The total duration of fixations for the different views was evaluated with a three-way analysis of variance (ANOVA) model with repeated measures to analyze the different outcome parameters. Since the distributions of the variables were skewed, the variables were log-transformed before being subjected to the repeated measures ANOVA. To avoid the loss of zero values, a small constant was added to all the variables before their logarithmic transformation [ 10 ]. The factors in the ANOVA were group (two levels: novices and experts), photo (best and worst preparation), and condition (margin, mesial taper, distal taper, occlusal). Similarly, the duration of assessment for the different views was evaluated with a three-way repeated measures ANOVA with the factors group (novices and experts), photo (best and worst preparation), and view (buccal, lingual, and occlusal). Post hoc analysis of the significant main effects was done with the Unequal N HSD test. A P value of < 0.05 was considered statistically significant.
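As a rough illustration of the transformation described above (not the authors' actual analysis code), skewed fixation durations can be log-transformed after adding a small constant so that zero values remain defined; the durations and the constant below are invented.

```python
# Minimal sketch: log-transforming skewed fixation durations after
# adding a small constant to preserve zeros (invented values).
import numpy as np

total_fixation_ms = np.array([0, 150, 320, 900, 2400, 5100])
constant = 1.0
log_durations = np.log(total_fixation_ms + constant)
print(np.round(log_durations, 2))
```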

Results

All participants completed the entire experimental session without any difficulty. The mean scores from all the participants for all twenty tooth preparations were averaged, and the "best" and the "worst" tooth preparations were selected for analysis. The novices rated the best tooth preparation at 74.7 ± 17.8 and the experts at 69.9 ± 20.5 on the 100 mm visual analog scale. Similarly, the novices rated the worst tooth preparation at 22.1 ± 18.5 and the experts at 11.9 ± 15.0. However, there was no significant difference in the visual analog scale ratings for either the best (P = 0.478) or worst (P = 0.074) tooth preparation between the novices and experts.

Subject-based reports

Subject-based reports further showed that about 77.8% of the participants in the novice group and 64.7% in the expert group judged the best tooth preparation to be "ready for making an impression for a monolithic zirconia crown." While none of the experts agreed that the worst tooth preparation was ready for an impression, about 11.1% of the novices still considered it acceptable. However, there was no significant association between group and decision when judging either the best (P = 0.392) or the worst (P = 0.157) tooth preparation.

Buccal view

The ANOVA showed significant main effects of group (novices/experts) (P = 0.013) and condition (AOI: margin, mesial taper, distal taper, occlusal) (P < 0.050), but no significant effect of photo (best/worst tooth preparation) (P = 0.330). Post hoc analysis of the main effect of group showed a significantly longer total duration of fixations in the novices than in the experts. Post hoc analysis of the main effect of condition showed a significantly longer total duration of fixations on the margin than on all other conditions (P < 0.012) (Fig. 4).

Fig. 4 Mean and standard error of the mean of the total duration of fixation for the novice (dental students) and expert (dental educators) groups for buccal, lingual, and occlusal views for the best (A, B, C) and the worst (D, E, F) tooth preparation

The ANOVA also showed significant interactions between condition and group (P = 0.015) and between photo and condition (P < 0.001). Post hoc analysis of the condition-by-group interaction showed a significantly longer total duration of fixations in the novices than in the experts while observing the occlusal area (P < 0.006). Post hoc analysis of the photo-by-condition interaction showed a significantly longer total duration of fixations on the best tooth preparation than on the worst while observing the margin (P < 0.001) and the occlusal area (P < 0.002), but a significantly longer total duration of fixations on the worst tooth preparation than on the best while observing the mesial taper (P < 0.002), though not the distal taper (P = 0.253).

Lingual view

The ANOVA showed significant main effects of group (novices/experts) (P = 0.006), photo (best/worst tooth preparation) (P = 0.020), and condition (AOI: margin, mesial taper, distal taper, occlusal) (P < 0.001). Post hoc analysis of the main effect of group showed a significantly longer total duration of fixations in the novices than in the experts (P < 0.007). Post hoc analysis of the main effect of photo showed a significantly longer total duration of fixation on the best tooth preparation than on the worst (P < 0.03). Post hoc analysis of the main effect of condition showed a significantly longer total duration of fixations on the margin than on all other conditions (P < 0.001) (Fig. 4).

Occlusal view

The ANOVA showed significant main effects of group (novices/experts) (P < 0.001), photo (best/worst tooth preparation) (P = 0.008), and condition (AOI: margin, occlusal area) (P < 0.001). Post hoc analysis of the main effect of group showed a significantly longer total duration of fixations in the novices than in the experts (P < 0.001). Post hoc analysis of the main effect of condition showed a significantly longer total duration of fixations on the occlusal area than on the margin (P < 0.001) (Fig. 4). Post hoc analysis of the main effect of photo showed a significantly longer total duration of fixations on the best tooth preparation than on the worst (P < 0.01).

The ANOVA also showed a significant interaction between group and condition (P = 0.048). Post hoc analysis of the group-by-condition interaction showed a significantly longer total duration of fixation in the novices than in the experts while observing the margin (P < 0.001).

Duration of assessment

The total time allocated to gaze at each of the pictures was 15 s. Once the participants had observed the three views (buccal, lingual, and occlusal), they were asked to decide whether the preparation was ready for an impression and to rate the tooth preparation on the VAS.

The ANOVA showed significant main effects of group (novices/experts) (P = 0.003), photo (best/worst tooth preparation) (P = 0.002), and view (buccal, lingual, and occlusal) (P = 0.012). Post hoc analysis of the main effect of group showed a significantly longer time to decision in the novices than in the experts. Post hoc analysis of the main effect of photo showed, for both groups, a significantly longer time to decision for the best tooth preparation than for the worst. Post hoc analysis of the main effect of view showed a significantly shorter duration of observation of the occlusal view than of the buccal view.

The ANOVA also showed a significant interaction between photo and view (P = 0.005). Post hoc analysis of this interaction showed a significantly longer total duration of observation of the best tooth preparation than of the worst while observing the lingual view (P < 0.001) and the occlusal view (P = 0.003) (Fig. 5).

Fig. 5 Mean and standard error of the mean of the duration of assessment for the buccal, lingual, and occlusal views for the best (A) and the worst (B) tooth preparation by the novice (dental students) and expert (dental educators) groups

Discussion

Gaze behavior, measured through eye tracking, has been widely accepted as a key indicator of how humans process information from their surroundings and interact with the world [20]. As a result, and as mentioned above, eye tracking has emerged as a key tool in both clinical and user-experience research, as it can provide objective, quantitative data on visual attention, cognitive processes, and neurological function. In the current study, eye tracking was used to evaluate differences in gaze behavior between novices (undergraduate dental students) and experts (qualified dentists) while assessing a single crown tooth preparation. Specifically, both groups assessed predefined AOIs on buccal, lingual, and occlusal views of a "best" and a "worst" tooth preparation. In accordance with the hypothesis, the results showed significant differences in gaze behavior between the undergraduate dental students (novice group) and the dental educators (expert group) while evaluating a single crown tooth preparation. More specifically, the novices showed a significantly longer total duration of fixation than the experts in all three views (buccal, lingual, and occlusal). Overall, there were specific differences in the total duration of fixation between the novices and the experts, between the best and worst preparations, and between AOIs. The important interactions are discussed below.

Tooth preparation is the foundation of undergraduate dental education in fixed prosthetic restorations, and the results of the current study may have important implications for undergraduate pedagogical training in dental education [21]. We believe that the current study is the first to objectively analyze the gaze behavior of participants while assessing tooth preparations. Overall, our goal is to provide educators with objective tools for giving customized feedback to undergraduate students in order to improve their ability to self-assess their work in fixed prosthodontic education.

In the current study, the participants were asked to observe twenty different tooth preparations, each with a buccal, lingual, and occlusal view, on a computer screen. They were then asked to evaluate the images and determine whether each preparation was adequate for dental impression-making. The images were screenshots of scanned tooth preparations, and their quality enabled participants to perceive the details of each preparation so that the assessment was as objective as possible, in accordance with previous studies [22, 23]. Based on initial pilot testing, it was decided that displaying 60 slides, representing the three views (buccal, lingual, and occlusal) of the 20 different tooth preparations, would provide a comprehensive and detailed assessment of the monolithic zirconia crown preparations. Displaying 60 slides balanced a comprehensive evaluation against practical considerations of time and efficiency, ensuring that all significant features of the tooth preparations were adequately covered while keeping the number of images manageable. Each view captured unique details essential for accurate analysis, such as the contour, margin integrity, and overall quality of the preparation from different angles. It is also likely that as participants viewed more images, they became increasingly familiar with the image quality, potentially sharpening their ability to assess the overall quality of the various tooth preparations. Recently, a study involving spatial images showed that imposing a time limit on participants highlighted differences in attention between novices and experts, whereas there was no difference without a time limit [24]. Accordingly, in the current study the participants were given 15 s to observe each image and another 15 s to make the decision, and they could move on to the next image as soon as they had decided. This design is therefore better suited to elucidating differences between the novices (undergraduate dental students) and the experts (dental educators).

Studies have suggested that the total duration of fixation, a commonly used eye-tracking metric, is a useful tool in the study of learning processes [25]. Fixations are the moments when the eyes remain relatively still, focusing on a specific point, and are considered essential for processing visual information [26]. In particular, the total duration of fixation measures the cumulative time spent fixating on specific areas of interest. It has been shown that the fixation durations of slower readers are typically longer than those of skilled readers [27, 28]. Therefore, the total duration of fixation was chosen in the current study as the metric for evaluating differences in skill level between the novices and the experts.

The subject-based reports showed no significant difference between the novice and expert groups in the visual analog scale ratings (or in the dichotomous decisions) for either the best or the worst tooth preparation. This finding implies that both groups perceived the tooth preparations similarly, regardless of their professional background (undergraduate dental students vs. dental educators), even though they may weigh overall scores differently from the decision of whether a prepared tooth is ready for an impression. There was also no correlation between the VAS scores and the dichotomous decision. This may be because the dichotomous decisions were not based on predefined criteria or thresholds, whereas the continuous scores represent a spectrum of values [29].

The results of the current study also showed that, in general, the time to decision was significantly longer when assessing the best picture than the worst for both novices and experts. However, the time to decision was significantly longer in the novices than in the experts, suggesting that undergraduate dental students took more time than dental educators to assess and decide on tooth preparations. These observations are in accordance with previous studies, which suggested that novices need more time and carry a larger cognitive workload than experts because of uncertainty and lack of experience [11, 30, 31, 32].

The duration of assessment of the buccal view was longer than that of the occlusal view, but not the lingual view. The buccal view seemed to allow participants to make a quick decision when assessing an unacceptable preparation, so that they spent less time on the other views. It thus appears that the buccal view is important in evaluating a tooth preparation and that people typically take more time to evaluate a good preparation than a bad one. Fixation is regarded as a metric of cognitive processing, and longer fixations are generally interpreted as more processing [32]. Accordingly, we observed that participants tended to evaluate "obvious" discrepancies faster than "not so obvious" ones. Previous studies have shown that a greater number of fixations indicates greater visual attention and that, in general, people fix their gaze on a point of discrepancy without making further progress.

AOIs have been used in several studies across a variety of medical specialties to define specific locations for eye-tracking software to provide gaze-behavior information (duration of fixation, saccades, number of visits, etc.) [11, 14, 18, 33, 34]. In the current study, the AOIs were drawn according to the areas typically involved in tooth preparation and its assessment (margin, mesial taper, distal taper, and occlusal) [35]. It was also observed that, in general, both groups took considerably longer to assess the margin than the other AOIs. This may be because the accurate placement and fit of the dental crown depend on the convergence of the mesial and distal walls and on the overall shape of the preparation. Moreover, the mesial and distal taper AOIs could also be evaluated while observing the lingual view, whereas the buccal margin AOI is visible only in the buccal view. If there are no obvious discrepancies in the mesial and distal walls, participants tend to evaluate the finish line, which is perceived as an important determinant of a good tooth preparation and is perhaps difficult to evaluate at a glance. Conversely, when an obvious undercut is present in the mesial and distal walls, participants tend to spot it quickly in the worst tooth preparation, regardless of their group, enabling faster decision-making. In the best tooth preparation, by contrast, participants tend to hesitate until they are certain of the absence of discrepancies, resulting in relatively longer decision-making times.

Studies have suggested that a longer total duration of fixation indicates more visual attention and more cognitive processing of the stimulus; participants tend to spend more time fixating on areas that are visually salient, informative, or cognitively demanding [36]. In the current study, the duration of fixation was accordingly longer when evaluating the margin and when evaluating the best tooth preparation. These observations agree with previous findings that individuals tend to stare at a problem without making progress on the task and that students often spend a significant amount of time staring at a particular point, which may indicate uncertainty about the next step.

Therefore, a comprehensive evaluation of the total duration of fixation provides valuable insights into attention, cognitive processing, visual perception, perceptual load, and task demands. However, it is important to interpret this metric in conjunction with other eye-tracking measures and to consider the specific context of the study or task when drawing conclusions. Further studies should confirm these statements by evaluating the cognitive implications of tooth-preparation assessment.

Acknowledging limitations is essential for understanding the boundaries and constraints inherent in a study's design, data collection, and analysis. One limitation of the current study is the use of corrective glasses: previous research has suggested that corrective glasses can influence eye-tracking metrics, but we were unable to account for this variable. Authors of previous studies have also highlighted limitations concerning technical aspects, such as the eye-tracking device, lighting conditions, and the overall experimental design. The present study, however, adhered to the RESIDE recommendations, aiming to standardize parameters and minimize biases, which can be considered a strength [19]. The results indicated that participants had longer fixation durations on the buccal view. The sequential display order of the views may have introduced bias, as the buccal view consistently appeared first; future studies could randomize the view order to mitigate this. In the current study, however, the participants were exposed to a series of photographs under different conditions, and only the "best" and "worst" rated photographs were selected for the analysis, which perhaps reduces bias.

Conclusions

In summary, the results of the current study showed distinct differences in gaze behavior between the novices and the experts during the evaluation of a single crown tooth preparation. In particular, the novice group of dental students showed a longer total duration of fixation across all views (buccal, lingual, and occlusal) than the expert group of dental educators. Further, both groups spent more time assessing the best tooth preparation than the worst, yet the novices showed a longer total duration of fixation than the experts for both. The margin appears to be the most important AOI in the assessment of a single crown tooth preparation. These findings may have implications for dental education and clinical practice: understanding differences in gaze behavior between undergraduate dental students and dental educators could enhance diagnostic skills, help improve tooth preparation skills, and support constructive feedback. Further analysis of fixation patterns and their association with clinical decision-making remains to be investigated.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Abbreviations

ANOVA: Analysis of variance

AOI: Area of interest

RESIDE: Reporting Eye-tracking Studies In Dentistry

VAS: Visual analog scale

References

1. Khalaf KA, Moore C, McKenna G, Da Mata C, Lynch CD. Undergraduate teaching and assessment methods in prosthodontics curriculum: an international Delphi survey. J Dent. 2022;123:104207.

2. American Dental Association. The Glossary of Prosthodontic Terms. J Prosthet Dent. 2005;94(1):10–92.

3. Marchan SM, Coldero LG, Smith WAJ. An evaluation of the relationship between clinical requirements and tests of competence in a competency-based curriculum in dentistry. BMC Med Educ. 2023;23(1):585.

4. Asadoorian J, Batty HP. An evidence-based model of effective self-assessment for directing professional learning. J Dent Educ. 2005;69(12):1315–23.

5. Alfakhry G, Mustafa K, Ybrode K, Jazayerli B, Milly H, Abohajar S, et al. Evaluation of a workplace assessment method designed to improve self-assessment in operative dentistry: a quasi-experiment. BMC Med Educ. 2023;23(1):491.

6. Mays KA, Branch-Mays GL. A systematic review of the use of self-assessment in preclinical and clinical dental education. J Dent Educ. 2016;80(8):902–13.

7. Tuncer D, Arhun N, Yamanel K, Çelik Ç, Dayangaç B. Dental students' ability to assess their performance in a preclinical restorative course: comparison of students' and faculty members' assessments. J Dent Educ. 2015;79(6):658–64.

8. Truchetto T, Dumoncel J, Nabet C, Galibourg A. Computer-assisted evaluation and feedback of a complete student class for preclinical tooth preparation. J Dent Educ. 2023;87(S3):1776–9.

9. Koolivand H, Shooreshi MM, Safari-Faramani R, Borji M, Mansoory MS, Moradpoor H, et al. Comparison of the effectiveness of virtual reality-based education and conventional teaching methods in dental education: a systematic review. BMC Med Educ. 2024;24(1):8.

10. van der Gijp A, Ravesloot CJ, Jarodzka H, van der Schaaf MF, van der Schaaf IC, van Schaik JPJ, et al. How visual search relates to visual diagnostic performance: a narrative systematic review of eye-tracking research in radiology. Adv Health Sci Educ Theory Pract. 2017;22(3):765–87.

11. Capogna E, Salvi F, Delvino L, Di Giacinto A, Velardo M. Novice and expert anesthesiologists' eye-tracking metrics during simulated epidural block: a preliminary, brief observational report. Local Reg Anesth. 2020;13:105–9.

12. Gil AM, Birdi S, Kishibe T, Grantcharov TP. Eye tracking use in surgical research: a systematic review. J Surg Res. 2022;279:774–87.

13. Wilbanks BA, Aroke E, Dudding KM. Using eye tracking for measuring cognitive workload during clinical simulations: literature review and synthesis. Comput Inform Nurs. 2021;39(9):499–507.

14. Yamamoto M, Torii K, Sato M, Tanaka J, Tanaka M. Analysis of gaze points for mouth images using an eye tracking system. J Prosthodont Res. 2017;61(4):379–86.

15. Zhang Y, Wang X, Xu X, Feng S, Xia L. The use of eye-tracking technology in dento-maxillofacial esthetics: a systematic review. J Craniofac Surg. 2024;35(4):e329–33.

16. Al-Lahham A, Souza PHC, Miyoshi CS, Ignácio SA, Meira TM, Tanaka OM. An eye-tracking and visual analogue scale attractiveness evaluation of black space between the maxillary central incisors. Dent Press J Orthod. 2021;26(1):e211928.

17. Basmacı F, Mersin TÖ, Turgut B, Akbulut K, Kılıçarslan MA. Perception of orbital epitheses evaluated with eye tracker. J Prosthodont. 2022;31(9):754–60.

18. Gasparello GG, Júnior SLM, Hartmann GC, Meira TM, Camargo ES, Pithon MM, et al. The influence of malocclusion on social aspects in adults: study via eye tracking technology and questionnaire. Prog Orthod. 2022;23(1):4.

19. Cho VY, Loh XH, Abbott L, Mohd-Isa NA, Anthonappa RP. Reporting Eye-tracking Studies In DEntistry (RESIDE) checklist. J Dent. 2023;129:104359.

20. Enders LR, Smith RJ, Gordon SM, Ries AJ, Touryan J. Gaze behavior during navigation and visual search of an open-world virtual environment. Front Psychol. 2021;12:681042.

21. Rosella D, Rosella G, Brauner E, Papi P, Piccoli L, Pompa G. A tooth preparation technique in fixed prosthodontics for students and neophyte dentists. Ann Stomatol (Roma). 2015;6(3–4):104–9.

22. Han S, Yi Y, Revilla-León M, Yilmaz B, Yoon HI. Feasibility of software-based assessment for automated evaluation of tooth preparation for dental crown by using a computational geometric algorithm. Sci Rep. 2023;13(1):11847.

23. Tahani B, Rashno A, Haghighi H, Monirifard R, Khomami HN, Kafieh R. Automatic evaluation of crown preparation using image processing techniques: a substitute to faculty scoring in dental education. J Med Signals Sens. 2020;10(4):239–48.

24. Roach VA, Fraser GM, Kryklywy JH, Mitchell DGV, Wilson TD. Time limits in testing: an analysis of eye movements and visual attention in spatial problem solving. Anat Sci Educ. 2017;10(6):528–37.

25. Liu PL. Using eye tracking to understand learners' reading process through the concept-mapping learning strategy. Comput Educ. 2014;78:237–49.

26. Martinez-Conde S, Macknik SL, Hubel DH. The role of fixational eye movements in visual perception. Nat Rev Neurosci. 2004;5(3):229–40.

27. Rayner K, Slattery TJ, Bélanger NN. Eye movements, the perceptual span, and reading speed. Psychon Bull Rev. 2010;17(6):834–9.

28. Reichle ED, Liversedge SP, Drieghe D, Blythe HI, Joseph HSSL, White SJ, et al. Using E-Z Reader to examine the concurrent development of eye-movement control and reading skill. Dev Rev. 2013;33(2):110–49.

29. Hu X. A theory of dichotomous valuation with applications to variable selection. Econom Rev. 2020;39(10):1075–99.

30. Kumar A, Koullia N, Jongenburger M, Koutris M, Lobbezoo F, Trulsson M, et al. Behavioral learning and skill acquisition during a natural yet novel biting task. Physiol Behav. 2019;211:112667.

31. Kumar A, Munirji L, Nayif S, Almotairy N, Grigoriadis J, Grigoriadis A, et al. Motor performance and skill acquisition in oral motor training with exergames: a pilot study. Front Aging Neurosci. 2022;14:730072.

32. da Silva Soares R, Oku AYA, Barreto CdSF, Sato JR. Exploring the potential of eye tracking on personalized learning and real-time feedback in modern education. Prog Brain Res. 2023;282:49–70.

33. Richter J, Scheiter K, Eder TF, Huettig F, Keutel C. How massed practice improves visual expertise in reading panoramic radiographs in dental students: an eye tracking study. PLoS One. 2020;15(12):e0243060.

34. Botelho MG, Ekambaram M, Bhuyan SY, Yeung AWK, Tanaka R, Bornstein MM, et al. A comparison of visual identification of dental radiographic and nonradiographic images using eye tracking technology. Clin Exp Dent Res. 2020;6(1):59–68.

35. Podhorsky A, Rehmann P, Wöstmann B. Tooth preparation for full-coverage restorations: a literature review. Clin Oral Investig. 2015;19(5):959–68.

36. Wolf C, Lappe M. Salient objects dominate the central fixation bias when orienting toward images. J Vis. 2021;21(8):23.


Funding

Open access funding provided by Karolinska Institute. This work was partially supported by research grants from Stockholm County Council and Karolinska Institutet (SOF: Styrgruppen för Odontologisk Forskning) and by grants from the Karolinska Institutet Foundation.

Author information

Authors and Affiliations

Department of Prosthodontics, School of Dental Medicine, ADES, CNRS, Aix-Marseille University, EFS, Marseille, France

Frédéric Silvestri

Division of Oral Rehabilitation, Department of Dental Medicine, Karolinska Institutet, Huddinge, Sweden

Frédéric Silvestri, Nabil Odisho & Anastasios Grigoriadis

Division of Oral Rehabilitation, Department of Dental Medicine, Karolinska Institutet, Alfred Nobels Allé 8, Box 4064, 141 04, Huddinge, Sweden

Abhishek Kumar

Academic Center for Geriatric Dentistry, Stockholm, Sweden


Contributions

FS, NO, AK, and AG contributed to the conception of the study. FS and NO performed the experiment and data collection. AK performed the statistical analysis. The manuscript was drafted by FS and AK, and all authors read, edited, and approved the final manuscript.

Corresponding author

Correspondence to Abhishek Kumar.

Ethics declarations

Ethics approval and consent to participate

Written informed consent was obtained from all participants before participation in the study, in accordance with the Declaration of Helsinki. The project was approved by the Ethics Review Authority, Stockholm (Dnr 2023–04136-01).

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article

Silvestri, F., Odisho, N., Kumar, A. et al. Examining gaze behavior in undergraduate students and educators during the evaluation of tooth preparation: an eye-tracking study. BMC Med Educ 24, 1030 (2024). https://doi.org/10.1186/s12909-024-06019-4

Download citation

Received: 26 April 2024

Accepted: 12 September 2024

Published: 19 September 2024

DOI: https://doi.org/10.1186/s12909-024-06019-4


Keywords

  • Tooth preparation
  • Eye tracking technology
  • Undergraduate
  • Prosthodontics
  • Pilot study
  • Dental education

