Statistical assumptions on Research Engineer!
Meeting assumptions is necessary when running inferential statistics
I put up some new pages for the following statistical assumptions, as promised! Click on one of the links below!
1. Independence of observations
3. Homogeneity of variance
5. Normality of difference scores
6. Chi-square assumption
My very best wishes to you and yours!
No statistical soliloquy today. I am humbled and thankful for those listed above and for your interest in my website. Thank you for your patronage and many blessings to thee and thine.
Analyze three or more measures of an ordinal outcome
Wilcoxon is used as a post hoc test for significant main effects
The Greenhouse-Geisser correction is often employed when analyzing data with repeated-measures ANOVA. The statistical assumption of sphericity, as assessed by Mauchly's test in SPSS, is violated more often than not. The Greenhouse-Geisser correction adjusts the degrees of freedom of the F-test to compensate for the violation of this statistical assumption with repeated-measures ANOVA. The means and standard deviations from a repeated-measures ANOVA can then be interpreted.
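To make the correction a little more concrete, here is a minimal sketch of how the Greenhouse-Geisser epsilon can be estimated from the covariance matrix of the repeated measures. The function name and the scores are hypothetical and for illustration only; SPSS reports this value for you, and this is not meant to reproduce its output exactly.

```python
from statistics import mean


def greenhouse_geisser_epsilon(data):
    """Estimate epsilon; rows are participants, columns are repeated measures."""
    n = len(data)
    k = len(data[0])
    col_means = [mean(row[j] for row in data) for j in range(k)]

    # Sample covariance matrix of the k repeated measures.
    cov = [
        [
            sum((row[i] - col_means[i]) * (row[j] - col_means[j]) for row in data)
            / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]

    # Double-center the covariance matrix: remove row and column means,
    # then add back the grand mean.
    row_means = [mean(cov[i]) for i in range(k)]
    grand_mean = mean(row_means)
    centered = [
        [cov[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(k)]
        for i in range(k)
    ]

    trace = sum(centered[i][i] for i in range(k))
    sum_of_squares = sum(value * value for row in centered for value in row)
    return trace ** 2 / ((k - 1) * sum_of_squares)


# Hypothetical data: 5 participants measured at 3 time points.
scores = [[8, 7, 6], [9, 8, 5], [6, 5, 5], [7, 7, 4], [8, 6, 3]]
epsilon = greenhouse_geisser_epsilon(scores)
print(f"Greenhouse-Geisser epsilon: {epsilon:.3f}")
```

Epsilon ranges from 1/(k - 1) to 1, where 1 means sphericity holds. The corrected F-test multiplies both the numerator and denominator degrees of freedom by epsilon.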
Friedman's ANOVA, in my experience, does not make many appearances in the empirical literature. Few people take three or more within-subjects or repeated measures of an ordinal outcome in order to answer their primary research question, I guess. It is a non-parametric statistical test since the data is measured at more of an ordinal level. When a significant main effect is found with a Friedman's ANOVA, then post hoc comparisons must be made within-subjects or amongst observations using Wilcoxon tests.
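As a rough illustration of what Friedman's ANOVA does, the sketch below ranks each participant's observations and computes the test statistic by hand. The ratings are made up, the function names are hypothetical, and no correction for ties is applied to the statistic, so treat it as a teaching sketch rather than a replacement for statistical software.

```python
def average_ranks(values):
    """Rank values from 1 to k, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg_rank
        pos = end + 1
    return ranks


def friedman_statistic(data):
    """Friedman chi-square; rows are participants, columns are observations."""
    n = len(data)
    k = len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)


# Hypothetical Likert-type ratings: 4 participants, 3 observations each.
ratings = [[2, 3, 5], [1, 4, 3], [2, 3, 4], [3, 2, 5]]
statistic = friedman_statistic(ratings)
print(f"Friedman chi-square = {statistic:.2f} on {len(ratings[0]) - 1} df")
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom. In practice, routines such as `friedmanchisquare` and `wilcoxon` in SciPy's `scipy.stats` module provide the p-values for the omnibus test and the post hoc comparisons.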
Friedman's ANOVA, while being a non-parametric statistic, may have the most statistical power when employed with cross-sectional data yielded from a survey instrument that has limited reliability and validity evidence. Likert scales and composite scores from such tests may be naturally skewed due to systematic and unsystematic error. Friedman's ANOVA is robust to these types of distributions that come from cross-sectional studies in the social sciences.
If the assumption of normality among the difference scores between observations of a continuous outcome cannot be met, then Friedman's ANOVA can be used to yield inferential evidence. But it is always a better idea to first check for outliers in a distribution (individual observations that are more than 3.29 standard deviations away from the mean) and make a decision as to whether to 1) delete the observation in a listwise fashion, or 2) run a logarithmic transformation on the distribution.
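The outlier check and the logarithmic transformation described above can be sketched in a few lines. The data and function names are hypothetical, and the base-10 logarithm is an assumption on my part; a natural log works the same way. Note that the log is only defined for positive values.

```python
import math
from statistics import mean, stdev


def flag_outliers(values, cutoff=3.29):
    """Return True for each value more than `cutoff` SDs from the mean."""
    m = mean(values)
    s = stdev(values)
    return [abs((v - m) / s) > cutoff for v in values]


def log_transform(values):
    """Base-10 log transformation; every value must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log transformation requires positive values")
    return [math.log10(v) for v in values]


# Hypothetical outcome with one extreme observation at the end.
outcome = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
           10, 9, 12, 10, 11, 10, 9, 11, 10, 60]
flags = flag_outliers(outcome)
print("Outliers:", [v for v, is_out in zip(outcome, flags) if is_out])
print("Transformed:", [round(v, 2) for v in log_transform(outcome)])
```

Keep in mind that in very small samples no observation can reach 3.29 standard deviations from the mean, so this screen is only informative with a reasonable number of observations.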
You will have to transform the other observations of the outcome if you choose #2 above. The means and standard deviations of transformed variables cannot be interpreted but the p-values can be interpreted. Report the median and interquartile range for transformed variables.
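Here is a small sketch of the descriptive statistics to report in that case. The function name and values are hypothetical, and the "inclusive" quartile method is just one convention; statistical packages differ slightly in how they compute quartiles.

```python
from statistics import median, quantiles


def median_and_iqr(values):
    """Return the median with the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


# Hypothetical observations on the original (untransformed) scale.
observations = [1, 2, 3, 4, 5, 6, 7, 8, 9]
med, q1, q3 = median_and_iqr(observations)
print(f"Median = {med}, IQR = {q1} to {q3}")
```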
Deleting observations can introduce bias into the statistical analysis. This should only be done if the number of outliers constitutes less than 10% of the overall distribution. One can also run between-subjects comparisons between participants with all observations of the outcome versus participants without all observations. If there are no differences on predictor, confounding, and outcome variables between these two groups, then lessened observation bias can be assumed.
An amalgamation of philosophy and objectivity
The research question is the foundation of everything empirical
Research questions (and answering them) are always the primary focus of anything and everything empirical, methodological, epidemiological, and statistical. Without a research question, there is no reason to conduct a study or run statistics.
The following are DIRECTLY derived from research questions:
1. Null and alternative hypotheses (hypothesis testing and inferential statistics)
2. Research design (observation or experimental)
3. Population of interest (inclusion and exclusion criteria)
4. Sampling method (non-probability or probability)
5. Intervention or independent variable (categorical, ordinal, or continuous)
6. Confounding or control variables (secondary, tertiary, and ancillary research questions)
7. Comparator or control treatment (categorical, ordinal, or continuous)
8. Outcome or dependent variable (categorical, ordinal, or continuous)
9. Outcome and design for an a priori power analysis to calculate sample size
10. Structure of the database (between-subjects, within-subjects, or multivariate) and code book
11. Statistical tests used (descriptive, between-subjects, within-subjects, correlations, survival, or multivariate)
Researchers must take the appropriate amount of time to fully formulate and refine research questions. SO MUCH is dependent upon it for their study. Luckily, this task is made easier with the use of two prevalent mnemonics: FINER (feasible, interesting, novel, ethical, relevant) and PICO (population, intervention, comparator, outcome).
FINER is more of a philosophy for writing research questions. The arguments for the "F," "I," "N," "E," and "R" are all informed by the empirical literature in the area of empirical or clinical interest. Researchers especially have to be well versed in the most current literature in order to make sound arguments for interesting, novel, and relevant questions.
PICO is employed to explicitly and operationally define the population of interest, the intervention, the comparator, and the outcome in a research question. It is also more readily applicable in busy clinical and empirical environments and when writing literature search queries.
These two mnemonics complement each other very well in applied empirical and clinical environments. The post-positivist philosophy of social and medical sciences lends itself well to FINER. Measurement of observable constructs and the application of experimental designs through the PICO mnemonic is also strongly reflective of a post-positivist philosophical orientation. Together, the "why" and "what" questions associated with conducting research can be argued in an evidence-based, objective, and logically sound fashion.
Kappa is a measure of inter-rater reliability
Rating performance or constructs at a dichotomous categorical level
The Kappa statistic is a measure of inter-rater reliability when the construct or behavior is being rated using a dichotomous categorical outcome. When a sequential series of steps must be completed to yield an end product, such as with performance assessment, then a "checklist" or series of "yes/no" responses are scored by independent raters. The Kappa statistic can be used to assess the level of agreement/consistency/reliability between raters on subsequent dichotomous responses.
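For readers who want to see the arithmetic, here is a minimal sketch of Cohen's Kappa for two raters scoring a "yes/no" checklist. The ratings are invented and the function name is hypothetical. Kappa compares the observed agreement with the agreement expected by chance, given how often each rater says "yes" and "no."

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters scoring the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's marginal proportions.
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    if expected == 1:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical checklist: 1 = "yes", 0 = "no" on 10 steps of a procedure.
rater_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
rater_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
kappa = cohens_kappa(rater_a, rater_b)
print(f"Kappa = {kappa:.2f}")
```

In this example the raters agree on 8 of 10 steps (80%), but 52% agreement would be expected by chance alone, so Kappa is noticeably lower than the raw agreement.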
It is important that raters have an operational definition of what constitutes a "yes" or "no" with regard to performance. The construct or behavior of interest must be standardized between raters so that unsystematic bias can be reduced. A lack of operationalization and standardization in performance assessment significantly DECREASES the chances of obtaining evidence of inter-rater reliability when using the Kappa statistic.
Kappa is not a "powerful" statistic because of the dichotomous categorical variables used in the analysis. Larger sample sizes are needed to achieve adequate statistical power when categorical outcomes are utilized. So, many observations of performance or simulation may be needed to adequately assess BOTH inter-rater reliability and outcomes of interest. The chances of having adequate inter-rater reliability decrease with fewer observations of performance or simulation.
Correlations and regression are used to establish this kind of evidence
Predictive validity evidence means that a survey instrument has the ability to predict some sort of occurrence in the future. The most common application of predictive validity occurs in tests like the ACT, SAT, GRE, MCAT, LSAT, and GMAT. These tests are given before entering various phases of higher education to assess an individual's potential to graduate from either undergraduate or graduate school. Interestingly enough, the correlation between these prevalent (and expensive) tests and graduation is only 0.3! This means that 91% of what accounts for graduation is NOT associated with test scores on these instruments. And we are talking a multi-BILLION dollar business...but, I digress.
Predictive validity is calculated using simple correlation coefficients. A correlation of 0.1 is considered weak evidence, a correlation of 0.3 denotes moderate evidence, and a correlation of 0.5 would make most social scientists jump for joy. Remember, in order to understand the amount of shared variance between two constructs, you simply "square" the correlation coefficient to yield the coefficient of determination. Even with the highest level of predictive evidence with a predictive validity coefficient of 0.5, you are only accounting for 25% of the variance shared between the two constructs!
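The squaring step is simple enough to work through for the three benchmark correlations. This is just the arithmetic written out; the function name is hypothetical.

```python
def shared_variance(r):
    """Coefficient of determination: the squared correlation coefficient."""
    return r ** 2


for r in (0.1, 0.3, 0.5):
    explained = shared_variance(r)
    print(f"r = {r}: {explained:.0%} shared variance, {1 - explained:.0%} unexplained")
```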
Within medicine, I believe that predictive validity plays an important role in imaging and early diagnosis. One of the benefits of working in medicine is that the measures are more objective, concrete, observable, validated, and measurable versus the social sciences. Correlations of 0.9 are common between various etiological, prognostic, confounding, clinical, and demographic phenomena within medicine. If an imaging or diagnostic method can detect the earlier stages of a progressing disease state, then future outcomes can be mitigated with earlier and preventative treatment.
Independence of observations
Each participant in a sample can only be counted as one observation
As a biostatistician, I spend a lot of time testing for normality and homogeneity of variance.
Skewness and kurtosis statistics are used to assess the normality of a continuous variable's distribution. A skewness or kurtosis statistic above an absolute value of 2.0 is considered to be non-normal. Distributions are often non-normal due to outliers in the distribution. Any observation that falls more than 3.29 standard deviations away from the mean is considered an outlier.
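The screening rule above can be sketched as follows. The function name and data are hypothetical, and the formulas are the simple moment-based versions. SPSS reports bias-adjusted versions of these statistics, so its values will differ somewhat, especially in small samples.

```python
from statistics import mean


def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis


def looks_non_normal(values, threshold=2.0):
    """Apply the absolute-value-of-2.0 rule to both statistics."""
    skewness, kurtosis = skewness_and_kurtosis(values)
    return abs(skewness) > threshold or abs(kurtosis) > threshold


# Hypothetical distribution with one extreme observation.
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 20]
skew, kurt = skewness_and_kurtosis(skewed)
print(f"Skewness = {skew:.2f}, kurtosis = {kurt:.2f}")
print("Non-normal by the 2.0 rule:", looks_non_normal(skewed))
```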
Levene's Test of Equality of Variances is used to test the assumption of homogeneity of variance. Any Levene's Test with a p-value below .05 means that the assumption has been violated. In the event that the assumption is violated, non-parametric tests can be employed.
There is one more important statistical assumption that exists coincident with the aforementioned two, the assumption of independence of observations. Simply stated, this assumption stipulates that study participants are independent of each other in the analysis. They are only counted once.
In between-subjects designs, each study participant is a mutually exclusive observation that is completely independent from all other participants in all other groups.
For within-subjects designs, each participant is independent of other participants. There are just multiple observations of the outcome, per participant.
With this being said, it is prevalent for researchers to take multiple measurements of an outcome and compare these multiple measurements in an independent fashion (oftentimes with differing numbers of observations across participants) or within-subjects (ALWAYS with differing numbers of observations of the outcome). By default, these are not independent measures and violate the assumption of independence of observations. What is one to do?
The answer is generalized estimating equations (GEE). This family of statistical tests is robust to multiple observations (or correlated observations) of an outcome and can be used for between-subjects, within-subjects, factorial, and multivariate analyses.
Eric Heidel, Ph.D. is Owner and Operator of Scalë, LLC.
Copyright © 2020 Scalë. All Rights Reserved. Patent Pending.