Saturday, August 29, 2015

"Everything You Know is Wrong!"

‘Men and women are the same sex!
Pigs live in trees!
The Aztecs invented the vacation!
Aliens are living like Indians in an Arizona nudist park!
EVERYTHING YOU KNOW IS WRONG!’  -- The Firesign Theater

They were on vinyl back then.  "Sir, Syrup won't stop 'em.  They're in everybody's eggs!"

Back in the trippy early 1970s, the crazed psychedelic comic radio theater ensemble Firesign Theater came out with a series of loopy stream-of-consciousness record albums for stoners.  They were changing the face of comedy by constantly reframing their narrative, back before Monty Python cornered the market on silliness.  This radio play is a send-up of then-famous pseudoscientist Erich von Däniken, author of Chariots of the Gods.

Humor is a fertile object of study and inspiration for any social constructionist because it demonstrates that meaning is relative and entirely dependent on context.  Flip the context and you flip the meaning.  Monty Python shows this with dialogues in which characters obstinately refuse to accept each other’s meanings.  The Firesign Theater did this by constantly changing the frame so you had to keep up just to know what they were talking about.  Each segue was a free association worthy of William Burroughs.

But I didn’t come here to talk about comedy. I came to talk about science.  Especially the social and medical sciences that constitute the background of all the therapeutic work we do.  If modern sexology can be considered to have started with Napoleonic civil administrators trying to count the prostitutes of Paris, figuring out how we count things is especially important.

In fact, sexology didn’t really get started until 1886 with Richard von Krafft-Ebing’s first edition of Psychopathia Sexualis.  That volume did not rely on counting things.  It was an anthology of case studies, and initiated a clinical methodology that would dominate sexology and much of psychology until 1948, when Alfred Kinsey first published Sexual Behavior in the Human Male, based upon survey data.

Kinky Boots:  Crippling fetish, or good clean fun?
For the entire period between 1885 and 1950, the practice of sex therapy was dominated by the clinical case history.  In the later language of statistical sampling, case histories are studies with an N of 1.  The dangers are obvious now.  If Krafft-Ebing learned of a shoe fetishist who was unable to sexually respond except to women’s shoes, it was assumed that sexual fetishists were all unhappy weaklings who couldn’t get satisfaction without their preferred sex object.  It took years for it to occur to anyone (Freud) that many fetishists could get off just fine without their fetish being present, and that only the seriously unhappy ones who couldn’t were braving the costs and uncertainties of treatment to discuss it.

Methodologies have epistemological traps.  In the process of illuminating some truths, they throw others into shadow.  Today’s front page story from the New York Times illuminates this regarding laboratory experiments, today’s blue-chip method of academic psychology.  Laboratory studies are appealing because they offer scientists the opportunity to control variables and to potentially prove causality.  Case histories can prove that something can happen.  Surveys can show that things co-occur and co-vary.  Experimentation can prove that a change in one thing was caused by a change in another.

As a psychologist, one of the things I know is that about 5% of everything I know to be true is wrong.  This is before we get to any personal issues of fallibility I might have that are unique to my professional limitations.  I consume studies that meet the prevailing professional standard in academic research:  if a hypothesis tested in a psychological study has a 95% chance of being right, it is true, and if it has a 94% chance or less of being right, it is wrong.  This is how the statistical tests used to test hypotheses work.  A test is conducted to see if the data supporting the hypothesis might have occurred by chance.  If the likelihood that the result happened by chance is p = 5% or less, most researchers call their hypothesis confirmed.  Even if this standard worked ideally from a statistical point of view, perfectly correct hypotheses would occasionally be proven ‘wrong,’ and incorrect hypotheses would get lucky and be proven ‘right’ about one time in 20.  So at any given moment in time, only most of what I know is genuinely correct.  Hopefully, only the good stuff is in this post.
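
For readers who like to see the one-in-20 figure rather than take it on faith, here is a minimal simulation sketch.  It is an illustration, not anything taken from the studies discussed:  it assumes a simple z-test with a known spread of 1, and the sample size of 30, the 4,000 trials, and the seed are arbitrary choices.  Every simulated study tests an effect that does not exist, so every 'significant' result is a lucky accident.

```python
import math
import random


def two_sided_p(sample, sigma=1.0):
    """Two-sided p-value of a z-test that the true mean is zero.

    Assumes the population standard deviation (sigma) is known.
    """
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(trials=4000, n=30, alpha=0.05, seed=1):
    """Fraction of no-effect studies that still come out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Every sample is pure noise: the true effect is exactly zero.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sided_p(sample) < alpha:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"Share of no-effect studies reaching p < 0.05: {rate:.3f}")
```

The printed share should hover near 0.05, which is the sense in which the threshold lets about one incorrect hypothesis in 20 get lucky.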

This has led some humorists to characterize psychologists and their ilk as faceless grey ciphers.  After all, nothing rare ever happens to me.  Anything unlikely proves some hypothesis or other.  Who knows what desperate shenanigans I will be driven to in order to disprove that facetious mischaracterization?  Perhaps a blog on kink and psychotherapy?  But it is worth wondering what we ought to do to improve the odds.  After all, 5% of our teaching and clinical wisdom is probably not correct.

A 20-sided die, often used in role playing games.  The chance of rolling a '20' is 5%.

Which brings us to the problem of replication.  If a study is conducted once and meets the 95% confidence standard, there is about a 5% chance it’s wrong.  But if it were to be precisely replicated, and met the standard 2 times in 2 tries, the chance of error shrinks from one in 20 to one in 400.  That is simple enough:  conduct important studies twice and only report those that are replicated, and the chance of error falls precipitously.  And exactly this type of statistical thinking does influence medical and safety studies where error might be fatal.  Researchers in those fields choose much stricter confidence levels to test such hypotheses.  But the sociology of science does not make replication easy or trivial.
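
The one-in-400 figure is just one in 20 multiplied by itself.  A small sketch of that arithmetic follows, assuming the replications are fully independent of one another (the function name is made up for illustration):

```python
from fractions import Fraction

ALPHA = Fraction(1, 20)  # the one-in-20 threshold used in the text


def chance_all_flukes(runs, alpha=ALPHA):
    """Chance that `runs` independent studies of a nonexistent effect
    ALL reach significance by luck alone (independence is assumed)."""
    return alpha ** runs


for runs in (1, 2, 3):
    print(f"{runs} run(s): {chance_all_flukes(runs)}")
```

Exact fractions are used so the output reads the way the paragraph does, as one chance in so many.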

While experimenters are required to report methods and results so that other qualified scientists can check or repeat their work, only the tiniest fraction of work is replicated.  Careers are rewarded for original, not replicated, work, so someone has to specifically and exceptionally reward replications.  Senior scientists who have carefully dreamed up and executed work are sensitive about who might conduct such replication, and how.  Who wants younger or less qualified colleagues to ‘check’ their work?  What if replication fails?  And there are hosts of methodological issues about what constitutes exact reproducibility.  The subjects, the times, the geographic regions, institutional support, the public’s familiarity with the study’s design and outcome:  all kinds of variables threaten to make an attempted replication systematically different in a manner that might account for different results.  Many studies rely on naive subjects or deception, bringing to mind Heraclitus’s warning:  "No man ever steps in the same river twice, for it is not the same river and he is not the same man."  The result: no genuine replication.

Despite all those potential obstacles to replication, three major psychology journals, Psychological Science, the Journal of Personality and Social Psychology, and the Journal of Experimental Psychology:  Learning, Memory, and Cognition, participated in the Reproducibility Project, which selected the 100 most important studies published in the year 2008 for replication.  To overcome barriers to proper execution, the Reproducibility Project mandated and funded close cooperation between the original researchers and the teams conducting each replication.  To ensure stability of results, many replications used more subjects than the original published studies.

If 100 studies had all had a 5% chance of failing replication, we would expect about 95 to pass and about five to fail.  In the reported replication effort, 60 failed, 36 passed, 2 were too ambiguous to call, and 2 of the original studies had failed to achieve statistical significance but were published anyway and rated as important enough to replicate.  Most of those failing had results that were similar in direction to their original studies but did not reach statistical significance.  If they had been conducted for the first time, these 60 would most likely not have been reported.
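
How surprising is 36 passes when 95 were expected?  The sketch below takes the simplified premise above at face value:  each of the 100 studies is assumed to have an independent 95% chance of replicating, which is an idealization for illustration rather than a claim about the real studies.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent tries."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k_max, n, p):
    """Probability of k_max or fewer successes in n independent tries."""
    return sum(binomial_pmf(k, n, p) for k in range(k_max + 1))


N_STUDIES = 100
P_REPLICATE = 0.95    # assumption: the idealized 95% premise
OBSERVED_PASSES = 36  # passes reported in the text

expected_passes = N_STUDIES * P_REPLICATE
tail = prob_at_most(OBSERVED_PASSES, N_STUDIES, P_REPLICATE)

print(f"Expected passes under the premise: {expected_passes:.0f}")
print(f"Chance of {OBSERVED_PASSES} or fewer passes: {tail:.3g}")
```

Under that premise the chance of so few passes is vanishingly small, so it is the premise, not bad luck, that has to give way.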

Which brings up our first and most serious form of bias in social science research.  If I read the literature, form the best hypothesis I can, scrupulously conduct my study, and for whatever reason I fail to obtain a statistically significant result, I quit, pick another hypothesis to test and start over.  I probably can’t get my failed results published, certainly can’t advance my case for academic tenure, and no one ever knows about my negative result.  All of my efforts fail to become a part of the scientific record.
This is a problem for science because someone could go out and repeat my failure all over again, not knowing my work.  But it is very unlikely that if I had published my negative results, anyone would have bothered to replicate them.  Career considerations alone would propel them to test something that didn’t already have one strike against it.  You might think that a thoughtful person who looked at my work might have a creative idea to improve on my failed methods and retest under more favorable conditions.  And you would be right:  a great deal of this goes on in pharmaceutical research, with slight changes in research design getting repeated until a positive result is achieved.  Mostly such programmatic research is a good thing, but it too has vulnerabilities.  One can take an indifferent study and repeat it enough times to get a statistically significant result, and thereby pass clinical trials.  But that raises the specter of another source of possible error, that of excessive self-interest or financial interest.
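
The repeat-until-it-works vulnerability can be put in numbers.  This sketch assumes each attempt is an independent test of an effect that does not exist, judged at the usual 5% threshold; the function names are invented for illustration.

```python
import math


def chance_of_a_hit(tries, alpha=0.05):
    """Chance that at least one of `tries` independent no-effect studies
    reaches significance by luck alone."""
    return 1 - (1 - alpha) ** tries


def tries_needed(target, alpha=0.05):
    """Smallest number of tries giving at least `target` chance of a lucky hit."""
    return math.ceil(math.log(1 - target) / math.log(1 - alpha))


for tries in (1, 5, 10, 20):
    print(f"{tries:2d} tries: {chance_of_a_hit(tries):.0%} chance of a 'significant' result")

print(f"Tries needed for better-than-even odds: {tries_needed(0.5)}")
```

Under those assumptions, a treatment that does nothing has better-than-even odds of producing a 'significant' result somewhere within fourteen attempts.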

As fond as we are of saying it doesn't, in research, effect size matters!
The practice of managing statistical significance is only part of the problem.  A more socially relevant statistical measure of a study’s importance is effect size.  Generally, effect size statistics tell us how much of one variable is accounted for by our knowledge of another.  A p value tells us how likely it is that a given experiment’s results might have been achieved by chance alone, but when studies have thousands of subjects, it is easy to achieve statistical significance for very tiny effects.  If I learned that only children made better managers, it would matter a lot to how much better run my company would be whether they were 3% better managers or 15% better.  In the latter case, it might make sense to ask about interviewees’ birth order.  With tiny effects, it might not be worth the added expense of asking the question.
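
To make the gap between significance and effect size concrete, here is a rough sketch.  It assumes two equal-sized groups compared with a simple z-test, and an observed difference that comes out at exactly d standard deviations; the value d = 0.03 and the group sizes are made-up illustrations, not data from any study.

```python
import math


def p_value(d, n_per_group):
    """Two-sided p-value for a two-group z-test when the observed
    standardized mean difference is exactly d."""
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))


def variance_explained(d):
    """Share of the outcome's variance accounted for by group membership
    (converts d to a correlation, assuming equal group sizes)."""
    r = d / math.sqrt(d * d + 4)
    return r * r


TINY_EFFECT = 0.03  # hypothetical effect size for illustration

for n in (100, 10_000, 100_000):
    print(f"n per group = {n:>7,}: p = {p_value(TINY_EFFECT, n):.3g}")

print(f"Variance explained by the effect: {variance_explained(TINY_EFFECT):.4%}")
```

The same tiny effect goes from nowhere near significant to overwhelmingly significant as the groups grow, while the share of variance it explains never changes.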

In the Reproducibility Project, the average effect size of the 100 replicated studies was half of that of the originals.  That is the equivalent of halving their miles per gallon.  Everything I know isn’t wrong, but right is now at half strength.

Brian Nosek, PhD, Director of the Reproducibility Project

The replication study’s director, Brian Nosek, is full of admiration for the courage and integrity of the team of 270 professionals who participated in these studies.  No evidence of wrongdoing was found.  He stresses that this is not about any lack of integrity.  No one knows better than Nosek how hard doing replication really is.  But it is not the least bit reassuring about the overall integrity of the average psychology experiment reports submitted to top journals for publication.  97% of these achieved statistical significance, but only 36% could be replicated.

P-hacking (methodologically cheating to increase your likelihood of achieving statistical significance); collecting part of your data, then checking your results, and only completing the collection of data if the sneak peek looks good; and deep-sixing failed results are only a few of the actions biasing experimental results.  Funding and publication biases, prevailing research fads and orthodoxies, the occasional power plays by senior scholars, academic hiring practices, social prejudices, and the occasional outright fraud also influence what gets published and what is deemed important from among the thousands of research reports published annually.
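
The sneak-peek habit is easy to simulate.  In the sketch below every study tests an effect that does not exist.  The 'honest' researcher finishes every study; the 'peeking' researcher looks at the first half of the data and only finishes the studies whose half-time result looks promising, quietly dropping the rest.  The study size of 100, the peek at 50, and the "promising" cutoff of p < 0.20 are all hypothetical choices for illustration.

```python
import math
import random


def p_value(sample):
    """Two-sided z-test p-value that the true mean is zero (spread of 1 assumed)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(trials=5000, n_full=100, n_peek=50, peek_cutoff=0.20,
             alpha=0.05, seed=7):
    """Return (honest_rate, peeking_rate) of false positives.

    honest_rate:  share of ALL studies that end up significant.
    peeking_rate: share of COMPLETED studies that end up significant,
                  when only studies with a promising peek are completed.
    """
    rng = random.Random(seed)
    all_sig = completed = completed_sig = 0
    for _ in range(trials):
        data = [rng.gauss(0.0, 1.0) for _ in range(n_full)]  # no real effect
        final_sig = p_value(data) < alpha
        all_sig += final_sig
        if p_value(data[:n_peek]) < peek_cutoff:  # the sneak peek looks good
            completed += 1
            completed_sig += final_sig
    return all_sig / trials, completed_sig / completed


honest_rate, peeking_rate = simulate()
print(f"False positives when every study is finished: {honest_rate:.1%}")
print(f"False positives among studies kept after a peek: {peeking_rate:.1%}")
```

No single step in the peeking strategy looks like fraud, yet the studies that survive it are several times more likely to be lucky accidents than the 5% threshold advertises.
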
We clinicians, busy with our clientele and paperwork, have biases of our own about which of these studies we consume.  And frankly, the methodological sophistication of most clinicians could be greatly improved.  Most of us selected clinical work in preference to academic research.  For all but a few of us, that was a wise economic decision.  Even those of us who are fanatical about CEs, career updating and re-certification are likely to retain biases from the period of our training long after we have left school and devoted our lives to practice.
The biggest source of bias, however, remains epistemological.  With limited money and time, and a realistic assessment of what promotes career advancement, it is far easier to get money to solve big problems that affect lots of people, and hard to get money to solve the problems of a few.  Social stigmas are hard to research because societies retain vested interests in maintaining them, rather than spending risky money in hopes of overturning them.  People who insist on researching what they love may be just as biased as those who research what pays well.  We are each readier to see what we expect to see than what we do not expect.  It is easier to confirm our biases than to dispel them.  It seems there are biases, as the lady said of turtles, all the way down.

All of this must give us pause when we demand that clinical practice be more ‘evidence-based.’  The effort to collect experimental data is extremely valuable, as long as we appreciate the limitations of our methods and are careful not to over-interpret them.

The Kepler Space Telescope searches for exoplanets in 0.25% of the sky.

The Kepler Space Telescope is currently conducting a search for exoplanets, with special hope of detecting Earth-like planets that might sustain life, and perhaps maintain conditions for the evolution of intelligent life.  A great deal is unknown about exactly what those conditions might be.  Because of its orbit and limitations, the Kepler Space Telescope can only train on a tiny percentage, 0.25%, of the visible sky.  This chosen percentage is focused on the lens of the Milky Way Galaxy, where stars are far enough away from the galactic core not to be irradiated and close enough to have high star density.  Midway through its mission, the aiming machinery broke, further limiting the Kepler’s field of view.  Although the universe of potential experiments in psychology is genuinely infinite, while the Milky Way star population is immense but ultimately finite, the analogy between xenoplanetology and psychology is a good one.  We cannot know that the Kepler’s chosen sample from the stellar population is representative of those stars most likely to harbor a planet that sustains life.  We are simply taking the best shot we know how to take at this time.  In fact, the acts of orbiting the telescope, using it to find planets, and coping with its mechanical failures have precipitated innovations in how to search for new planets.  We have similarly drawn a tiny but reasonable sample of the possible psychology experiments.  Our evidence is spotty, but remains the sum of the best efforts of very thoughtful people.  I would say the same, however, of all those clinicians who made abundant errors generalizing from their first case studies.  Great care needs to be used in deciding what is true and what is not on the basis of fragmentary evidence-based science.

Perhaps the Firesign Theater was right.  Everything you know is wrong!  With the Reproducibility Project data in and counted, we are about halfway there.  Even before checking our privilege, it would be well to check our biases.

2015 Russell J Stambaugh, Ann Arbor, MI. All rights reserved.
