CIA Headquarters, Langley, Virginia
CIA document cover from the declassified Stargate archive.

The official story of Project Stargate ends in 1995 with a CIA review that determined remote viewing had not been proven to work. That summary is accurate as far as it goes. What it leaves out is the finding that complicated the recommendation so much that the two lead reviewers could not agree on what it meant.

In the fall of 1995, the American Institutes for Research (AIR) released a report commissioned by the CIA at the behest of Congress. The context was post-Cold War budget consolidation. The STARGATE program had been running since the 1970s at Stanford Research Institute and later at Science Applications International Corporation (SAIC), consuming roughly $20 million in government funding. With the Soviet threat diminished, the CIA needed to justify continued spending on controversial programs.

Two statisticians were selected to review two decades of remote viewing experiments. Jessica Utts was a professor of statistics at UC Davis, known for work favoring the statistical validity of anomalous cognition research. Ray Hyman was a psychologist at the University of Oregon and a skeptic of psi phenomena. They represented genuinely different orientations: not assigned positions, but longstanding intellectual commitments. What they found in the data would be consistent enough to trouble the simple narrative that both camps wanted.

The Scope of the Evidence Reviewed

The 1995 AIR review did not evaluate the entire STARGATE program from scratch. Instead, the reviewers examined existing databases of documented experiments. Utts' analysis was based on the comprehensive database of 154 remote viewing experiments conducted at Stanford Research Institute between 1973 and 1988. These 154 experiments represented over 26,000 individual trials: roughly 20,000 were forced-choice trials, in which subjects selected targets from a limited set, and more than 1,000 were free-response laboratory remote viewings.

This was not anecdotal testimony or field reports. These were controlled laboratory experiments with documented protocols, randomized target selection, and blinded evaluation procedures. The experimental design had evolved significantly from the early 1970s work; by the later experiments in the reviewed database, controls had been tightened substantially.

Edwin May, the principal investigator and scientific director of the remote viewing program at SRI, had guided the experiments that populated this database. When the program transferred to SAIC in 1991, May continued his role, controlling approximately 70% of the contractor funds and 85% of the data. The AIR reviewers gained access to these records directly. Utts and Hyman reviewed not just published papers but the raw experimental data and protocols.

What the Numbers Said

Utts' statistical analysis of the 154 SRI experiments found what she termed "anomalous cognition": performance significantly above chance expectation on direct statistical measures. Specifically, subjects scored between 5 and 15 percentage points above chance, across enough trials to achieve statistical significance.

To a layperson, 5 to 15 percentage points sounds marginal. In controlled experimental conditions, with randomized targets and blinded evaluation, it is not. Consider: if you guess randomly at which of four targets is correct, you expect to be right 25% of the time by pure chance. Consistent performance at 30-40% correct, a 5-15 point improvement, across thousands of trials produces statistical values that would occur through random fluctuation perhaps once in millions of chances.
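To see the arithmetic, here is a minimal sketch in Python, using hypothetical hit rates rather than the actual SRI data, of how a fixed elevation above a 25% chance baseline becomes overwhelming as trials accumulate:

```python
# A minimal sketch (hypothetical hit rates, not the actual SRI data):
# the same 7-point elevation above chance, at growing trial counts.
from scipy.stats import binomtest

CHANCE = 0.25  # four-choice task: 25% correct expected by guessing

for n_trials in (100, 1_000, 10_000):
    hits = round(n_trials * 0.32)  # a steady 32% hit rate
    result = binomtest(hits, n_trials, CHANCE, alternative="greater")
    print(f"{n_trials:>6} trials: {hits} hits -> p = {result.pvalue:.3g}")

# Roughly: p ~ 0.07 at 100 trials, ~2e-7 at 1,000 trials, and far
# below 1e-50 at 10,000 trials -- the elevation itself never changed.
```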

Utts calculated the combined probability of observing the results across the SRI database if chance alone were operating: a p-value of 10 to the negative 20th power (10^-20). In lay terms, the odds against chance alone producing these results are roughly 10^20 to one, a 1 followed by 20 zeros. This is the kind of statistical threshold that, in conventional science, commands attention.
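Figures like this come from pooling many studies. The sketch below shows the standard machinery, Stouffer's method of combining independent p-values, with invented per-study values rather than anything from the AIR review:

```python
# A hedged sketch of how independent studies pool: Stouffer's method
# converts each p-value to a z-score, sums them, and renormalizes.
# These per-study p-values are invented for illustration.
from scipy.stats import combine_pvalues

study_pvalues = [0.03, 0.008, 0.04, 0.001, 0.02]  # hypothetical studies

stat, pooled_p = combine_pvalues(study_pvalues, method="stouffer")
print(f"combined z = {stat:.2f}, pooled p = {pooled_p:.2g}")
# Five individually modest results pool to p ~ 3e-7; scaled across a
# database of thousands of trials, figures like 10^-20 become possible.
```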

Within the SAIC program's work from 1991-1994, six of the ten anomalous cognition experiments showed statistically significant results. Two of these reached extreme significance levels. One experiment produced a p-value of 10^-9, meaning the odds against chance were one billion to one. When Utts examined only the most methodologically rigorous subset of experiments, those controlled most tightly against potential artifacts, the above-chance effects persisted.

"It is clear to this author that anomalous cognition is possible and has been demonstrated. This conclusion is not based on belief, but on conventional scientific standards."

— Jessica Utts, UC Davis, 1995 AIR Review.

The implications of this language matter. Utts was not claiming that remote viewers could reliably pinpoint military targets or read adversaries' minds. She was saying that the statistical patterns in the data were inconsistent with the null hypothesis of random guessing. She was making a narrow, specific claim: these experimental results show departures from chance expectation that meet the ordinary standards of statistical significance used throughout science.

Utts' Argument for Replication Across Labs

One of Utts' key observations was that the effect sizes—the magnitude of the above-chance performance—showed consistency across laboratories. This was important because if one laboratory was producing anomalous results due to some quirk or artifact specific to that lab, you would expect the effect to vanish or radically change when examined elsewhere. Instead, Utts found that different laboratories, using slightly different protocols and different experimenters, obtained effect sizes that clustered in the same range.
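A rough illustration of what such a consistency check looks like, with hypothetical labs and counts: dividing each lab's z-score by the square root of its trial count yields a per-trial effect size that can be compared across labs of different sizes.

```python
# A rough sketch of the cross-lab consistency check (hypothetical lab
# names, trial counts, and hit counts). Dividing z by sqrt(n) gives a
# per-trial effect size comparable across labs of different sizes.
import math

CHANCE = 0.25
labs = {"Lab A": (900, 290), "Lab B": (1200, 378), "Lab C": (600, 189)}

for name, (n, hits) in labs.items():
    z = (hits - n * CHANCE) / math.sqrt(n * CHANCE * (1 - CHANCE))
    effect_size = z / math.sqrt(n)
    print(f"{name}: hit rate {hits / n:.1%}, effect size {effect_size:.3f}")

# Effect sizes clustering near the same value (~0.15 here) despite
# different sample sizes is the pattern Utts pointed to.
```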

This observation aligned with contemporaneous research in other domains. In 1994, Daryl Bem and Charles Honorton published a meta-analysis in Psychological Bulletin examining approximately 50 ganzfeld studies from 10 different laboratories. The ganzfeld is a different experimental protocol (it uses perceptual isolation rather than remote viewing), but it tests for the same underlying phenomenon: direct transfer of information through channels that conventional neuroscience cannot account for. Bem and Honorton found that the ten most rigorous autoganzfeld experiments produced a statistically significant hit rate of 32.2%, with z = 2.89 and p = 0.002. The consistency of results across laboratories was a key feature of their meta-analytical conclusion.

Dean Radin, a researcher who worked at Princeton's Engineering Anomalies Research laboratory in the early 1990s, conducted presentiment experiments, testing whether human physiological responses showed measurable changes seconds before exposure to randomly selected emotional versus calm images. In 1995, Radin and colleagues reported that in initial experiments, the presentiment effect appeared with odds against chance of 500:1. When combined across four separate experiments, the odds reached 125,000:1 in favor of a genuine effect. These experiments were double-blind with randomized target selection, and Radin explicitly controlled for sensory cues, data collection errors, selective reporting, and various anticipatory strategies.

The consistent pattern across remote viewing, ganzfeld, and presentiment experiments—different protocols, different laboratories, years apart—was what allowed Utts to argue that the phenomenon was not a laboratory artifact of one group or one method.

The Statistical Critique: Ray Hyman's Objections

Ray Hyman did not dispute Utts' statistical calculations. He agreed that the data showed departures from chance expectation. His disagreement centered on what those departures meant. Hyman raised several methodological concerns that, in his view, were substantial enough to invalidate the psi interpretation.

The first was the multiple-analysis problem. Hyman argued that May and his collaborators had conducted numerous statistical tests and analyses on the SAIC data, examining it from different angles, with different analytical approaches. If you conduct enough statistical tests, Hyman reasoned, some will cross the threshold for statistical significance purely by chance. This is true: if you test a hypothesis at the p = 0.05 level (one in twenty), and you conduct twenty independent tests, you expect one false positive on average. The question Hyman raised was whether the experimenters had disclosed all analyses conducted—whether selective reporting had inflated the apparent effect.
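The sketch below simulates exactly this scenario, with no STARGATE data involved: twenty tests on data where the null hypothesis is true by construction.

```python
# A small simulation of the multiple-analysis concern (pure
# illustration): run twenty tests on data whose true mean is zero,
# and count how many cross p < 0.05 anyway.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(seed=7)
N_TESTS = 20

false_positives = 0
for _ in range(N_TESTS):
    sample = rng.normal(loc=0.0, scale=1.0, size=50)  # null is true
    _, p = ttest_1samp(sample, popmean=0.0)
    false_positives += p < 0.05

print(f"{false_positives} of {N_TESTS} null tests 'significant' at p < 0.05")
# Expectation: about one. A Bonferroni-style correction would test
# each analysis at 0.05 / 20 instead.
```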

Second, Hyman pointed to the judging issue. In free-response remote viewing experiments, a subject produces written impressions of a target. An independent evaluator then ranks the actual target against decoys. For this judging process to be properly blinded, the judge should have no information about when experiments were conducted, who the subject was, or anything else that might introduce bias. Hyman noted that Edwin May had served as the principal judge in most, though not all, of the experiments. This meant that May, who was deeply invested in finding evidence for remote viewing, controlled a step of the process that involved considerable subjective judgment.

Third, Hyman invoked the file-drawer problem. This is a real statistical concern: if experimenters conduct many experiments, and only publish the ones with positive results, a meta-analysis of the published record will overestimate the true effect. Hyman suggested that May's group might have conducted more experiments than published, and that failures had been filed away without disclosure.
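One standard way to quantify a file-drawer concern, sketched here with hypothetical per-study z-scores, is Rosenthal's fail-safe N: the number of unreported null studies that would have to exist to pull a combined result back to non-significance.

```python
# A hedged sketch of Rosenthal's fail-safe N. The per-study z-scores
# here are hypothetical, not SAIC values.
study_z = [1.9, 2.4, 1.8, 3.1, 2.1]   # hypothetical per-study z-scores
k = len(study_z)
z_crit = 1.645                         # one-tailed p = 0.05 threshold

# Rosenthal (1979): N_fs = (sum of z)^2 / z_crit^2 - k
n_failsafe = (sum(study_z) ** 2) / (z_crit ** 2) - k
print(f"fail-safe N ~ {n_failsafe:.0f} hidden null studies")
# If the answer is large, selective reporting alone is an implausible
# explanation; if small, the file drawer could account for the result.
```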

Fourth, Hyman argued that sensory leakage—subtle cues that allowed subjects to guess targets through normal sensory channels rather than anomalous cognition—remained possible despite the protocols in place. A stray word from the experimenter, a chance correlation between target location and room temperature, an unconscious tendency to select targets based on time patterns: Hyman argued that the controls, though tightened considerably from early SRI work, were still not tight enough to rule out all such possibilities.

Importantly, Hyman agreed that the SAIC experiments were substantially more rigorous than the early SRI experiments. He acknowledged that these later studies were "free of the more obvious and better known flaws that can invalidate the results of parapsychological investigations." But he held that methodological rigor and positive results are not the same as proof of an anomalous phenomenon. Absence of obvious flaws does not guarantee absence of subtle ones.

The Convergence and Divergence

This is where the 1995 AIR review becomes historically interesting. Utts and Hyman did not disagree on the facts. They disagreed on the interpretation. Both acknowledged the statistical departures from chance. Both agreed that the SAIC experiments showed improvements in methodology over SRI's earlier work. Both agreed that something in the data exceeded random fluctuation.

Utts argued that by conventional standards of scientific evidence, when multiple independent laboratories find consistent effect sizes in the same direction, when effect sizes persist even in methodologically rigorous subsets of experiments, and when the overall statistical probability is so extreme that it would occur perhaps once in 10^20 trials under the null hypothesis, the conclusion should be that an effect has been demonstrated. The source of that effect—whether it is genuine anomalous cognition or an undiscovered systematic bias—would then be the subject for further investigation.

Hyman held that the effect size, though statistically significant, was not large enough to overcome the remaining methodological concerns. He believed that subtle biases, not yet identified but still possible, could plausibly explain a 5-15 percentage point elevation above chance across thousands of trials. He recommended continued research with tighter controls, but he did not accept that anomalous cognition had been proven.

A Closer Look at the Judging Step

The question of the judge—particularly Edwin May's central role in evaluating results—deserves expanded attention. In a standard remote viewing experiment, the process flows like this: (1) a target is selected by a randomization procedure, (2) a subject, isolated from the target location, produces written or spoken impressions, (3) these impressions are then given to an independent judge, (4) the judge, working blind to which target the subject was assigned, ranks the impressions against the actual target and several decoys. The judge's task involves subjective judgment about how well the impressions match each candidate target.
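Under the null hypothesis, this rank-based statistic has a distribution that is easy to simulate. The sketch below uses illustrative numbers (forty sessions, five candidate targets, a hypothetical observed rank sum):

```python
# A minimal simulation of the judging statistic (illustrative numbers):
# under the null hypothesis, the true target's rank among five
# candidates is uniform on {1..5}, so the sum of ranks across sessions
# has a known distribution we can sample.
import numpy as np

rng = np.random.default_rng(seed=0)
N_SESSIONS, N_CANDIDATES, N_SIMS = 40, 5, 100_000

null_rank_sums = rng.integers(
    1, N_CANDIDATES + 1, size=(N_SIMS, N_SESSIONS)
).sum(axis=1)

observed_sum = 100  # hypothetical: mean rank 2.5 vs. 3.0 under chance
p_value = (null_rank_sums <= observed_sum).mean()
print(f"p = {p_value:.4f}")  # lower rank sums mean better matching
```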

May's dual role as investigator and judge introduced what is called an experimenter expectancy problem. Research in psychology, most famously demonstrated by Rosenthal in the 1960s, shows that experimenters' expectations unconsciously influence their behavior in ways that can affect outcomes. A judge who hopes to find a match between impressions and target might unconsciously weigh ambiguous phrases more favorably if they point toward the true target. This is not conscious fraud; it is a well-documented psychological phenomenon.

Utts addressed this concern directly by examining the subset of experiments where the judge was not May. Even in these cases, she found above-chance performance. Hyman acknowledged that this subset showed better methodological controls but argued that the effect size in these cases was smaller, consistent with his thesis that bias rather than anomalous cognition explained the overall result.

Edwin May's Response and Post-Termination Research

Edwin May provided the AIR review team with detailed documentation of SAIC experiments conducted between 1991 and 1994. His memo of July 25, 1995, listed ten experiments chronologically, with trial counts, effect sizes, and p-values. May did not dispute the statistical findings, but he argued vigorously against the claim that methodological flaws explained them. He maintained that the controls in the SAIC experiments, while not perfect, were stringent enough to rule out the major sources of bias Hyman cited.

After 1995, May continued his research outside government funding. He published extensively on anomalous cognition, including the four-volume scholarly work compiled with Sonali Bhatt Marwaha, "The Star Gate Archives," whose second volume covers the remote viewing research of the SAIC period in detail. In his subsequent publications, May argued that the effects found at SAIC remained robust and called for theoretical investigation, not dismissal.

The CIA's Decision and Its Context

Cover page of the AIR review commissioned by the CIA
The official AIR review was a CIA-commissioned document evaluating two decades of remote viewing research. The statistical findings were reported but not prominently featured in the public press release.

The CIA's actual decision came in late 1995, not from Utts or Hyman, but from the intelligence community's leadership. By this time, George Tenet was Deputy Director of Central Intelligence. (Tenet would later become CIA Director in 1997.) The post-Cold War environment was reshaping intelligence priorities. The Soviet Union had dissolved in 1991. Threat assessment had shifted from conventional military competition to emerging terrorism, proliferation, and regional instability. STARGATE, a program born from Cold War competition over intelligence sources, had to justify its existence in new terms.

The CIA issued a press release on November 28, 1995, announcing the termination of the remote viewing program. The language was definitive: the program had not been operationally useful and had not been scientifically proven to work. The press release did not mention Utts' statistical findings. It did not quote her conclusion that "anomalous cognition is possible and has been demonstrated." Instead, it led with the parts of the evaluation that stressed remaining methodological concerns and the lack of operational validation.

This was not technically dishonest; Hyman's report was also part of the official AIR review. But it was a selective presentation. The full report, when published, contained both conclusions; when the CIA went public, it led with the skeptical interpretation.

Understanding the Statistical Language

Utts' statement that the effect size was "consistent across laboratories" requires careful interpretation. She was not claiming that every experiment succeeded or that success rates were uniform. Rather, she found that when significant effects appeared, they tended to appear with similar magnitudes across different labs. If effect sizes had varied wildly—small at one lab, enormous at another—that would suggest that local factors, not a genuine phenomenon, were driving the results. Consistency in effect magnitude is a feature that suggests a real phenomenon rather than laboratory artifacts.

The p-value of 10^-20 across the SRI database is a statistical statement about one specific dataset: if the null hypothesis (pure chance) were true, this result would occur roughly once in 10^20 runs of identical experiments, about a hundred quintillion. This is not an exaggeration; it is a literal statement of the mathematical probability. What it does not tell us is whether that data came from a real effect or from a systematic bias that inflates results in a particular direction.

This is precisely where Utts and Hyman diverged. Utts argued that standard scientific practice is to accept the most straightforward interpretation of statistically robust data, which is that an effect has been demonstrated. Further investigation would then focus on understanding the mechanism and ruling out remaining biases through tighter controls. Hyman argued that in the domain of psi—where prior scientific knowledge gives no mechanism for how anomalous cognition could work—the bar for evidence should be higher. Not impossibly high, but higher than standard significance thresholds alone.

The Broader Pattern: Institutional Resistance to Anomalous Data

The 1995 termination of STARGATE occurred within a pattern familiar from the history of science. When data contradicts established models of how the world works, institutions often resist accepting the data rather than revising the models. This is not unique to psi research; it is a general feature of scientific conservatism.

Consider the case of Helicobacter pylori and gastric ulcers. For decades, the medical establishment held that stomach ulcers were caused by excess acid and stress. The therapeutic approach was acid reduction. In the 1980s, Australian physicians Barry Marshall and Robin Warren identified a bacterial infection as the cause of most ulcers. The medical establishment resisted this finding vigorously. Only after years of accumulating evidence and Marshall's famous self-infection experiment did the hypothesis gain acceptance. By 2005, the paradigm had shifted decisively. The bacterium was recognized as the causal agent, and treatment shifted to antibiotic therapy. Marshall and Warren won the Nobel Prize in Physiology or Medicine in 2005.

Similarly, continental drift was proposed by Alfred Wegener in 1912. The idea seemed absurd to most geologists; there was no known mechanism by which continents could move. The geological establishment rejected it. Not until the 1960s, when plate tectonics provided a mechanism and accumulated paleomagnetic and seismic evidence became overwhelming, did the theory gain mainstream acceptance. Wegener died in 1930, having seen his theory ridiculed and dismissed during his lifetime.

These cases illustrate a pattern: anomalous data that contradicts the established paradigm faces institutional resistance. The resistance is not necessarily irrational; it reflects a reasonable conservatism about overturning well-established models. But it can delay acceptance of valid data by decades. In the case of STARGATE, the data appeared anomalous with respect to conventional neuroscience. The statistical evidence was strong, but not so overwhelming that it forced immediate acceptance. Institutional factors—budget pressures, skepticism about psi as a domain, political context—influenced the decision to terminate.

What Utts Recommended

Utts did not recommend terminating STARGATE. She recommended the opposite: a targeted investigation program to understand the conditions under which anomalous cognition appeared most reliably, with explicit focus on mechanism and on tightening controls further against the specific biases Hyman had identified. She proposed using the above-chance effect as a starting point for research, not as a final conclusion.

She suggested focusing on a subset of promising subjects, with longer training periods and more extensive trials per subject. She recommended independent replication by skeptical investigators. She proposed explicit protocols to address the judge issue and the multiple-analysis problem. She advocated for preregistration of analyses to prevent file-drawer bias. In other words, Utts called for science to do what science does: accumulate evidence, tighten controls, investigate anomalies.

The CIA did not pursue this path. Instead, it terminated the program.

What It Means for Practice

For an individual practitioner of remote viewing, the statistical debate translates into a practical implication. If an anomalous effect exists, even a small and inconsistent one (performance 5 to 15 percentage points above chance, not 50), then tracking accuracy across many trials is the only way to identify whether you personally exhibit the effect.

You cannot discern a 5 percentage point above-chance effect in five sessions. The random variation would overwhelm the signal. Across 100 sessions, the pattern might emerge. Across 500 sessions, it becomes clearer. This is why serious practitioners maintain detailed records of target matches and misses. The argument for keeping records is the same argument Utts made to the CIA: if there is a signal in the noise, more data is how you find it.
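A back-of-envelope power calculation, under assumptions of our own rather than the review's (a four-choice task, a true hit rate of 30% versus 25% chance, one-tailed alpha of 0.05, 80% power), shows why those session counts are the right order of magnitude:

```python
# A rough power sketch under stated assumptions (not from the review):
# four-choice targets, a true hit rate of 30% vs. 25% chance,
# one-tailed alpha = 0.05, 80% power. How many scored sessions?
import math

p0, p1 = 0.25, 0.30            # chance vs. hypothesized true hit rate
z_alpha, z_beta = 1.645, 0.84  # one-tailed 0.05; 80% power

n = ((z_alpha * math.sqrt(p0 * (1 - p0))
      + z_beta * math.sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2
print(f"~{math.ceil(n)} sessions")  # on the order of 500, matching the text
```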

The statistical significance found in laboratory experiments—particularly in the SAIC database where effect sizes were measured across thousands of trials—suggests that if you practice over extended periods and record your results carefully, patterns may eventually become visible that would not emerge in casual or sporadic practice.

Track your accuracy over time

PsionicAssist scores every session against verified targets and builds your personal accuracy record. The statistics only become meaningful across many trials. Start yours.

Begin Training →