Quantum probability updating from zero priors (by-passing Cromwell’s rule)

Cromwell's rule (also known as the zero priors paradox) refers to the constraint of classical probability theory that if one assigns a prior probability of 0 or 1 to a hypothesis, then the posterior has to be 0 or 1 as well (this is a straightforward implication of how Bayes' rule works). Relatedly, hypotheses with a very low prior cannot be updated to have a very high posterior without a tremendous amount of new evidence to support them (or to make other possibilities highly improbable). Cromwell's rule appears at odds with our intuition of how humans update probabilities. In this work, we report two simple decision making experiments whose results seem inconsistent with Cromwell's rule. Quantum probability theory, the rules for assigning probabilities that derive from the mathematical formalism of quantum mechanics, provides an alternative framework for probabilistic inference. An advantage of quantum probability theory is that it is not subject to Cromwell's rule and can accommodate changes from zero or very small priors to significant posteriors. We outline a model of decision making, based on quantum theory, which can accommodate the changes from priors to posteriors observed in our experiments.


Introduction
Probability theory is at the heart of our effort to understand the workings of the human mind: as cognitive agents, we have to survive and flourish in an inherently uncertain world. Therefore, it is a reasonable hypothesis that the principles that guide cognition are those of a formal probability theory. The most widely employed and successful probability theory in cognitive science is classical probability (CP) or Bayesian theory. CP cognitive models have been the basis for successful cognitive explanations of many aspects of behavior, including decision making, learning, categorization, perception, and language. Moreover, CP has provided a foundation for issues key to human existence, such as the question of rationality (for overviews see Griffiths et al., 2010; Oaksford & Chater, 2007; Tenenbaum et al., 2011). Yet there are some behavioral situations that prove problematic from a CP perspective and challenge the universal applicability of CP theory in cognition. The focus of this paper is one such situation, related to probabilistic updating.
Probabilistic updating concerns the rules for how we should revise our estimate of the probability of a hypothesis, given some new information. According to CP, our belief in different hypotheses should be updated using Bayes' law, which states that

p(H_i | D) = p(D | H_i) p(H_i) / p(D).    (1)

Here H_i is a particular hypothesis and D is the observed data. In the set-theoretical paradigm, Eq. (1) is equivalent to

p(H_i | D) = p(H_i ∩ D) / p(D),

where p(H_i ∩ D) is the joint probability of the hypothesis and the data. Bayes' rule is highly intuitive, indeed to the point that it is hard to envisage alternatives. Our research program has largely been concerned with exploring a particular alternative probabilistic framework, which reveals alternative intuitions about probabilistic inference. One important characteristic of Bayes' law is its stringent linearity: updated probabilities are linearly dependent on the priors. Thus, Bayes' rule must satisfy a very important property: when the prior is zero, then regardless of the data that we observe, the posterior probability p(H_i | D), as calculated by Bayes' law, has to be zero as well. Likewise, when the prior is one, the posterior has to be one as well. In fact, the earliest documented instance of this problem goes back to the 17th century. Oliver Cromwell allegedly implored the members of the synod of the Church of Scotland "to think it possible that you may be mistaken" (Carlyle, 1885). The argument goes that (if one cares to stay open-minded) one should assign some small possibility to even the most improbable state of affairs. Otherwise, however much evidence is subsequently accumulated in favor of the zero-prior possibility, if one employs Bayes' law for probability updating, one will be trapped with all-zero posteriors. This observation has been called Cromwell's rule, but we can also call it a zero priors trap.
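The zero-prior trap is easy to verify numerically. The sketch below (illustrative only; the prior and likelihood values are made up) applies Bayes' rule to three hypotheses, one of which starts with a prior of exactly zero:

```python
import numpy as np

def bayes_update(priors, likelihoods):
    """Classical Bayesian updating: posterior is proportional to likelihood x prior."""
    priors = np.asarray(priors, dtype=float)
    likelihoods = np.asarray(likelihoods, dtype=float)
    joint = likelihoods * priors          # p(H_i and D)
    return joint / joint.sum()            # normalize by p(D)

# Three hypotheses; the third has a prior of exactly zero.
priors = [0.6, 0.4, 0.0]
# Evidence that strongly favors the third hypothesis.
likelihoods = [0.1, 0.1, 0.9]
posterior = bayes_update(priors, likelihoods)
print(posterior)  # the third entry is still exactly 0
```

However strongly the evidence favors the third hypothesis, its posterior remains pinned at zero.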
But the problem is more general than just concerning prior hypotheses which are entirely possible or impossible, since under Bayes' law the ratio of the posteriors is constrained by the ratio of the priors (posterior odds equal prior odds times the likelihood ratio). This means that, for initially very unlikely hypotheses, whatever evidence we observe, it will be practically impossible to attain high posteriors. Behaviorally, this constraint seems implausible, and it is the purpose of the present paper to explore this intuition formally and empirically. Note that the validity of Bayesian updating has been questioned before, for example, by Van Wallendael and Hastie (Robinson & Hastie, 1985; Van Wallendael, 1989; Van Wallendael & Hastie, 1990), who noted that, upon receiving information about one hypothesis, people tend to revise only the corresponding probability and leave their other estimates untouched (so that the total fails to equal one).
A natural domain for testing Cromwell's rule is detective stories, where the criminal turns out to be someone completely unexpected (as in Dostoyevsky's The Brothers Karamazov, where, in full compliance with CP rules, prior beliefs about the innocence of the actual criminal were so strong that his complete confession, together with other hard evidence, failed to convince the court) or not even a person (as in Poe's The Murders in the Rue Morgue). We present a simple experimental paradigm, based on a crime mystery, where all suspects are listed beforehand. The information about the crime mystery that participants initially see makes some suspects impossible or extremely unlikely. We then provide additional information, which makes some of these initially impossible options likely. Do participants update their beliefs in a way that is inconsistent with Bayes' law? That is, do participants produce evaluations of posterior probabilities that exceed the prior probabilities by amounts that go beyond what is allowed by Bayes' law? We will demonstrate that the probabilities given by participants in the course of this crime-solving paradigm strongly violate Bayesian updating. In fact, in 20% of cases, participants updated a prior of close to zero to a very high posterior, in a single step.
Seeking to obtain evidence against Bayes' law in decision making raises the question of whether there may be an alternative formal framework for understanding probability updating, for those situations where Bayes' law may be problematic. Are there principles for probabilistic updating that allow the impossible to become possible? The standard way to deal with such situations in CP theory is additive smoothing, but this is typically applied a posteriori. For example, after we identify a new possibility, we can reshuffle the prior probabilities, adding a "pseudo-count" to all the priors, so that no option on the (a posteriori!) list has a zero prior. There are other approaches (Hoppe, 1984), such as the Pólya urn model, directly enforcing an increase in the probability of a less probable event each time the more probable event happens. For example, starting with an urn containing only (or mostly) black balls, we add a white ball each time a black ball is drawn. But what about options (e.g., balls of some new color) that are considered completely impossible at the outset? Yet a third possibility involves Shafer's representation of belief states (Shafer, 1976), where possible options are grouped (so that one has beliefs about the groups rather than individual hypotheses). In Section 4 we briefly consider whether this approach can address the zero priors paradox. One probabilistic framework that is not constrained by Cromwell's rule is quantum probability theory (QPT). In QPT, the rule for updating probabilities, called the von Neumann-Lüders rule, allows arbitrarily large increases in posterior probabilities, even if the priors are zero, when there are discontinuous changes in belief states (resulting from measurements).
We call QPT the rules for how to assign probabilities to events from quantum mechanics, without any of the physics. QPT is in principle applicable in any situation where there is a need to formalize uncertainty. The motivation for employing QPT in psychology concerns exactly those situations where the prescription from CP is at odds with human behavior or intuition. For example, in decision making, such situations correspond to findings that human behavior violates CP principles, such as the law of total probability or the commutativity of conjunctions (Conte et al., 2007; Croson, 1999; Hofstadter, 1983; Shafir & Tversky, 1992; Tversky & Kahneman, 1983). Corresponding QPT cognitive models have achieved a measure of descriptive success, usually through assumptions that particular variables or questions are incompatible (overviews in Busemeyer & Bruza, 2012; Pothos & Busemeyer, 2013; Haven & Khrennikov, 2012; Khrennikov, 2003). Note that a uniform assumption in such models is that they are all effectively hidden-variable models, that is, that there is no 'true' quantum structure in cognition (Yearsley & Pothos, 2014; cf. Dubois & Lambert-Mogiliansky, 2015).
QPT provides an entirely different framework for probabilistic inference (compared to the standard CP one), where belief states are represented by vectors in a Hilbert space and observables by Hermitian operators. We shall outline the machinery from QPT relevant to Cromwell's rule in the next section. In closing the introduction, it is worth noting that, in exploring alternative probabilistic frameworks, QPT is not the only choice; there are options that are even more non-classical than QPT. For example, there are probability models that accommodate "negative probabilities", regularly encountered both in physics and cognitive psychology (Acacio de Barros, 2014; Acacio de Barros & Oas, 2014; see also Narens, 2015).

Principles of Quantum-like Updating
The QPT approach to probability updating involves the formalism of a complex Hilbert space H (for representing belief states) and the theory of Hermitian operators (for representing observables, including decision operators).
In QPT, (pure) belief states are represented by normalized vectors in a Hilbert space H (a complete vector space, where the scalar product between vectors is defined and denoted here with angle brackets). The scalar product determines the norm of any vector ψ in H via ||ψ||² = <ψ, ψ>. Vector components are, in general, complex numbers. Given a complex number z = u + iv, we denote its conjugate as z* = u - iv. The number of vector components is, in general, infinite. Here we confine ourselves to an m-dimensional Hilbert space, with orthonormal basis {e_1, e_2, …, e_m}. H can then be represented as the complex coordinate space C^m, where the scalar product <ψ, φ> of vectors ψ = c_1 e_1 + c_2 e_2 + … + c_m e_m and φ = k_1 e_1 + k_2 e_2 + … + k_m e_m is given by c_1 k_1* + c_2 k_2* + … + c_m k_m*. Operators are simply Hermitian matrices. A square matrix A with elements a_ij is Hermitian if it is equal to its conjugate transpose, that is, a_ij = a_ji*.
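As a minimal numerical illustration of these definitions (the vectors and the matrix below are arbitrary examples), the following sketch computes a scalar product with conjugation, verifies Hermiticity, and confirms that the eigenvalues of a Hermitian matrix are real:

```python
import numpy as np

# Two vectors in C^3 with complex components (arbitrary example values).
psi = np.array([1 + 1j, 0.5, -1j])
phi = np.array([0.2, 1j, 1.0])

# Scalar product <psi, phi> = c1 k1* + c2 k2* + c3 k3*, as in the text.
inner = np.sum(psi * np.conj(phi))

# Squared norm ||psi||^2 = <psi, psi> is real and non-negative.
norm_sq = np.sum(psi * np.conj(psi)).real

# A Hermitian matrix equals its conjugate transpose: a_ij = a_ji*.
A = np.array([[2.0, 1 - 2j],
              [1 + 2j, 3.0]])
assert np.allclose(A, A.conj().T)

# Its eigenvalues (the possible measurement outcomes) are real.
evals = np.linalg.eigvalsh(A)
print(inner, norm_sq, evals)
```

Here `np.linalg.eigvalsh` exploits the Hermitian structure and returns real eigenvalues directly.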
The eigenvalues of a Hermitian operator are the various possible values that can be obtained with a measurement. Immediately after a measurement, the state of the system is given by a projection of the initial state onto the eigen-subspace corresponding to the eigenvalue that was obtained as the result of the measurement. The latter statement is known as the von Neumann-Lüders projection postulate.
In QPT, observables are distinguished by whether or not they commute. For commuting observables, measurement order does not matter and, in most situations, such observables can be considered classical. For non-commuting observables, measurement order can produce different results. For non-commuting observables, also called incompatible, the features of quantum probability updating differ crucially from the well-known features of Bayesian probability updating. In QPT, the von Neumann-Lüders postulate is used to update the prior state ψ_0, by means of a projector operator E, to a new state:

ψ_1 = E ψ_0 / ||E ψ_0||.    (2)

In CP theory there is a set of hypotheses or "states" Θ = {θ_1, θ_2, …, θ_m}, which are mutually exclusive (only one of them is actually true) and exhaust the whole probability space, so that the sum of all probabilities π(θ_i) is one. More generally, when Θ is not necessarily discrete, one can speak about a probability density π(θ) on Θ rather than discrete probabilities π(θ_i).
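The projection postulate (2) can be sketched as follows, with an arbitrary illustrative state and projector:

```python
import numpy as np

def luders_update(psi0, E):
    """von Neumann-Luders rule: psi1 = E psi0 / ||E psi0||."""
    projected = E @ psi0
    norm = np.linalg.norm(projected)
    if norm == 0:
        raise ValueError("the measured outcome had zero probability")
    return projected / norm

# An illustrative belief state over three basis hypotheses.
psi0 = np.array([np.sqrt(0.5), np.sqrt(0.5), 0.0])

# Projector onto the subspace spanned by e2 (outcome: hypothesis 2 is true).
E = np.diag([0.0, 1.0, 0.0])
psi1 = luders_update(psi0, E)
print(np.abs(psi1) ** 2)  # probability now concentrated on hypothesis 2
```

If the projected vector were zero (a zero-probability outcome), the rule would be undefined, which is why the sketch guards against it.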
Constructing a large enough state space is an important first step in a CP model. From the start, we have to account even for the most inconceivable possibilities by introducing the corresponding states into consideration (e.g., that the moon is made of green cheese). In contrast, in the QPT model, see below, we are free to assign zero priors to possibilities that are initially considered impossible.
Classically, for a random variable X taking values in the set {x_1, x_2, …, x_n}, one can specify the probability distribution p(x|θ) for each state θ. We use two different letters, π and p, to distinguish between the probability distributions for the hypotheses and for the information X. Now, if the random variable X is measured, one can update the prior probability distribution on the basis of the information gained from the result of the measurement, say x_i. Classical probability updating gives us π(θ|x) according to Bayes' rule (leaving probabilities 0 and 1 invariant), as:

π(θ_i | x) = p(x | θ_i) π(θ_i) / Σ_j p(x | θ_j) π(θ_j).    (3)

Consider two observables: Θ with values θ_1, θ_2, …, θ_m and X with values x_1, x_2, …, x_n. The first corresponds to the initial set of hypotheses and the second to some additional information (which will be used for probability updating). In QPT, the observables are represented by Hermitian operators, which we can also denote as Θ and X, with eigenvalues θ and x, correspondingly. Then

Θ = Σ_θ θ E_θ,  X = Σ_x x F_x,    (4)

where summation is over all possible values θ or x, and E_θ and F_x are the orthogonal projectors corresponding to the eigen-subspaces of these operators. In QPT, for a particular person, the initial mental representation relevant to a situation is given by a belief state ψ_0 in H. This state encodes information about subjective probabilities for all conceivable states of nature. They can be extracted by performing direct measurements of Θ and taking the squared norm of the resulting vector:

π(θ) = ||E_θ ψ_0||² = <E_θ ψ_0, ψ_0>.    (5)

This observation procedure corresponds to decision making about the prior possibilities when the observer is in state ψ_0. 1 We want to update the probabilities of Θ on the basis of additional information from a measurement of X. By using Lüders' rule (2), we get

π(θ | x) = ||E_θ F_x ψ_0||² / ||F_x ψ_0||².    (6)

The result is not constrained to coincide with Bayesian probability updating. We shall address the question of when quantum probability updating coincides with classical updating elsewhere.
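A numerical sketch of the quantum updating rule, π(θ|x) = ||E_θ F_x ψ_0||² / ||F_x ψ_0||² (with an arbitrary illustrative state and projectors), shows that it reproduces the classical result when the projectors commute but can depart from it otherwise:

```python
import numpy as np

def quantum_posterior(psi0, E_theta, F_x):
    """pi(theta|x) = ||E_theta F_x psi0||^2 / ||F_x psi0||^2."""
    conditioned = F_x @ psi0
    return (np.linalg.norm(E_theta @ conditioned) ** 2
            / np.linalg.norm(conditioned) ** 2)

# Illustrative belief state with prior 0.7 on theta_1 and 0.3 on theta_2.
psi0 = np.array([np.sqrt(0.7), np.sqrt(0.3), 0.0])
E1 = np.diag([1.0, 0.0, 0.0])          # projector for hypothesis theta_1

# Commuting case: an F_x diagonal in the same basis behaves classically.
F_diag = np.diag([1.0, 1.0, 0.0])
p_commuting = quantum_posterior(psi0, E1, F_diag)

# Non-commuting case: F_x projects onto (e1 + e3)/sqrt(2).
v = np.array([1.0, 0.0, 1.0]) / np.sqrt(2)
F_tilt = np.outer(v, v)
p_incompatible = quantum_posterior(psi0, E1, F_tilt)

print(p_commuting, p_incompatible)  # 0.7 vs 0.5
```

In the commuting case the posterior for θ_1 stays at its Bayesian value of 0.7; with the tilted (incompatible) projector it moves to 0.5, which no classical conditioning on the same prior would produce.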
Let us apply these ideas to a situation analogous to the one in our empirical investigation. Suppose that some crime case is under investigation by a police officer. He has a list of suspects {θ_1, θ_2, …, θ_m}. A "classically thinking commissar" assigns probabilities π(θ_i), i = 1, 2, …, m, that the crime was committed by these suspects. Then, in the process of the investigation, he obtains new pieces of evidence related to this crime, which are encoded by some variable x. For each suspect θ_i, he assigns the probability p(x|θ_i) that the evidence given by x is consistent with the hypothesis that the crime was committed by suspect θ_i. Finally, he applies Bayes' formula (3) to get the probability π(θ_i|x) that θ_i is really responsible for this crime. If some person θ was not present in the initial list Θ, the commissar would never get a nontrivial π(θ|x). A "quantum thinking commissar" prepares the initial belief state ψ_0. It can be represented as a superposition of belief states corresponding to the suspects in the initial list:

ψ_0 = c_1 |θ_1> + c_2 |θ_2> + … + c_m |θ_m>.    (7)

The state of belief that some person θ, who was not present in the list {θ_1, θ_2, …, θ_m}, is the criminal means that information about θ is not present in the superposition (7). Moreover, this state is orthogonal to the superposition in (7), since its prior probability is zero. However, the structure of ψ_0 may change dramatically after encountering evidence x.
The new state (after the evidence x) can have a component which does not overlap with {θ_1, θ_2, …, θ_m}, but has some nonzero overlap with θ. In this case, the commissar would assign a nonzero probability (which may be quite substantial) that the person θ, who is not among {θ_1, θ_2, …, θ_m}, is the criminal. Intuitively, one can imagine that, together with obtaining the evidence x, a quantum commissar changes his own state too, that is, (in a sense) widens his horizons (see also Lambert-Mogiliansky et al., 2009).

1
By demanding that each person is able to assign prior probabilities to all possible outcomes (rather than make a judgment about one particular outcome), we have to assume that performing a "prior measurement" does not irrevocably modify the initial belief state ψ_0, or at least that ψ_0 can be perfectly reproduced and so used for further measurements. This may not sound plausible in physics, where the state collapses as the result of the measurement, but in studying cognition this assumption is natural. In physics one assumes (at least theoretically) a preparation procedure generating an ensemble of systems in the same state. A mental analog of this physical assumption about state preparation is that the brain, while solving a concrete problem, is able (and meant) to return to the same belief state after a judgment. However, for some mental contexts this assumption may be very restrictive. In principle, it is possible to proceed without it. But for the traditional Bayesian approach it is really important to start with a computation of all prior probabilities, since they are explicitly involved in the update rule (3).

The crucial feature of the quantum scheme is that it is about the updating of states, not probabilities. Once the state has been updated, we can obtain the posterior probabilities. Therefore, in principle we can proceed without the explicit assignment of the prior probabilities π(θ) given by (5), meaning that the prior measurement of the Θ-observable can be eliminated from the quantum scheme of probability updating. So, we can start simply with a preparation of the initial belief state ψ_0 and consider how it updates as a result of gaining information with the aid of the X-observable, see below.

Suppose the commissar initially focuses his suspicions on two suspects, so that his quantum belief state is c_1 e_1 + c_2 e_2, where the vectors e_1 and e_2 are orthogonal (meaning that the options are mutually exclusive; only one or the other is true). In this initial belief state, the options are also exhaustive, meaning that the probabilities sum to one, |c_1|² + |c_2|² = 1. Meanwhile, the actual state space may be three-dimensional, with basis vectors e_1, e_2, and e_3, though the commissar is probably not aware of this initially and the prior for e_3 is zero. Now, suppose he encounters evidence x, represented by a projector along the vector (e_2 + e_3)/√2. The commissar makes a measurement to examine whether x is true or not, given his initial state. As a result of this measurement, the state of the commissar is changed; for example, it would be projected onto this vector (if the evidence x is found to be true). Then, the commissar will find himself in a new belief state, which has some overlap with the basis vector e_3 as well. That is, in this new mental state, the suspect represented by vector e_3 is no longer considered impossible.
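The commissar example can be checked numerically (with the amplitudes set to c_1 = c_2 = 1/√2 for illustration):

```python
import numpy as np

# Commissar's initial belief: superposition over suspects e1 and e2,
# with zero amplitude on the unconsidered suspect e3.
c1, c2 = np.sqrt(0.5), np.sqrt(0.5)
psi0 = np.array([c1, c2, 0.0])

# Evidence x: rank-1 projector onto the vector v = (e2 + e3)/sqrt(2).
v = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)
E_x = np.outer(v, v)  # |v><v|

# Luders update after finding x to be true.
projected = E_x @ psi0
psi1 = projected / np.linalg.norm(projected)

print(np.abs(psi1) ** 2)  # e3 now carries probability 1/2
```

After the measurement of x, the suspect represented by e_3, whose prior was strictly zero, carries posterior probability 1/2.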

Experimental Investigation

Setup
We conducted two experiments to test whether participants' everyday/intuitive decision making is consistent with Bayes' law for probability updating. In both cases, we created simple scenarios based on a crime mystery. The experiments were matched in much of their detail, so we describe some of the common elements here. All participants were provided with a main story (approximately 620 words), which described a couple, John and Jane, their children (Chad and Cheryl), and a number of other persons, such as friends, John and Jane's gardener, etc. The crucial bit of information was that Jane is fond of her jewelry, which, though valuable, is kept in her bedroom in an unsecured jewelry box.
Participants are told that, on a particular Sunday evening, Jane discovers her jewelry has been stolen. They are then asked to rate the probability that each of the persons introduced in the main story is the culprit. The persons were described so that some of them were slightly more likely to be guilty than others. However, in all cases, Chad (Jane's son) and John (Jane's husband) were initially completely beyond suspicion (this was directly verified in our data). That is, the main story was constructed so that the priors for Chad and John being guilty were extremely low or zero. After participants rated the suspects, they were given further information, which (unsurprisingly for the reader of this paper) provided Chad (in Experiment 1) or John (in Experiment 2) with a strong motive for stealing the jewels. Participants were asked to rate the possible suspects again.
With this setup, we therefore have a situation in which a hypothesis with an extremely low prior suddenly acquires a high posterior. However, empirically, it is impossible to establish whether a prior is exactly zero versus just extremely low. Therefore, our test of Cromwell's rule has to correspond to a test of whether probability updating is consistent with Bayes' law, p(θ|x)/π(θ) = p(x|θ)/p(x); put differently, according to Bayes' law, the ratio of the posterior to the prior is constrained to equal the ratio of the likelihood of the evidence to its overall probability. To empirically test Bayes' law, we needed two kinds of information: (1) information on whether a suspect is guilty a priori and given the motive (the prior π(θ) and the conditional p(θ|x)), and (2) information on the motive a priori and given that the jewelry was stolen by a particular suspect θ (the probability p(x) and the conditional p(x|θ)). Theoretically, it would have been ideal to collect data on all these probabilities within participants. However, asking participants to estimate conditional probabilities in both directions would have potentially been very confusing, so we adopted a between-participants design regarding the estimation of the conditionals in the two directions.

Participants
We tested 57 participants, all students at the University of California, Irvine, who received fixed course credit for their time. There was little basis for conducting a power analysis, so the sample size was set a priori at approximately 60 participants, over a convenience testing window. The same approach was adopted in Experiment 2, for consistency.

Materials and procedure
The experiment was designed in Qualtrics and administered online. It lasted approximately 20 minutes. All participants received the main story, featuring the family of John, Jane, and their two children Chad and Cheryl, their two neighbors Matt and Mary, a cleaner, a gardener, and a burglar. Following this, there were 10 multiple-choice questions (such as whether John has 1, 2, 3, or 4 children), which were meant to examine participants' basic knowledge of the story. Participants could not continue with the test until they had answered all 10 questions correctly. As an additional safeguard that participants seriously engaged with the task, they then received 6 catch questions testing probability intuitions as well as understanding of the main story. Note that both the main story and all subsequent text that participants were presented with were accessible to participants throughout the experiment (that is, participants were told that this was not a memory test and they could take notes or make a copy of any screen). Four of this second group of 6 questions can be called 'easy catch' questions; they corresponded to a rather trivial understanding of probability, such as the probability that John has a son, when this fact was explicitly stated in the story. The other two can be called 'hard catch' questions and tested participants' basic probability intuitions; for example, participants were asked "What is the probability that Mary has two daughters?" about the neighbor Mary, who is known to be a mother of two, but whose children's genders were not explicitly provided in the main story.
After the catch questions, participants were told that Jane discovered her jewelry was missing, called the police, etc. Participants were given some facts that made it slightly likely that the cleaner or the gardener could have stolen the jewelry. Note also that one of the persons introduced in the main story was a local burglar, designed to be a fairly natural suspect. Following this information, participants were told to consider how likely each of the persons in the main story was to have stolen the jewelry; there were in total nine persons, and participants were also allowed to enter a probability for 'other persons'. Note that participants were told to indicate their probability estimates for each suspect on a 0 to 100 scale and that their probabilities should sum to 100 (this was checked automatically by the experimental software).
Following these ratings, the study proceeded differently for the two groups of participants. The first group was provided with a strong motive for Chad to have stolen the jewelry. In brief, they were told that Chad had wrecked a friend's car and needed money to repair it. They were then asked to rate the probability that each person had stolen the jewelry, as before. This was the last request in the experiment. This condition provided information for evaluating P(Je) and P(Je|I), where Je denotes the event that Chad stole the jewelry and I denotes the information that Chad wrecked the car and needed money to repair it.
In the second condition, after the first rating of the suspects, participants were given information about what was labeled a 'claim about Chad'. The claim was, basically, that Chad had wrecked a friend's car. Participants were asked to rate the probability of this claim about Chad. In a subsequent screen, participants were informed that it is known for a fact that the jewelry was stolen by Chad and were asked to update the probability of the claim about Chad. Thus, the second condition provides information that allows us to compute P(I) and P(I|Je). At this point, it should be clear why we opted to adopt a between-participants, rather than a within-participants, design: for participants who had been asked to estimate, e.g., P(Je|I), it would have been very confusing to subsequently estimate P(I|Je) as well. Finally, note that all main parts of the text employed in the two conditions are shown in Appendix 1.

Results
In Table 1, we show the probability estimates regarding all people in the story: the prior estimates (collected in both conditions of the experiment) as well as the updated ones (collected in the first condition), following the information about Chad. The prior probability that Chad is guilty, P(Je), was on average 1.5%, the updated probability P(Je|I) was on average 34%, the prior probability of the claim about Chad, P(I), was on average 40%, and the conditional probability of the claim about Chad, given that he had stolen the jewelry, P(I|Je), was on average 79%. We seek to test against the null hypothesis that P(Je|I)/P(Je) = P(I|Je)/P(I), as required by Bayes' law of probability updating. Intuitively, the reason why Bayes' law may be violated in this case is that we were expecting a huge increase in the probability that Chad is guilty, from the prior estimate P(Je) to the estimate P(Je|I), following the information that Chad had wrecked the car. Note first that, given the expectation of low (or zero) priors, we computed for each participant either P(Je|I)/P(Je) or P(I|Je)/P(I), depending on condition. Still, one participant in the first condition (out of 31) and another in the second (out of 26) produced estimates for which the relevant quantity could not be computed, and these were eliminated from further consideration (their results cannot be used to evaluate Bayes' law). We used the MATLAB software package to calculate z and p values.
The difference between P(Je|I)/P(Je) and P(I|Je)/P(I) was found to be highly significant, z=4.6, p=5·10^-6 (U=120, n_1=30, n_2=25), allowing us to reject the null hypothesis that these two quantities are equal. Next, recall that the easy catch questions provided a measure of the attention participants paid to the story, plus some trivial understanding of the probability concept (maximum possible score 4), and the hard catch questions a measure of participants' probabilistic intuitions (maximum possible score 2). We ran the U-test comparison once more, excluding all participants with an easy catch score of 2 or less (23 participants were excluded in this way); the test was significant again, z=-2.8, p=.006 (U=59, n_1=16, n_2=16). Overall, across a number of checks, there was strong evidence that probability updating was not consistent with Bayes' law.
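For illustration, the between-participants U-test comparison can be sketched as follows. The per-participant values are made-up numbers, not our data; illustratively, condition 1 supplies the ratio P(Je|I)/P(Je) and condition 2 the ratio P(I|Je)/P(I), and the z value uses the normal approximation to the Mann-Whitney U statistic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up per-participant ratios (NOT our data), one group per condition.
cond1 = rng.uniform(10.0, 40.0, size=30)  # e.g., P(Je|I)/P(Je)
cond2 = rng.uniform(1.0, 3.0, size=25)    # e.g., P(I|Je)/P(I)

n1, n2 = len(cond1), len(cond2)
# Mann-Whitney U: count the (cond1, cond2) pairs where cond1 exceeds cond2.
U = sum((a > b) + 0.5 * (a == b) for a in cond1 for b in cond2)
# z value from the normal approximation to the U distribution.
mu = n1 * n2 / 2.0
sigma = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
z = (U - mu) / sigma
print(U, z)  # complete separation here: U = 750, z is about 6.3
```

Under Bayes' law the two ratios estimate the same quantity, so a significant difference between the groups is evidence against Bayesian updating.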

Experiment 2

Participants
We tested 58 participants, all students at the University of California, Irvine, who received fixed course credit for their time.

Materials and procedure
Most aspects of this experiment are the same as for Experiment 1 -Experiment 2 was meant to be a simple replication of Experiment 1. Therefore, we only describe the ways in which Experiment 2 differed from Experiment 1.
In the first condition, after participants submitted the prior estimates of the probability that each person stole the jewelry, participants were provided with a strong motive for John. They were told that John had a severe, secret gambling problem and he owed a lot of money. They were also told that if Jane were to find out about John's debts, she would likely divorce him. So, this motive was meant to make John a very likely suspect. The first condition ended as in Experiment 1, with participants being asked to update the probability that each suspect was guilty, following this new information about John.
In the second condition, following the prior evaluation, participants were told about the above 'claim about John' and were asked to evaluate its probability, P(I). They were then told that the jewelry was definitely stolen by John and were asked to rate again the probability of the claim about John, given that John had stolen the jewelry, P(I|Je). In Appendix 2, the interested reader can see the text for Experiment 2 that differs from Experiment 1.

Results
We proceeded as for Experiment 1. Table 2 shows the estimates for the prior and updated probability of each person in the story having stolen the jewels. The prior probability that John had stolen the jewels, P(Je), was on average 1.6%, the updated P(Je|I) was on average 39%, the prior probability of the claim about John, P(I), was on average 41%, and the conditional probability of the claim, given that John was known to have stolen the jewelry, P(I|Je), was on average 83%. As before, we excluded participants for whom the quantities needed to evaluate Bayes' law were uncomputable. This was the case for two participants in the first condition (out of 30) and one in the second (out of 28). The two quantities on either side of Bayes' law were significantly different, z=4.7, p=2·10^-6 (U=648, n_1=28, n_2=27); thus we could not sustain the null hypothesis that participants' probability updating was consistent with Bayes' law. The same conclusion was reached when we ran the same U-test excluding all participants with an easy catch score of 2 or less (23 participants excluded; z=2.9, p=.004, U=202, n_1=17, n_2=15).

QPT application
In this section we illustrate how QPT can accommodate large jumps in probability updating. The human behavior we are interested in corresponds just to the process of probability updating. Thus, there is no detailed model to be constructed; instead, we specify a plausible representation and show that QPT can reproduce the observed probability updating. This is not possible with CP theory, as shown in the results sections of the two experiments.
We illustrate QPT probability updating in two ways, a coarse-grained one and a fine-grained one. Starting with the former, let us number the basis vectors as follows: e1 = |Chad>, e2 = |the burglar>, e3 = |the cleaner>, e4 = |the gardener>, e5 = |others>. For the time being, we confine our consideration to this five-dimensional case. Conveniently, in this basis, the probability for each hypothesis is given as the squared norm of the coefficient of the corresponding basis vector. Let us take the prior state in the simplest form, such that the probabilities are those given in Table 3 (the |others> possibility has been expanded to all relevant individuals, to anticipate the more fine-grained demonstration to follow shortly):

ψ0 = √0.5 e2 + √0.2 e3 + √0.2 e4 + √0.1 e5.

Note that e1 does not appear in the above equation, because the probability amplitude for the Chad vector is zero; the other amplitudes just correspond to the observed data. Given this state vector ψ0, we can reproduce all the prior probabilities by projecting ψ0 onto the mutually orthogonal basis vectors defined above, such as |Chad>, |the burglar>, and so on. The corresponding projectors Ei (one projector for each hypothesis; Equation 5) are given by Ei = |ei><ei| and lead to the corresponding probabilities in Table 3 through <Ei ψ0, ψ0>. Note that <vector| is the complex conjugate of |vector>. Finally, state ψ0 represents the cognitive state of a typical or average participant in the experiment, after the first step of evaluating suspect probabilities; we are not concerned with the cognitive state before this step. This state is meant to reproduce the response statistics across the sample, on average. As our objective is focused on probability updating in the sample as a whole, we did not pursue modeling of individual differences (with future extensions, this could be done with the formalism of density matrices).
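The prior-state construction can be checked numerically. A minimal sketch, assuming the five-dimensional basis ordering above and real amplitudes equal to the square roots of the Table 3 priors:

```python
import numpy as np

# Basis order assumed: e1..e5 = Chad, the burglar, the cleaner, the gardener, others.
# Prior amplitudes are the square roots of the Table 3 probabilities; Chad's is exactly 0.
priors = np.array([0.0, 0.5, 0.2, 0.2, 0.1])
psi0 = np.sqrt(priors)

# Each hypothesis projector is E_i = |e_i><e_i|; <E_i psi0, psi0> recovers the prior.
basis = np.eye(5)
recovered = np.array([(np.outer(e, e) @ psi0) @ psi0 for e in basis])
# `recovered` reproduces the Table 3 priors, including the zero for Chad.
```

The state is normalized by construction, since the priors sum to one.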
What is the effect of the new information about Chad on the mental state ψ0? Mostly, it advances the Chad hypothesis, without much affecting the other hypotheses. We know the specific effect this new information has on the mental state, since we know how the prior probabilities change after the new information. Thus, if the initial state is ψ0, as above, then the effect of the new information is to project ψ0 to a new state ψI, with coefficients whose squared norms give the updated probabilities. This new state ψI simply corresponds to the observed updated probabilities, after participants received the new information. We can identify a projection operator that takes ψ0 to ψI (note that ψI and the projector are not unique; however, all projectors have to be symmetric and equal to their square), and an example is given in Equation (9). Applying this projector to our initial state and normalizing yields ψI. Finally, to obtain the probability of each person being the thief, we again apply the projector Ei for a particular suspect, as shown in Equation (5). The probability for Chad has "jumped" from zero to 65%, while all other probabilities decreased. Note that a proper cognitive model of this decision-making situation would focus more on how the different kinds of evidence impact beliefs about states. However, as noted, here we only want to illustrate how QPT can accommodate large increases in probability; Equation (9) effectively constitutes an existence proof that QPT can do this. Note that, for simplicity, we took the last row of our projector to be zero, which naturally resulted in a zero updated probability for e5.
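The updating step can be sketched numerically without reproducing the full projector matrix: the text later reports the (unnormalized) projected vector F_I ψ0 = -0.42 e1 + 0.29 e2 + 0.1 e3 + 0.05 e4 (Equation 20), which is all the Lüders rule needs.

```python
import numpy as np

# Unnormalized projected state F_I psi0, as reported in the text (Equation 20);
# components in the order e1..e5 = Chad, burglar, cleaner, gardener, others.
proj = np.array([-0.42, 0.29, 0.10, 0.05, 0.0])

p_data = proj @ proj            # ||F_I psi0||^2 ≈ 0.27, the probability of the data
psi_I = proj / np.sqrt(p_data)  # Lüders-normalized updated state

posteriors = psi_I ** 2
# posteriors[0] ≈ 0.65: Chad's probability has "jumped" from a prior of exactly 0.
```

Note that the negative sign of the Chad amplitude is irrelevant for the probabilities, which depend only on squared norms.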
If one seeks a more fine-grained description of the results, then we can easily extend the above approach. Let us consider a more fine-grained basis, corresponding to the suspects in the order they appear in Tables 1, 2, and 3: e1 = |John>, e2 = |Jane>, e3 = |Chad>, e4 = |Cheryl>, e5 = |Matt>, e6 = |Mary>, e7 = |the gardener>, e8 = |the cleaner>, e9 = |the burglar>, e10 = |others>. For Experiment 1, the initial state can be chosen analogously to ψ0, with amplitudes equal to the square roots of the prior probabilities in Table 1. As previously, after updating by projection as in the coarse-grained case, projecting the resulting state ψ1 onto a basis vector ei yields the corresponding posterior probability, after participants had processed the new information regarding the theft case (as specified in the ith shaded row of Table 1).
Similarly, for Experiment 2, the initial state can be set so that the prior probabilities coincide with those in the unshaded rows of Table 2; projecting the updated state ψ1 onto a basis vector ei then yields the probability specified in the ith shaded row of Table 2.
Finally, we can consider the implications for psychological process of the application of QPT in this example. We can be guided by Dubois and Lambert-Mogiliansky (2015). Their argument appeals to the incompatibility of perspectives in the mind and the process of learning. For example, from the perspective of the initial information about possible suspects, John appears an extremely unlikely suspect. In technical terms, the basis corresponding to the initial perspective makes it very unlikely that John is a suspect. When the new information about John's gambling becomes available (Experiment 2), the mind shifts perspectives regarding John and the possibility that he is guilty. This new perspective incorporates information about how he is perhaps not a bad man, but his weakness for gambling now endangers his marriage. With this new perspective, the basis set for evaluating the possible guilt of the different suspects likewise changes, and the projection onto the subspace corresponding to John's guilt becomes large. Importantly, the new perspective produces probabilities that are not linearly related to the priors, so QPT can accommodate changes from priors to posteriors that are not possible classically, as well as interference effects.

Discussion
Experiments 1 and 2 are very similar and demonstrated very similar results. About 20% of the participants updated their zero or nearly zero priors to high confidence (more than 50%) upon receiving new information about John or Chad, demonstrating a strong deviation from Cromwell's rule. It is possible that the zeros initially entered by participants for some suspects are but a shorthand for some small number, like 0.00001. Participants were allowed but not encouraged to respond using long decimals. Still, the results cannot be accommodated by Bayesian updating. Moreover, an assumption that people can have such a highly refined scale of probability is somewhat implausible.
An attempt to reconcile these results with classical Bayesian updating may lie in direct observation (experimental collection of the relevant data) of all ten conditional probabilities. Denote the ten possible options θ = x1, θ = x2, …, in general θ = xi, where by θ we mean "the thief" and by the xi the possible suspects; for example, p(θ = x1) is the prior probability that John is the thief. In principle,

p(I) = Σi p(I | θ = xi) p(θ = xi). (18)

In the present work, participants were asked to directly evaluate the probability p(I). Alternatively, we could have asked them about all the p(I | θ = xi) on the right-hand side of (18) and then calculated p(I).
Would this lead to a smaller violation of Bayesian updating? If so, the present results could be interpreted as an unpacking effect (Tversky & Koehler, 1994), a kind of violation of the law of total probability in the form (18), or a conjunction effect, where p(I) is harder to estimate than the p(I | θ = xi) and, hence, estimates for p(I) are smaller than those for the p(I | θ = xi). For the moment, this is only a hypothesis, which shall be put to experimental test in the future. The possibilities θ = xi are always assumed to be mutually exclusive and exhaustive, meaning that exactly one suspect is actually the thief. These are probably the features that make Bayesian updating problematic in decision making. As noted in the Introduction, in earlier experiments violating Bayesian updating, a Shaferian representation of beliefs was quite successfully applied (staying within the framework of CP theory). This theory aims to relax the assumptions of exhaustiveness and mutual exclusiveness of options (and to account for experimental violations of these assumptions) by working with groups of options. For example, one option may belong to more than one group, so the groups cannot be thought of as mutually exclusive.
Shafer's idea is that, upon encountering new information in favor of or against one option, e.g. x1, probability is reallocated within each group containing x1, so that the structure of the groups may become finer, which finally leads to a decision. Interestingly, estimates for the other options, not initially grouped with x1, remain the same. Could our results be accounted for by a generalized version of CP theory, such as the one proposed by Shafer?
A common prior probability distribution observed in our experiments looks like Table 3: the gardener 20%, the cleaner 20%, the burglar 50%, others 10%, and the other six characters (the four family members plus the two neighbors) 0% each.
Suppose that the actual belief system was a grouping (for simplicity, with mutually exclusive subsets), as shown in Table 4. One may interpret this grouping as a reluctance to assign a zero prior to any hypothesis; it also saves computational capacity. A participant may think that the most obvious suspect is the burglar and that there are two other, equally probable, suspects; whoever is left can then be grouped into the least probable belief set.

  {the family, the neighbors, others}   10%
  {the gardener, the cleaner}           40%
  {the burglar}                         50%

Now, the evidence pointing towards Chad or John as the possible criminal does not entail Bayesian updating on a zero prior; rather, the new key suspect can reclaim all 10% of the belief belonging to his group. This nicely corresponds to one particular result of our experiment (updating from zero to a significant probability). Still, the other estimates, which remain unchanged in this (rather simple) Shaferian model, were found to change significantly in the experiment. According to Shafer's theory, the (classical) belief state updated on the incriminating information about Chad would be as shown in Table 5. This does not coincide with the results of our experiment (nor with our intuition) and indicates that the QPT approach is the more effective one for our problem. Of course, a more sophisticated Shaferian representation can be constructed to better accommodate the results, but we shall not address this in detail. Let us only note that the two opposite directions (adding Chad to more and more belief groups, to all of them in the extreme case, or separating him from the other suspects, in the extreme case into a group of his own) both finally result in a standard Bayesian updating scheme.
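The simple Shaferian reallocation described above can be sketched as follows; the group structure follows Table 4, and the member labels are illustrative shorthand, not the paper's notation.

```python
# Table 4 grouping: belief mass is attached to groups, not individual suspects.
groups = {
    ("family", "neighbors", "others"): 0.10,
    ("gardener", "cleaner"): 0.40,
    ("burglar",): 0.50,
}

# New evidence singles out Chad, a family member: within his group all mass is
# reallocated to him, while groups not containing him are left untouched.
updated = {}
for members, mass in groups.items():
    if "family" in members:       # the group containing Chad
        updated[("Chad",)] = mass
    else:
        updated[members] = mass
# Chad now carries 10%, but the gardener/cleaner and burglar masses are unchanged,
# unlike the significant changes to those estimates observed in the experiment.
```

This makes the mismatch concrete: the model reproduces the jump from zero for Chad, but not the observed decrease in the other suspects' probabilities.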
We have shown that Shafer's theory does not describe our data (at least not in a straightforward manner), in contrast to QPT, as the illustration presented in Section 4 shows. One feature that makes QPT promising is that quantum belief states are much richer (contain much more information) than classical belief states.
The richness of quantum belief states and the demonstration of how belief updating can proceed in QPT (especially Equations 9 and 13) raise the issue of the constraints that QPT can place on posterior probabilities, given some priors. In general, QPT allows most mappings from a prior to a posterior distribution. For example, consider a particular state vector in a multidimensional space. We have modeled the consideration of some information f through the projection of the state vector along the ray (or subspace) that corresponds to f. Then, given that f can be any ray in the relevant multidimensional space, this scheme can likewise capture any particular mapping from priors to posteriors (this was indeed the original motivation for this project, namely that QPT probabilistic updating would allow patterns of probabilistic updating inconsistent with Cromwell's rule). Still, there are data that cannot be modeled by Lüders' projective measurements, such as double conjunction fallacies or the data considered by Khrennikov et al. (2014). Moreover, quantum correlations, though stronger than classical correlations, are themselves subject to a particular bound (Tsirelson's bound in Bell inequalities); cognitive data so far have not produced a corresponding violation (e.g., Conte et al., 2008). Overall, formalizing the precise circumstances in which the applicability of QPT principles fails in relation to cognitive data is currently still an open issue. To a first approximation, one can observe that CPT and QPT ought to be constrained in analogous ways, since they are both frameworks for probabilistic inference, based on different axioms (see also Wang et al., 2014, for a proposal for a constraint on quantum probabilities not required classically, for decision data better modeled classically, and Shiffrin & Busemeyer, 2011, for a complexity comparison between matched CPT and QPT models).
To summarize, we suggest a method from QPT to describe our experimental data, which we found to violate classical Bayesian updating. The latter is based on the probability of a joint event:

Prob(hypothesis|data)·Prob(data) = Prob(data|hypothesis)·Prob(hypothesis) = Prob(hypothesis & data). (19)

Finding a direct analogue of this expression consistent with QPT is hardly feasible. To mention one problem, a product of two projectors is most often not itself a projector. Still, some attempts can be found in the literature on quantum physics (Steinberg, 1995), but the limitations must always be kept in sight. We can safely calculate Prob(data) as the squared norm of the belief state projected by the data projector. For example, in our example from Section 4, the hypothesis under consideration is that Chad is the thief. Applying operator (9) to the prior state ψ0 in form (8) gives

F_I ψ0 = -0.42 e1 + 0.29 e2 + 0.1 e3 + 0.05 e4, (20)

and taking the squared norm of the resulting vector, we get Prob(data) = 0.27 (27%). Now, guided by formula (5), we apply the projector E1 = |e1><e1| to ψI = F_I ψ0 / √0.27 (ψI is the normalized F_I ψ0) and we get Prob(hypothesis|data) = 0.42²/0.27 ≈ 0.65 (65%). Still, Prob(data|hypothesis) = 0, as the result of applying operator (9) to the zero vector, which, in turn, is the result of projecting ψ0 onto e1. Substituting the calculated values into formula (19), we again observe a violation of Bayesian updating, which is due to the noncommutativity of operators (9) and E1.
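The two measurement orders can be checked numerically, using the coarse-grained state from Section 4 and the projected vector of Equation (20); this is a sketch of the argument, not a full reconstruction of operator (9).

```python
import numpy as np

# Prior amplitudes (e1..e5 = Chad, burglar, cleaner, gardener, others) and the
# projected vector F_I psi0 reported in Equation (20).
psi0 = np.sqrt(np.array([0.0, 0.5, 0.2, 0.2, 0.1]))
F_psi0 = np.array([-0.42, 0.29, 0.10, 0.05, 0.0])

# Order 1: the new information first, then the Chad hypothesis.
p_data = F_psi0 @ F_psi0                 # Prob(data) ≈ 0.27
p_hyp_given_data = F_psi0[0]**2 / p_data # Prob(hypothesis|data) ≈ 0.65
lhs = p_hyp_given_data * p_data          # left-hand side of (19), ≈ 0.18

# Order 2: the Chad hypothesis first. E1 = |e1><e1| annihilates psi0, since
# Chad's prior amplitude is zero, so Prob(hypothesis) = 0 and the right-hand
# side of (19) vanishes.
E1_psi0 = np.zeros(5)
E1_psi0[0] = psi0[0]                     # = 0
rhs = E1_psi0 @ E1_psi0 * 0.0            # Prob(data|hypothesis)·Prob(hypothesis) = 0

# lhs != rhs: equation (19) fails, reflecting the noncommutativity of F_I and E1.
```

The nonzero left-hand side against a zero right-hand side is exactly the violation described in the text.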
Noncommutativity often leads to such order effects and, indeed, our data can be interpreted as an order effect.

Conclusion
We experimentally observed violations of classical, Bayesian updating of beliefs. As shown, updating on strong evidence can lead to a dramatic increase in confidence, from zero (practically denying the possibility) to almost complete confidence. We explain how and why quantum probability theory can be applied to describe the experimental results and resolve the zero-priors trap, in a way that is probably more efficient than following Cromwell's rule (assigning only non-zero and non-one probabilities to all the options).