### We must use a modeling approach to COVID-19 data that will yield the least biased inference and prediction.

*Read the Reflection, written 23 July 2021, below the following original Transmission.*

As the world faces the possibility of recurring waves of the current novel coronavirus pandemic, it is critical to identify patterns and dynamics that could be leveraged to decrease future transmission, infection, and death rates. At this stage in the pandemic, data on disease patterns and dynamics are emerging from almost all countries in the world. Variations across countries with respect to coronavirus infection rates, public health policies, social structure, norms, health conditions, environmental policy, climate, and other factors provide us with the data to investigate the impact of different underlying factors and governmental policies on COVID-19 transmission, infection, and death rates.

Despite the fact that millions have been infected and hundreds of thousands have died from COVID-19, the available information is still insufficient for reaching precise inferences and predictions. This is because the available data on each patient are very limited, the variables of interest are highly correlated, and great uncertainty surrounds the underlying process. In addition, though the death rate from COVID-19 is high relative to other infectious diseases, from an inferential point of view it is still very small: the number of deaths is extremely small relative to the number of those who did not die. As a result, the observations of interest lie in the tail of the survival probability distribution. In short, the available data for analysis of COVID-19 are complex, constantly evolving, and ill-behaved. Inferring and modeling with such data yields a continuum of explanations and predictions, so we need a modeling and inferential approach that produces the least biased inference and prediction. Unfortunately, traditional approaches impose strong assumptions and structures — most of which are incorrect or cannot be verified — leading to biased, unstable, and misguided inferences and predictions. Information theory offers a solution: it provides a rational inference framework for dealing with mathematically underdetermined problems, allowing us to achieve the least biased inferences.

An information-theoretic approach — specifically, info-metrics — is situated at the intersection of information theory, statistical inference, decision-making under uncertainty, and modeling. In this framework, all information enters as constraints, plus added uncertainty, within a constrained optimization setup, and the decision function is an information-theoretic one. That decision function is defined simultaneously over the entities of interest — say, patients’ survival probabilities — and the uncertainty surrounding the constraints. The framework extends the maximum entropy principle of Jaynes, which uses Shannon’s entropy as the decision function for problems surrounded by much uncertainty. Info-metrics has clear parallels with more traditional approaches, where the joint choice of the information used (within the optimization setting) and a particular decision function determines a likelihood function. The encompassing role of constrained optimization ensures that the info-metrics framework is suitable for constructing and validating new theories and models using all types of information. It also enables us to test hypotheses about competing theories or causal mechanisms. For certain problems, traditional maximum likelihood is a special case of info-metrics.
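To make the maximum entropy principle concrete, here is a minimal numerical sketch (not from the study) of Jaynes’s classic die problem: given only that the observed mean of a six-sided die is 4.5, we find the least biased distribution over the faces by maximizing Shannon’s entropy subject to that single moment constraint. The setup and numbers are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Jaynes's die example: infer the least biased distribution over faces 1..6
# given only that the observed mean is 4.5 (a fair die would give 3.5).
faces = np.arange(1, 7)

def neg_entropy(p):
    # Negative Shannon entropy (we minimize); epsilon guards against log(0)
    return float(np.sum(p * np.log(p + 1e-12)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},    # probabilities sum to 1
    {"type": "eq", "fun": lambda p: faces @ p - 4.5},  # observed mean constraint
]
res = minimize(neg_entropy, np.full(6, 1 / 6),         # start from the uniform
               bounds=[(0.0, 1.0)] * 6, constraints=constraints)
p = res.x  # exponentially tilted toward the higher faces
```

With only the mean as information, the solution is the flattest distribution consistent with it; any further structure imposed on the answer would be information the data do not supply.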

The info-metrics approach is well suited to dealing with the complex and uncertain cross-country COVID-19 pandemic data, specifically the relatively small sample size of detailed data, high correlations in the data, and the observations in the tail of the distribution. For this analysis, we developed a discrete-choice, binary (recovered/died) model to infer the association between the underlying country-level factors and death. The model controls for age, sex, and whether the country had universal vaccination for measles and Hepatitis B. This information-theoretic approach also allows us to complement existing data with priors constructed from the death frequency (by age and sex) of individuals who were infected with Severe Acute Respiratory Syndrome (SARS). For the detailed study, see Golan et al.
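As an illustration of how priors can enter such a framework, the sketch below minimizes the Kullback–Leibler divergence (cross-entropy) from a prior while matching an observed moment. The age groups, prior, and moment are invented stand-ins, not the SARS-based priors or the data used in the actual study.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical sketch of bringing in priors: pull an inferred distribution of
# deaths across age groups toward a prior while matching an observed moment.
# All numbers below are invented placeholders, NOT the study's SARS priors.
ages = np.array([30.0, 45.0, 60.0, 75.0])   # age-group midpoints (illustrative)
prior = np.array([0.05, 0.15, 0.30, 0.50])  # hypothetical prior death shares
observed_mean_age = 68.0                    # hypothetical observed moment

def cross_entropy(p):
    # Kullback-Leibler divergence from the prior (minimized, not maximized)
    return float(np.sum(p * np.log((p + 1e-12) / prior)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},
    {"type": "eq", "fun": lambda p: ages @ p - observed_mean_age},
]
res = minimize(cross_entropy, prior, bounds=[(0.0, 1.0)] * 4,
               constraints=constraints)
p = res.x  # stays as close to the prior as the new moment allows
```

The inferred distribution departs from the prior only as far as the new information forces it to, which is the sense in which the prior complements scarce data.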

Using data from twenty countries published on the public server on April 24, 2020, our study found a number of country-level factors with a significant impact on the survival rate of COVID-19 patients. One of these is a country’s past or present universal TB (BCG) vaccination. Another is the country’s air-pollution death rate. Some quantified results (by age — the x-axis — and sex) are presented in the figure below. The left panel shows the predicted death probability conditional on universal BCG vaccination policy. There are three possibilities: countries that never had universal vaccination (say, the United States), that currently have it (say, the Philippines), or that had it in the past (say, Australia). The large impact of universal BCG vaccination on survival rates, across all ages, is clear. The right panel shows the probability of dying conditional on the air-pollution death rate — the number of deaths attributable to the joint effects of household and ambient air pollution per year per 100,000 population. The continuous line reflects the 90th percentile of pollution; the dashed line reflects the 10th percentile.

**Figure.** The probability of dying conditional on BCG vaccination policies (left), and the probability of dying conditional on pollution (right), showing the death rate at the 10th percentile (dashed) vs. the 90th percentile (continuous). The x-axis is the patients’ age.

The same framework can be used for modeling all other pandemic-related problems, even under much uncertainty and evolving, complex data. Examples include conditional Markov processes, dynamical systems, and systems that evolve simultaneously. The info-metrics framework allows us to construct theories and models and to perform consistent inferences and predictions with all types of information and uncertainty. Naturally, each problem is different and demands its own information and structure, but the info-metrics framework provides us with the general logical foundations and tools for approaching all inferential problems. It also allows us to incorporate priors and guides us toward a correct specification of the constraints — the information we have and use — which is a nontrivial problem.

So, should we always use info-metrics? To answer this, it is necessary to compare info-metrics with other methods used for policy analysis and causal inference. All inferential methods force choices, impose structures, and require assumptions. With complex and ill-behaved pandemic data, more assumptions are needed. Together with the data used, these imposed assumptions determine the inferred solutions. The assumptions and structures include the likelihood function, the decision function, and other parametric (or even nonparametric) assumptions about the functional form or the constraints used. They are needed because, without this additional information, all such problems are underdetermined. A logical way to compare different inferential approaches (classical and Bayesian), especially for complex and ill-behaved pandemic data, is within a constrained optimization setup. That way, the comparison is on a fair basis, as we can account for the information used in each approach. But such a detailed comparison, including other approaches like agent-based models (ABM), deserves its own paper and is outside the scope of this essay.

Here, I point toward two basic choices we need to make when using the info-metrics approach. The first is the choice of the constraints. The constraints are chosen based on the symmetry conditions or the theory we know (or hypothesize) about the problem; they capture the rules that govern the system we study. Mathematically, they must be satisfied within the optimization. Statistically, if specified correctly, they are sufficient statistics. In the more classical and Bayesian approaches, the constraints are directly related to the parametric functional form used (say, linear or nonlinear). But specifying the constraints within info-metrics, or the functional forms in other approaches, is far from trivial and affects the inferred solution. Info-metrics provides us with a way to falsify the constraints and points us in the direction of improving them. That choice, together with the decision function used, determines the exact functional form of the solution, or the inference.

The second choice we make in the info-metrics framework is to construct the constraints as stochastic. This differs from the classical maximum-entropy approach, where the constraints must be perfectly satisfied, and from classical approaches, where the likelihood and functional forms must be perfectly specified. But there is no free lunch. To achieve this more generalized framework, which allows us to model and infer a larger class of problems, we must bear the cost of specifying bounds on the uncertainty. These bounds are theoretically or empirically derived. Regardless of that derivation, what we give up is the assurance that our solution is first-best; it may instead be a second-best solution, a solution describing an approximate theory, or the evolution of a complex theory derived from a mix of different underlying elements and distributions. The benefit is that whenever we deal with insufficient and uncertain information, this framework allows us to account for all types of uncertainty and to handle ill-behaved data. Out of all possible methods, it uses the least amount of information and therefore tends to produce the least biased inference.
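The trade-off described above can be illustrated with a toy one-moment problem (a die with observed mean 4.5): relaxing the moment constraint so that it need only hold within analyst-chosen uncertainty bounds yields a higher-entropy, less committed solution. The bound `delta` is an assumption of this sketch, not a value from the study.

```python
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)

def neg_entropy(p):
    # Negative Shannon entropy (we minimize); epsilon guards against log(0)
    return float(np.sum(p * np.log(p + 1e-12)))

def solve(delta):
    # The moment constraint need only hold within +/- delta (a "stochastic"
    # constraint); delta = 0 recovers the classical maximum-entropy problem.
    cons = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "ineq", "fun": lambda p: delta - (faces @ p - 4.5)},
        {"type": "ineq", "fun": lambda p: delta + (faces @ p - 4.5)},
    ]
    res = minimize(neg_entropy, np.full(6, 1 / 6),
                   bounds=[(0.0, 1.0)] * 6, constraints=cons)
    return res.x

p_exact = solve(0.0)    # constraints perfectly satisfied
p_relaxed = solve(0.5)  # analyst-chosen uncertainty bound (an assumption here)
```

The relaxed solution has entropy at least as high as the exact one: the price of specifying the bound buys a solution that commits less strongly to a possibly noisy moment.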

Whether it is more convenient or appropriate to choose a likelihood function or to determine the structure of the constraints from symmetry conditions and other information is a decision each researcher faces. When approaching it, we should keep in mind that the constraints are only one part of the decision. The choice of method ultimately depends on the problem we are trying to solve, the information we have, and the researcher’s preference.

*Amos Golan*

*American University*

*Santa Fe Institute*

*Pembroke College, Oxford*

**REFERENCES**

- Jaynes, E. T. (1957). “Information Theory and Statistical Mechanics.” Physical Review 106 (4): 620–630. https://doi.org/10.1103/PhysRev.106.620.
- Shannon, C. E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal 27: 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
- Golan, A. (2018). Foundations of Info-Metrics: Modeling, Inference, and Imperfect Information. Oxford University Press. http://info-metrics.org.
- Golan, A., et al. (2020). “Effect of Universal TB Vaccination and Other Policy-Relevant Factors on the Probability of Patient Death from COVID-19.” Working Paper 2020-041, Human Capital and Economic Opportunity Working Group, University of Chicago.

**Read more posts in the Transmission series, dedicated to sharing SFI insights on the coronavirus pandemic.**

## Reflection

July 23, 2021

**FINDING WAYS TO MODEL COMPLEX DATA AMID THE ABSOLUTELY UNEXPECTED**

In my Transmission, “Info-Metrics for Modeling and Inference with Complex and Uncertain Pandemic Information,” I discussed the problem of modeling and trying to understand complex pandemic data under insufficient and imperfect information. I stressed the fact that such problems have multiple solutions due to model and theory ambiguity, a high level of uncertainty in the information, and the ill-behaved nature of the observed information. I argued in favor of the information-theoretic approach for modeling and inference, called info-metrics, as the least biased and most efficient approach for studying such complex problems.

Looking at the available worldwide data early in the pandemic period—when the COVID-19 disease and its exact transmission were still mysterious—I used the info-metrics approach to probe potential policies and interventions that could be applied to increase the survival rate of infected individuals. The universal Bacillus Calmette–Guérin (BCG) tuberculosis (TB) vaccination was the most pronounced factor, followed by pollution rates.

Looking back over the pandemic to date, the COVID-19 infection rate grew exponentially: as of July 23, 2021, more than 190 million people had been infected and about 4.1 million of them had died. This is about a 0.021 probability of dying (or 2.1% of the infected) conditional only on being infected. We now have much more data and a better understanding of how to treat the disease. For modeling and inference of complex systems, this means we can get improved results and predictions. My original study controlled for age, sex, and other basic countrywide indicators and explored the possible effects of existing factors. The dataset I had was extremely small, and the observed share of patients who died was well above the 2.1% mentioned above. Now we have a vaccine, but it is still important to study the potential implications of different health, environmental, and economic policies for COVID-19 infection.

A year later, I see two key lessons. First, though my original study was done with very little information and data, the basic results seem to be quite robust, even months later. Recent evidence confirms the higher survival rate of patients from countries with a TB vaccination. Similarly, using newly available data, the conditional probabilities by age and sex shown in the original study are somewhat lower but qualitatively unchanged. I expected the model to work, but I didn’t expect it to be so robust when millions of new observations became available.

Second, it is important to keep in mind that regardless of the amount of information (and data points) we have, there will always remain uncertainty and model ambiguity when modeling complex systems. Furthermore, such data are often highly correlated and, as discussed in my Transmission, the interest lies in the tail of the distribution (the conditional death rate of infected individuals). To understand such systems, regardless of the amount of data, one must accommodate these complications. The information-theoretic, least-biased approach seems like a good candidate for the task. It allows us to determine the simplest possible (approximate) theory with the information we have, while imposing as few unverifiable assumptions as possible. It also provides a natural way to falsify the theory and directs us toward a better one. But to reemphasize, each problem is unique, and specifying the information (the constraints) is far from trivial. When specifying the information, it is essential to accommodate possible uncertainty around it. Overall, the framework I described allows us to model and better understand very complex problems. If done correctly, it provides us with the best possible approximate theory given the uncertain and insufficient information we have.

Looking back, I also see that, though I have no background in the medical sciences, my own work helped me characterize a close connection between basic vaccinations and an individual’s relative immunity to the coronavirus (conditional on their own characteristics and other country-level factors). I was also able to understand the direct impact of the environment on the probability of survival. More generally, it helped me realize that certain basic policies affecting health and wellness, some of which are very simple, should be implemented (or reimplemented) in places where they are currently unavailable.

On a more fundamental level, looking back, it is also clear — and expected — that we can never accurately predict future pandemics and events originating from complex systems. This is not only because these are complex systems and data. Rather, from our point of view as observers, these systems are surrounded by much uncertainty and ignorance (defined below). This means that we must accommodate model and theory ambiguity. Furthermore, these systems are constantly evolving, so we need to allow our theories to evolve as well, which is practically impossible. Therefore, we have to keep testing our theories and adjusting them according to new information, but they will always remain “approximate theories.”

What have we learned? Modeling should not be just about prediction or forecasting but also about causal relationships. These relationships allow us to understand the direct and indirect impacts of different policies on the transmission and survival rates conditional on patients’ characteristics and other environmental and economic conditions. With that said, I believe the current theories are incomplete and cannot yet explain the pandemic, its sources, and the conditional transmission and survival rates (at least for unvaccinated individuals).

Are current models and theories satisfactory? Are they sufficient to reduce the risk of future pandemics or other unexpected natural or human-made disasters? All systems, including complex ones, evolve constantly. Society and nature coevolve simultaneously. The models and theories constructed to describe social outcomes must constantly change. Modeling the dynamics of the change is very difficult (or practically impossible). And this is not just due to uncertainty and insufficient information. This is also due to ignorance about new pathogens such as SARS-CoV-2 and our resulting difficulties in responding to them. It is not quantifiable uncertainty, which we can handle, but rather the absolute unexpected. So, yes, we have learned much, and we have improved our understanding, but we need to always update and learn from the new events and information we observe.

Mathematically, the basic problems described above also hold for modeling recent social unrest and different types of discrimination in society and the marketplace. The information is insufficient and uncertain and, regardless of the amount of data, the problems are ill-behaved. Like the pandemic data, the solution is at the tail of the distribution. The info-metrics framework summarized very briefly in my earlier Transmission, and commented on here, provides us with a basic framework for studying such problems.

The pandemic revealed to us how much we still do not understand and cannot predict. But it also taught us that science endowed us with tools that allow us to partially understand complex data even when the available information remains insufficient and complex. If these tools are used correctly, they provide us with ways to make causal inferences despite the complexities discussed above.

**Read more thoughts on the COVID-19 pandemic from complex-systems researchers in The Complex Alternative, published by SFI Press.**