Reform Macrosociology with the Algorithmic Information Criterion for Macrosocial Model Selection

The National Science Foundation is to reform macrosociology by taking all of the following steps:

  1. Publish a macrosocial database consisting of a wide range of longitudinal social measures relevant (directly or indirectly) to predicting macrosocial dynamics.
  2. All funding for sociology is to flow into a Macrosociology Trust Fund, invested in 30-year treasury instruments.
  3. Payout from the Macrosociology Trust Fund goes to any institution or individual that improves the approximation of the macrosocial database’s algorithmic information, i.e., shrinks its smallest known executable archive (a verification-and-payout sketch follows this list).
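Here is a minimal sketch, in Python, of how step 3 could be adjudicated mechanically. It assumes submissions are self-contained executables that must reproduce the published database bit-for-bit; the function names and the proportional-payout rule are illustrative assumptions, not a specification:

```python
import hashlib
import os
import subprocess
import sys

def verified_archive_size(archive_path: str, database_sha256: str) -> int:
    """Run a submitted executable archive and confirm its output
    reproduces the published macrosocial database bit-for-bit.
    Returns the archive size in bytes: the number the prize
    rewards reductions in."""
    output = subprocess.run([sys.executable, archive_path],
                            capture_output=True, check=True).stdout
    if hashlib.sha256(output).hexdigest() != database_sha256:
        raise ValueError("archive does not reproduce the database")
    return os.path.getsize(archive_path)

def payout(prev_best_size: int, new_size: int, annual_interest: float) -> float:
    """One plausible rule: pay out the Trust Fund's annual interest
    in proportion to the fraction of the previous record's bits removed."""
    saved = max(prev_best_size - new_size, 0)
    return annual_interest * saved / prev_best_size
```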

Background:

See A Counterfactual History: Algorithmic Information Theory as the Foundation of Social Science

I can pretty much guarantee RFK Jr. won’t come up with the cause of the increase in autism incidence.

Popper and Kuhn both successfully attacked the foundation of data-driven scientific discovery of causality with their psychologically and rhetorically intense popularizations of “the philosophy of science” at the precise moment in history when it became practical to rigorously discriminate mere correlation from causation, even without controlled experimentation, by looking at the data.

I only became aware of this after attempting to do first-order epidemiology of the rise of autism that had severely impacted colleagues of mine in Silicon Valley, an investigation required since it was apparent that no one more qualified was bothering to do so. I did expect to find a correlation with non-western immigrants from India, and found one. Now, having said that, I’m not here to make the case for that particular causal hypothesis; there are others that I can set forth that I also expected and did find evidence for. What I’m here to point out is that my attempts to bring these hypotheses up were greeted with the usual “social science” rhetoric one expects: “Correlation doesn’t imply causation,” “Ecological correlations are invalid due to the ecological fallacy,” and so forth. This got me interested in precisely how “social science” purports to infer causation from data, experimental controls being the one widely accepted means of determining causality in the philosophy of science and the one thing social science can’t perform at the macro scale.

This interest was amplified when, on something of a lark, I decided to take the data I’d gathered to investigate the ecology of autism and see which of the ecological variables was the most powerful predictor of the others I had chosen. One variable in particular that I had been interested in, not for autism causation but for social causality in general, was the ratio of Jews to Whites in a human ecology at the state level in the US. Well, out of hundreds of variables, guess which one came out on top?

Of course, again, I don’t need to explain to the readers the kind of rhetorical attacks on this “lark” of mine: Same old, same old…

So my investigation of causal inference intensified.

Eventually, circa 2004-2005, I intuited that data compression had the answer and suggested something I called “The C-Prize”: a prize that would pay out for incremental improvements in compression of a wide-ranging corpus of data, resulting in computational models of complex dynamical systems, everything from physics to macrosocial models. That’s when I ran across information-theoretic work that distinguished Shannon information from what is now called “Algorithmic Information”. The seminal work in Algorithmic Information occurred in the late 1950s and early 1960s, precisely when Moore’s Law was taking off in its relentlessly exponentiating power. The Algorithmic Information content of a data set is the number of bits in its smallest executable archive: the smallest program that outputs that data.

Shannon information is basically just statistical. Think of the digits of pi: to any statistical test they look random, so their Shannon information content grows in proportion to the length of the digit string. Algorithmic Information says the information content is the size of the smallest program that outputs the digits of pi, which is tiny and stays fixed no matter how many digits you want.
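A quick way to see the gap concretely, sketched in Python: a general-purpose compressor works statistically, so it can never get ASCII digits of pi much below the entropy floor of log2(10) ≈ 3.32 bits per digit, while the short program that generates the digits stays the same size no matter how many digits you ask it for.

```python
import zlib

def pi_digits(n):
    """First n decimal digits of pi, via Gibbons' unbounded
    spigot algorithm (2006); exact integer arithmetic throughout."""
    q, r, t, i = 1, 180, 60, 2
    out = []
    for _ in range(n):
        u = 3 * (3 * i + 1) * (3 * i + 2)
        y = (q * (27 * i - 12) + 5 * r) // (5 * t)
        out.append(y)
        q, r, t, i = (10 * q * i * (2 * i - 1),
                      10 * u * (q * (5 * i - 2) + r - y * t),
                      t * u, i + 1)
    return out

digits = "".join(map(str, pi_digits(2000)))   # "31415926..."
packed = zlib.compress(digits.encode(), 9)
print(len(digits), len(packed))
# ~2000 bytes raw, roughly half that compressed: zlib only exploits
# the statistical redundancy of ASCII digits (10 symbols out of 256).
# The short generator above, by contrast, emits as many digits as you
# like; its size approximates the algorithmic information content.
```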

That 1960s discovery threatened to bring the social sciences to heel with a rigorous and principled information criterion for model selection far superior, and provably so, to all other model selection criteria used by the social sciences. Moreover, the models so-selected would be necessarily causal in nature and be amenable to using the power of silicon to make predictions without any kind of ideological bias.
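To state the criterion concretely (this is the standard minimum-description-length rendering of the idea, not wording from the 1960s papers themselves): select the model that minimizes the two-part code length

$$ M^{*} \;=\; \arg\min_{M} \big[\, \ell(M) + \ell(D \mid M) \,\big], $$

where $\ell(M)$ is the bit-length of the model’s executable description and $\ell(D \mid M)$ is the bit-length of the data encoded with the model’s help. Contrast the criteria social science actually uses,

$$ \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k \ln n - 2\ln\hat{L}, $$

which charge only for the parameter count $k$; the two-part code charges for every bit of model structure, which is what closes off overfit, ideologically convenient models.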

This, I strongly believe, was the precise reason Popper and Kuhn committed their acts of violence against science at the precise moment in history they did.

So what is the chance that RFK Jr. will apply this, the only rigorous tool to infer causality at the ecological level, given the threat it poses to the social pseudosciences?

See “HumesGuillotine” on GitHub.

Google DeepMind’s Gemini 2.5 Pro was terrible compared to Anthropic’s Claude 3.5 Sonnet at understanding the counterfactual consequences of a so-reformed macrosociology.

But Gemini 2.5 Pro’s fallacies are quite common ones, and it managed to pack many of them into just one paragraph:

“Applying Kolmogorov Complexity analysis, even in approximation, necessitates a high degree of data standardization. KC is defined relative to a fixed universal description language or computational model. Comparing the complexity of different datasets, or tracking the complexity of a single dataset over time as models improve, requires that the data be represented in a perfectly consistent and unambiguous format. Any variations in definition, measurement, or format could be misinterpreted as changes in the underlying phenomenon’s complexity rather than as artifacts of inconsistent representation.”

Sentence 1: wrong.
Sentence 2: right, but lacking critical nuance in practice.
Sentence 3: wrong.
Sentence 4: obviated by the model selection criterion.

Sentence 1 is wrong because, quite aside from missing the point entirely by conflating model analysis (i.e., model creation) with model selection, forensic epistemology starts with raw phenomena in any case. Raw data needs to be “cleaned”, the process by which the data is “cleaned” needs to be systematically documented, and there is no better documentation than the cleaning algorithm itself. All Sentence 1 is really saying is that if the raw dataset is not cleaned in advance, the size of the executable archive will increase. Well, cry me a river.
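To make that accounting concrete, a sketch under the assumption that a submission is split into named files (the file roles here are hypothetical):

```python
import os

def charged_bits(cleaning_script: str, model_archive: str,
                 residual_patch: str) -> int:
    """Total description length charged to a submission, in bits:
    the cleaning algorithm, the model of the cleaned data, and the
    residual needed to restore the raw data exactly. An undocumented
    cleaning step cannot be hidden; it simply reappears as a larger
    residual."""
    paths = (cleaning_script, model_archive, residual_patch)
    return 8 * sum(os.path.getsize(p) for p in paths)
```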

Sentence 2 needs the nuance that, in practice, we are only ever comparing widely used instruction sets, and these can be reduced (even in the case of GPUs with radically different architectures) to algorithmic descriptions, in standardized instruction sets, of the functions they perform during decompression. Making the archive sizes commensurable is therefore vastly simpler than standardizing judgment criteria across the various approaches to data analysis.
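That nuance has formal backing in the invariance theorem: for any two universal machines $U$ and $V$,

$$ \big|\, K_U(x) - K_V(x) \,\big| \;\le\; c_{U,V} \quad \text{for all } x, $$

where the constant $c_{U,V}$ is essentially the length of an interpreter for one machine written on the other, and does not depend on the data $x$. For instruction sets in wide use that interpreter is tiny relative to a longitudinal macrosocial database, which is why archive sizes are commensurable in practice.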

Sentence 3 is wrong for reasons similar to those described for Sentence 1, but I’ll expand on it since it is repeatedly thrown up by Crimestop victims: yes, data cleaning is huge… it’s a big deal… it costs a lot… blah blah blah. We hear you, victims! But not only is it a cost that must be paid in any event, and not only is recording and publishing the cleaning algorithmically the only intellectually honest approach, but enforcing that rigor brings pressure to bear not only on modeling the sociology of sociologists and their data collection methods, but on creating better standards for data gathering over time.

Sentence 4 is saying that someone somewhere will engage in bad data analysis. Yes. That’s right. They will, and that’s precisely why you want funding diverted away from them and toward people who do better data analysis under the Algorithmic Information Criterion for model selection. It also puts pressure on the organizations that select data to select it better, by making both the selection process and the consequences of bad selection more transparent. Bear in mind that data selection is not model selection, because data are not models.