Core Knowledge Deficits in Multi-Modal Language Models

While Multi-modal Large Language Models (MLLMs) demonstrate impressive abilities in high-level perception and reasoning, their robustness in the wild remains limited, and they often fall short on tasks that are intuitive and effortless for humans. We examine the hypothesis that these deficiencies stem from the absence of core knowledge: rudimentary cognitive abilities that are innate to humans from early childhood.

To explore core knowledge representation in MLLMs, we introduce CoreCognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science. We evaluate 230 models with 11 different prompts. Our experiments uncover four key findings that collectively demonstrate core knowledge deficits in MLLMs: they consistently underperform on low-level abilities relative to high-level ones and show reduced, or even absent, scalability on those low-level abilities.

Finally, we propose Concept Hacking, a novel controlled evaluation method that reveals MLLMs fail to progress toward genuine core knowledge understanding and instead rely on shortcut learning as they scale.

1University of California San Diego   2Johns Hopkins University   3Emory University  
4University of North Carolina at Chapel Hill   5Stanford University   6Ben-Gurion University of the Negev  
7University of Michigan   8University College London   9Carnegie Mellon University  

§Equal Contribution   †Corresponding author

CoreCognition Benchmark

CoreCognition evaluates twelve foundational cognitive concepts through a large-scale suite of controlled visual question-answering tasks. It reveals systematic core knowledge deficits in today's multi-modal large language models (MLLMs).

Illustration of the twelve core cognitive concepts

Dataset Curation

Building upon the above cognitive framework, we operationalize theoretical constructs into explicit examples designed to probe specific core abilities in MLLMs. To ensure conceptual integrity and interdisciplinary rigor, we establish three criteria that every successful instance must satisfy:

Discriminativeness

Instances should be structured such that models lacking the targeted core knowledge necessarily select the incorrect answer, thereby ensuring discriminative power.

Minimal Confounding

Questions should minimize reliance on confounding capabilities, such as object recognition, and must avoid conceptual overlap with the other core knowledge concepts in the benchmark.

Minimal Text Shortcut

Instances should be crafted so that answers cannot be derived through textual shortcuts alone but require genuine multimodal comprehension.

Data Curation Process and Methodology

Expert Collaboration

A total of 12 annotators, each with a college-level education in cognitive science, computer science, or statistics, collaborate on the curation of CoreCognition.

Twelve Core Concepts

1. Permanence: Objects do not cease to exist when they are no longer perceived.
2. Continuity: Objects persist as unified, cohesive entities across space and time.
3. Boundary: The transition from one object to another.
4. Spatiality: The a priori understanding of the Euclidean properties of the world.
5. Perceptual Constancy: Changes in appearance do not imply changes in physical properties.
6. Intuitive Physics: Intuitions about the laws of how things interact in the physical world.
7. Perspective: The ability to see what others see.
8. Hierarchy: Understanding of inclusion and exclusion of objects and categories.
9. Conservation: Invariance of properties despite transformations.
10. Tool Use: The capacity to manipulate specific objects to achieve goals.
11. Intentionality: The ability to see what others want.
12. Mechanical Reasoning: Inferring actions from system states and vice versa.

Dataset Statistics

230 MLLMs evaluated
11 prompt formats
1,503 image-question pairs
>3.8 million total judgments
Illustration of tasks for the 12 core concepts

Key Findings

Our study uncovers four primary shortcomings shared by state-of-the-art MLLMs:

1. Core Knowledge Deficits: MLLMs excel at higher-level abilities associated with later developmental stages but consistently struggle with lower-level abilities that typically emerge earlier in human cognition.

2. Misaligned Dependency: Core abilities exhibit weak cross-stage correlations, indicating an absence of developmental scaffolding.

3. Predictability: Performance on core knowledge is predictive of higher-level abilities.

4. Not Scaling: MLLMs exhibit limited or no scalability on low-level abilities compared to high-level abilities.

Core Knowledge Deficits

Modern MLLMs exhibit a pronounced "core knowledge deficit": they perform significantly better on higher-level abilities, sometimes comparable to or even surpassing humans, yet struggle with lower-level abilities that are associated with early developmental stages. This disparity is statistically significant and contrasts sharply with human performance, which remains consistently high across all stages.

Visualization of core knowledge deficit across abilities

Dependencies Between Core Abilities

Examining the interdependencies among core abilities provides a principled understanding of whether models develop coherent, hierarchically structured competencies akin to those seen in humans. To quantify the degree of co-variation consistent with developmental hierarchies, we compute pairwise Pearson correlations of model performance across all 12 abilities. The results reveal a distinct divergence: many correlations are modest (ρ < 0.4), while some clusters exhibit strong alignment (ρ > 0.65).
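The computation below is a minimal sketch of this analysis, assuming per-model accuracy scores for each ability are collected into a single array; the random placeholder data and variable names (scores, abilities) are illustrative and not taken from the released evaluation code.

# Minimal sketch: pairwise Pearson correlations among per-model ability
# scores. Assumes `scores` is an (n_models, 12) array of accuracies, one
# column per core ability; the random data below is only a placeholder.
import numpy as np

abilities = [
    "Permanence", "Continuity", "Boundary", "Spatiality",
    "Perceptual Constancy", "Intuitive Physics", "Perspective",
    "Hierarchy", "Conservation", "Tool Use", "Intentionality",
    "Mechanical Reasoning",
]
rng = np.random.default_rng(0)
scores = rng.uniform(0.3, 0.9, size=(230, len(abilities)))  # placeholder accuracies

# corr[i, j] is the Pearson r between abilities i and j across all models.
corr = np.corrcoef(scores, rowvar=False)

# Flag pairs as "modest" (|r| < 0.4) or "strong" (|r| > 0.65), mirroring
# the thresholds used in the text.
for i in range(len(abilities)):
    for j in range(i + 1, len(abilities)):
        r = corr[i, j]
        if abs(r) > 0.65 or abs(r) < 0.4:
            label = "strong" if abs(r) > 0.65 else "modest"
            print(f"{abilities[i]} vs {abilities[j]}: r = {r:+.2f} ({label})")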

Correlation heatmap among 12 core abilities

Core Abilities Predict Higher-Level Abilities

Strong performance on core abilities reliably predicts higher performance on most high-level abilities and public benchmarks. Concretely, we analyze Pearson correlations between model performance on the 12 core concepts (grouped into three developmental stages) and the same models' performance on 26 public benchmarks and 9 higher-level abilities defined by SEED-Bench 1 & 2. Except for Perspective and Intuitive Physics, core abilities strongly predict performance on both the public benchmarks (with the exception of ChartQA) and the higher-level abilities. We hypothesize that ChartQA is an exception because it relies on textual understanding, which is largely orthogonal to the core abilities examined here.
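As a rough illustration of this cross-correlation analysis, the sketch below correlates core-concept scores with external benchmark scores across models; the arrays core and bench are hypothetical placeholders standing in for the actual per-model results.

# Minimal sketch: correlating core-ability scores with external benchmark
# scores across models. `core` is (n_models, 12) and `bench` is
# (n_models, n_benchmarks); both are illustrative placeholders.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_models = 230
core = rng.uniform(0.3, 0.9, size=(n_models, 12))    # 12 core concepts
bench = rng.uniform(0.2, 0.8, size=(n_models, 26))   # 26 public benchmarks

# heatmap[i, j] = Pearson r between core concept i and benchmark j.
heatmap = np.empty((core.shape[1], bench.shape[1]))
for i in range(core.shape[1]):
    for j in range(bench.shape[1]):
        heatmap[i, j], _ = pearsonr(core[:, i], bench[:, j])

print("mean |r| per core concept:", np.abs(heatmap).mean(axis=1).round(2))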

Heatmap correlating core abilities with high-level benchmarks

Scaling Effect on Core Knowledge?

Not for low-level abilities! We evaluate the extent to which scaling benefits the low-level abilities rooted in core knowledge. By fitting a linear regression of performance on model size across the 230 evaluated models, we estimate the scaling effect for each ability as the slope of the regression line.
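A minimal sketch of this estimation is shown below, assuming per-model accuracies and parameter counts are available; regressing on the logarithm of model size is an illustrative choice here, and the exact fitting details in the paper may differ.

# Minimal sketch: estimating a scaling effect as the slope of a linear
# regression of per-ability accuracy on (log) model size. `sizes_b` and
# `acc` are placeholder data, not the paper's measurements.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
sizes_b = rng.uniform(0.5, 110.0, size=230)   # model sizes in billions of parameters
acc = 0.5 + 0.02 * np.log(sizes_b) + rng.normal(0.0, 0.05, size=230)

fit = linregress(np.log(sizes_b), acc)
print(f"scaling slope = {fit.slope:.4f} (p = {fit.pvalue:.3g}, r = {fit.rvalue:.2f})")
# A slope near zero (or negative, as observed for perspective taking)
# indicates that increasing model size does not improve that ability.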

Scaling analysis across model sizes

Our results reveal a clear dissociation between low-level and high-level abilities regarding scaling effects. For seven out of nine low-level abilities—excluding hierarchical relation and perceptual constancy—in the Sensorimotor and Concrete Operational stages, model performance shows significantly less improvement with increasing size, compared to the higher-level Formal Operational stage. Notably, perspective-taking ability even declines with scale, likely due to a persistent egocentric bias that intensifies as models grow larger. These findings indicate that scaling primarily benefits high-level reasoning, while its impact on low-level cognitive abilities is limited or even negative. This suggests that simply increasing model size is insufficient for developing core knowledge in MLLMs.

Does Reasoning Help?

To examine whether reasoning and test-time scaling enhance performance on core cognitive abilities, we evaluate reasoning-augmented models alongside their instruction-tuned counterparts.

Reasoning performance comparison across core cognitive concepts

Reasoning abilities and test-time scaling do not confer a clear advantage over instruction-tuned models. In 10 of 12 core abilities, no significant differences are observed. The two exceptions do not exhibit a consistent trend: reasoning models perform better on perceptual constancy (P=0.0669) and worse on perspective taking (P=0.0037). Overall, reasoning models show a modest, non-significant average improvement.
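The comparison can be sketched as follows; since the specific statistical test is not restated here, an independent two-sample t-test on hypothetical per-model accuracies serves only as an illustrative stand-in.

# Minimal sketch: comparing reasoning-augmented models with their
# instruction-tuned counterparts on one core ability. The accuracy arrays
# are hypothetical placeholders, and the t-test is an illustrative choice.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
reasoning_acc = rng.normal(0.62, 0.08, size=15)   # hypothetical per-model accuracies
instruct_acc = rng.normal(0.60, 0.08, size=15)

stat, p = ttest_ind(reasoning_acc, instruct_acc)
print(f"mean diff = {reasoning_acc.mean() - instruct_acc.mean():+.3f}, p = {p:.4f}")
# A non-significant p-value for most abilities would mirror the finding
# that reasoning models offer no consistent advantage on core knowledge.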

Concept Hacking: A Controlled Experiment

We introduce Concept Hacking, which systematically manipulates task-relevant features while preserving task-irrelevant conditions, so that the ground-truth labels are inverted. As exemplified in the figure below, each of 45 samples from CoreCognition is paired with a manipulated version that poses the identical question but has the opposite correct answer.

Concept hacking methodology and examples

Given a pair of tasks, a model's responses fall into one of three patterns (a minimal classification sketch follows these descriptions):

Core Knowledge

Correct responses on both controlled and manipulated tasks indicate genuine conceptual understanding.

Shortcut-taking

Models exploiting training data similarities perform well on controlled tasks but fail when familiar patterns are paired with inverted labels.

Core Deficits

Incorrect responses to controlled tasks, regardless of manipulation performance, indicate the absence of core knowledge.
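The following minimal sketch makes this three-way classification concrete; the function name and boolean arguments are illustrative and not part of the released code.

# Minimal sketch: map a model's outcomes on a controlled/manipulated task
# pair to one of the three response patterns described above.
def classify_pair(correct_on_control: bool, correct_on_manipulated: bool) -> str:
    if correct_on_control and correct_on_manipulated:
        return "core knowledge"   # robust to the label-inverting manipulation
    if correct_on_control and not correct_on_manipulated:
        return "shortcut-taking"  # succeeds only when familiar surface patterns hold
    return "core deficit"         # fails the controlled task itself

# Example: correct on the controlled task, wrong once the ground truth is inverted.
print(classify_pair(True, False))  # -> shortcut-taking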

Analysis of MLLM response patterns

Citation

If you find this project useful in your research, please consider citing:

@article{li2025core,
  title={Core Knowledge Deficits in Multi-Modal Language Models},
  author={Li, Yijiang and Gao, Qingying and Zhao, Tianwei and Wang, Bingyang and Sun, Haoran and Lyu, Haiyun and Luo, Dezhi and Deng, Hokin},
  journal={arXiv preprint arXiv:2410.10855},
  year={2025}
}