Core Knowledge Deficits in Multi-Modal Language Models

While Multi-modal Large Language Models (MLLMs) demonstrate impressive abilities in high-level perception and reasoning, their robustness in the wild remains limited, and they often fall short on tasks that are intuitive and effortless for humans. We examine the hypothesis that these deficiencies stem from the absence of core knowledge: rudimentary cognitive abilities that are innate to humans from early childhood.

To explore core knowledge representation in MLLMs, we introduce CoreCognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science. We evaluate 230 models with 11 different prompts. Our experiments uncover four key findings that collectively demonstrate core knowledge deficits in MLLMs: they consistently underperform on low-level abilities relative to high-level ones and show reduced, or even absent, scalability on those low-level abilities.

Finally, we propose Concept Hacking, a novel controlled evaluation method that reveals MLLMs fail to progress toward genuine core knowledge understanding and instead rely on shortcut learning as they scale.

1University of California San Diego   2Johns Hopkins University   3Emory University  
4University of North Carolina at Chapel Hill   5Stanford University   6Ben-Gurion University of the Negev  
7University of Michigan   8University College London   9Carnegie Mellon University  

§Equal Contribution   †Corresponding author

CoreCognition Benchmark

CoreCognition evaluates twelve foundational cognitive concepts through a large-scale suite of controlled visual question-answering tasks. It reveals systematic core knowledge deficits in today's multi-modal large language models (MLLMs).

Illustration of the twelve core cognitive concepts

Dataset Curation

Building upon the above cognitive framework, we operationalize theoretical constructs into explicit examples designed to probe specific core abilities in MLLMs. To ensure conceptual integrity and interdisciplinary rigor, we establish three criteria that every successful instance must satisfy:

Discriminativeness

Instances should be structured such that models lacking the targeted core knowledge necessarily select the incorrect answer, thereby ensuring discriminative power.

Minimal Confounding

Questions should minimize reliance on confounding capabilities, such as object recognition, and must avoid conceptual overlap with the other core knowledge concepts in the benchmark.

Minimal Text Shortcut

Instances should be crafted so that answers cannot be derived through textual shortcuts alone but require genuine multimodal comprehension.

Data Curation Process and Methodology

Expert Collaboration

A total of 12 annotators, each with a college-level education in cognitive science, computer science, or statistics, collaborate on the curation of CoreCognition.

Twelve Core Concepts

1. Permanence: Objects do not cease to exist when they are no longer perceived.
2. Continuity: Objects persist as unified, cohesive entities across space and time.
3. Boundary: The transition from one object to another.
4. Spatiality: The a priori understanding of the Euclidean properties of the world.
5. Perceptual Constancy: Changes in appearance do not imply changes in physical properties.
6. Intuitive Physics: Intuitions about the laws of how things interact in the physical world.
7. Perspective: The ability to see what others see.
8. Hierarchy: Understanding of inclusion and exclusion of objects and categories.
9. Conservation: Invariance of properties despite transformations.
10. Tool Use: The capacity to manipulate specific objects to achieve goals.
11. Intentionality: The ability to see what others want.
12. Mechanical Reasoning: Inferring actions from system states and vice versa.

Dataset Statistics

230 MLLMs evaluated
11 prompt formats
1,503 image-question pairs
>3.8 million total judgments
Illustration of tasks for the 12 core concepts

Key Findings

Our study uncovers four primary shortcomings shared by state-of-the-art MLLMs:

1. Core Knowledge Deficits: MLLMs excel at higher-level abilities associated with later developmental stages but consistently struggle with lower-level abilities that typically emerge earlier in human cognition.

2. Misaligned Dependency: Core abilities exhibit weak cross-stage correlations, indicating an absence of developmental scaffolding.

3. Predictability: Performance on core knowledge is predictive of higher-level abilities.

4. Not Scaling: MLLMs exhibit limited or no scalability on low-level abilities compared to high-level abilities.

Core Knowledge Deficits

Modern MLLMs exhibit a pronounced "core knowledge deficit": they perform significantly better on higher-level abilities, sometimes comparable to or even surpassing humans, yet struggle with lower-level abilities that are associated with early developmental stages. This disparity is statistically significant and contrasts sharply with human performance, which remains consistently high across all stages.

Visualization of core knowledge deficit across abilities

Dependencies Between Core Abilities

Examining the interdependencies among core abilities provides a principled understanding of whether models develop coherent, hierarchically structured competencies akin to those seen in humans. To quantify the degree of co-variation consistent with developmental hierarchies, we compute pairwise Pearson correlations of model performance across all 12 abilities. The results reveal a distinct divergence: many correlations are modest (ρ < 0.4), while some clusters exhibit strong alignment (ρ > 0.65).
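The computation below is a minimal sketch of this analysis, assuming per-model accuracy scores for each ability are collected into a single array; the random placeholder data and variable names (scores, abilities) are illustrative and not taken from the released evaluation code.

# Minimal sketch: pairwise Pearson correlations among per-model ability
# scores. Assumes `scores` is an (n_models, 12) array of accuracies, one
# column per core ability; the random data below is only a placeholder.
import numpy as np

abilities = [
    "Permanence", "Continuity", "Boundary", "Spatiality",
    "Perceptual Constancy", "Intuitive Physics", "Perspective",
    "Hierarchy", "Conservation", "Tool Use", "Intentionality",
    "Mechanical Reasoning",
]
rng = np.random.default_rng(0)
scores = rng.uniform(0.3, 0.9, size=(230, len(abilities)))  # placeholder accuracies

# corr[i, j] is the Pearson r between abilities i and j across all models.
corr = np.corrcoef(scores, rowvar=False)

# Flag pairs as "modest" (|r| < 0.4) or "strong" (|r| > 0.65), mirroring
# the thresholds used in the text.
for i in range(len(abilities)):
    for j in range(i + 1, len(abilities)):
        r = corr[i, j]
        if abs(r) > 0.65 or abs(r) < 0.4:
            label = "strong" if abs(r) > 0.65 else "modest"
            print(f"{abilities[i]} vs {abilities[j]}: r = {r:+.2f} ({label})")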

Correlation heatmap among 12 core abilities

Core Abilities Predict Higher-Level Abilities

Strong performance on core abilities reliably predicts higher performance on most high-level abilities and public benchmarks. Concretely, we analyze Pearson correlations between model performance on the 12 core concepts (grouped into three developmental stages) and the same models' performance on 26 public benchmarks and 9 higher-level abilities defined by SEED-Bench 1 & 2. Except for Perspective and Intuitive Physics, core abilities strongly predict performance on both the public benchmarks (with the exception of ChartQA) and the higher-level abilities. We hypothesize that ChartQA is an exception because it relies on textual understanding, which is largely orthogonal to the core abilities examined here.
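As a rough illustration of this cross-correlation analysis, the sketch below correlates core-concept scores with external benchmark scores across models; the arrays core and bench are hypothetical placeholders standing in for the actual per-model results.

# Minimal sketch: correlating core-ability scores with external benchmark
# scores across models. `core` is (n_models, 12) and `bench` is
# (n_models, n_benchmarks); both are illustrative placeholders.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_models = 230
core = rng.uniform(0.3, 0.9, size=(n_models, 12))    # 12 core concepts
bench = rng.uniform(0.2, 0.8, size=(n_models, 26))   # 26 public benchmarks

# heatmap[i, j] = Pearson r between core concept i and benchmark j.
heatmap = np.empty((core.shape[1], bench.shape[1]))
for i in range(core.shape[1]):
    for j in range(bench.shape[1]):
        heatmap[i, j], _ = pearsonr(core[:, i], bench[:, j])

print("mean |r| per core concept:", np.abs(heatmap).mean(axis=1).round(2))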

Heatmap correlating core abilities with high-level benchmarks

Scaling Effect on Core Knowledge?

Not for low-level abilities! We evaluate the extent to which scaling benefits the low-level abilities rooted in core knowledge. By fitting a linear regression of performance on model size across the 230 evaluated models, we estimate the scaling effect for each ability as the slope of the regression line.
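A minimal sketch of this estimation is shown below, assuming per-model accuracies and parameter counts are available; regressing on the logarithm of model size is an illustrative choice here, and the exact fitting details in the paper may differ.

# Minimal sketch: estimating a scaling effect as the slope of a linear
# regression of per-ability accuracy on (log) model size. `sizes_b` and
# `acc` are placeholder data, not the paper's measurements.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
sizes_b = rng.uniform(0.5, 110.0, size=230)   # model sizes in billions of parameters
acc = 0.5 + 0.02 * np.log(sizes_b) + rng.normal(0.0, 0.05, size=230)

fit = linregress(np.log(sizes_b), acc)
print(f"scaling slope = {fit.slope:.4f} (p = {fit.pvalue:.3g}, r = {fit.rvalue:.2f})")
# A slope near zero (or negative, as observed for perspective taking)
# indicates that increasing model size does not improve that ability.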

Scaling analysis across model sizes

Our results reveal a clear dissociation between low-level and high-level abilities regarding scaling effects. For seven out of nine low-level abilities—excluding hierarchical relation and perceptual constancy—in the Sensorimotor and Concrete Operational stages, model performance shows significantly less improvement with increasing size, compared to the higher-level Formal Operational stage. Notably, perspective-taking ability even declines with scale, likely due to a persistent egocentric bias that intensifies as models grow larger. These findings indicate that scaling primarily benefits high-level reasoning, while its impact on low-level cognitive abilities is limited or even negative. This suggests that simply increasing model size is insufficient for developing core knowledge in MLLMs.

Does Reasoning Help?

To examine whether reasoning and test-time scaling enhance performance on core cognitive abilities, we evaluate reasoning-augmented models alongside their instruction-tuned counterparts.

Reasoning performance comparison across core cognitive concepts

Reasoning abilities and test-time scaling do not confer a clear advantage over instruction-tuned models. In 10 of 12 core abilities, no significant differences are observed. The two exceptions do not exhibit a consistent trend: reasoning models perform better on perceptual constancy (P=0.0669) and worse on perspective taking (P=0.0037). Overall, reasoning models show a modest, non-significant average improvement.
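The comparison can be sketched as follows; since the specific statistical test is not restated here, an independent two-sample t-test on hypothetical per-model accuracies serves only as an illustrative stand-in.

# Minimal sketch: comparing reasoning-augmented models with their
# instruction-tuned counterparts on one core ability. The accuracy arrays
# are hypothetical placeholders, and the t-test is an illustrative choice.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
reasoning_acc = rng.normal(0.62, 0.08, size=15)   # hypothetical per-model accuracies
instruct_acc = rng.normal(0.60, 0.08, size=15)

stat, p = ttest_ind(reasoning_acc, instruct_acc)
print(f"mean diff = {reasoning_acc.mean() - instruct_acc.mean():+.3f}, p = {p:.4f}")
# A non-significant p-value for most abilities would mirror the finding
# that reasoning models offer no consistent advantage on core knowledge.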

Concept Hacking: A Controlled Experiment

We introduce Concept Hacking, which systematically manipulates task-relevant features while preserving task-irrelevant conditions, so that the ground-truth labels are inverted. As exemplified in the figure below, each of 45 samples from CoreCognition is paired with a manipulated version that poses the identical question but has the opposite correct answer.

Concept hacking methodology and examples

Given a pair of tasks, a model's responses fall into one of three patterns (a minimal classification sketch follows these descriptions):

Core Knowledge

Correct responses on both controlled and manipulated tasks indicate genuine conceptual understanding.

Shortcut-taking

Models exploiting training data similarities perform well on controlled tasks but fail when familiar patterns are paired with inverted labels.

Core Deficits

Incorrect responses to controlled tasks, regardless of manipulation performance, indicate the absence of core knowledge.
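The following minimal sketch makes this three-way classification concrete; the function name and boolean arguments are illustrative and not part of the released code.

# Minimal sketch: map a model's outcomes on a controlled/manipulated task
# pair to one of the three response patterns described above.
def classify_pair(correct_on_control: bool, correct_on_manipulated: bool) -> str:
    if correct_on_control and correct_on_manipulated:
        return "core knowledge"   # robust to the label-inverting manipulation
    if correct_on_control and not correct_on_manipulated:
        return "shortcut-taking"  # succeeds only when familiar surface patterns hold
    return "core deficit"         # fails the controlled task itself

# Example: correct on the controlled task, wrong once the ground truth is inverted.
print(classify_pair(True, False))  # -> shortcut-taking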

Analysis of MLLM response patterns

Citation

If you find this project useful in your research, please consider citing:

@article{li2025core,
  title={Core Knowledge Deficits in Multi-Modal Language Models},
  author={Li, Yijiang and Gao, Qingying and Zhao, Tianwei and Wang, Bingyang and Sun, Haoran and Lyu, Haiyun and Luo, Dezhi and Deng, Hokin},
  journal={arXiv preprint arXiv:2410.10855},
  year={2025}
}