Core Knowledge Deficits in Multi-Modal Language Models
While Multi-modal Large Language Models (MLLMs) demonstrate impressive abilities in high-level perception and reasoning, their robustness in the wild remains limited, often falling short on tasks that are intuitive and effortless for humans. We examine the hypothesis that these deficiencies stem from the absence of core knowledge: rudimentary cognitive abilities innate to humans from early childhood.
To explore the core knowledge representation in MLLMs, we introduce CoreCognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science. We evaluate 230 models with 11 different prompts. Our experiments uncover four key findings, collectively demonstrating core knowledge deficits in MLLMs: they consistently underperform and show reduced, or even absent, scalability on low-level abilities relative to high-level ones.
Finally, we propose Concept Hacking, a novel controlled evaluation method that reveals MLLMs do not progress toward genuine core knowledge understanding as they scale, but instead rely on shortcut learning.
CoreCognition Benchmark
CoreCognition evaluates twelve foundational cognitive concepts through a large-scale suite of controlled visual question-answer tasks. It reveals systematic core knowledge deficits in today's multi-modal large language models (MLLMs).

Dataset Curation
Building upon the above cognitive framework, we operationalize theoretical constructs into explicit examples designed to probe specific core abilities in MLLMs. To ensure conceptual integrity and interdisciplinary rigor, we establish criteria that define successful instances:
Discriminativeness
Instances should be structured such that models lacking the targeted core knowledge necessarily select an incorrect answer, thereby ensuring discriminative power.
Minimal Confounding
Questions should minimize reliance on confounding capabilities, such as object recognition, and must avoid conceptual overlap with other core knowledge included in the benchmark.
Minimal Text Shortcut
Instances should be crafted so that answers cannot be derived through textual shortcuts alone but require genuine multimodal comprehension.

Expert Collaboration
A total of 12 annotators, each with a college-level education in cognitive science, computer science, or statistics, collaborate on the curation of CoreCognition.
Twelve Core Concepts
Permanence
Objects do not cease to exist when they are no longer perceived.
Continuity
Objects persist as unified, cohesive entities across space and time.
Boundary
The transition from one object to another.
Spatiality
The a priori understanding of the Euclidean properties of the world.
Perceptual Constancy
Changes in appearance do not entail changes in physical properties.
Intuitive Physics
Intuitions about the laws of how things interact in the physical world.
Perspective
To see what others see.
Hierarchy
Understanding of inclusion and exclusion of objects and categories.
Conservation
Invariances of properties despite transformations.
Tool Use
The capacity to manipulate specific objects to achieve goals.
Intentionality
To see what others want.
Mechanical Reasoning
Inferring actions from system states and vice versa.
Dataset Statistics

Key Findings
Our study uncovers four primary shortcomings shared by state-of-the-art MLLMs:
Core Knowledge Deficits
MLLMs excel at higher-level abilities associated with later developmental stages but consistently struggle with lower-level abilities that typically emerge earlier in human cognition.
Misaligned Dependency
Core abilities exhibit weak cross-stage correlations, indicating an absence of developmental scaffolding.
Predictability
Performance on core knowledge is predictive of higher-level abilities.
Not Scaling
MLLMs exhibit limited or no scalability on low-level abilities compared to high-level abilities.
Core Knowledge Deficits
Modern MLLMs exhibit a pronounced "core knowledge deficit": they perform significantly better on higher-level abilities, sometimes comparable to or even surpassing humans, yet struggle with lower-level abilities that are associated with early developmental stages. This disparity is statistically significant and contrasts sharply with human performance, which remains consistently high across all stages.

Dependencies Between Core Abilities
Examining the interdependencies among core abilities provides a principled understanding of whether models develop coherent, hierarchically structured competencies akin to those seen in humans. To quantify the degree of co-variation consistent with developmental hierarchies, we compute Pearson correlations between performances across all 12 abilities. The results reveal a distinct divergence: many correlations are modest (ρ < 0.4), while some clusters exhibit strong alignment (ρ > 0.65).
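A minimal sketch of this analysis, assuming per-model accuracy scores arranged with one column per ability (the numbers below are synthetic placeholders, not our results):

```python
import numpy as np
import pandas as pd

# Synthetic placeholder scores: rows = models, columns = the 12 core abilities.
# In the actual analysis these are the per-ability accuracies of the 230 MLLMs.
rng = np.random.default_rng(0)
abilities = [
    "Permanence", "Continuity", "Boundary", "Spatiality",
    "PerceptualConstancy", "IntuitivePhysics", "Perspective", "Hierarchy",
    "Conservation", "ToolUse", "Intentionality", "MechanicalReasoning",
]
scores = pd.DataFrame(rng.uniform(0.3, 0.9, size=(230, len(abilities))),
                      columns=abilities)

# Pairwise Pearson correlations of ability performance across models.
corr = scores.corr(method="pearson")

# Count weakly coupled pairs (|rho| < 0.40) vs strongly aligned pairs (rho > 0.65).
offdiag = ~np.eye(len(abilities), dtype=bool)
print("weak pairs:", int((corr.abs() < 0.40).values[offdiag].sum()) // 2)
print("strong pairs:", int((corr > 0.65).values[offdiag].sum()) // 2)
```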
Core Abilities Predict Higher-Level Abilities
Strong performance on core abilities reliably predicts higher performance on most high-level abilities and public benchmarks. Concretely, we analyze Pearson correlations between model performance on the 12 core cognitive concepts (across three developmental stages) and the same models' performance on 26 public benchmarks and 9 higher-level abilities defined by SEED-Bench 1 & 2. With the exception of Perspective and Intuitive Physics, core abilities strongly predict performance on public benchmarks (other than ChartQA) and on higher-level abilities. We hypothesize that ChartQA is the outlier because textual understanding is largely orthogonal to the core abilities examined here.
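A minimal sketch of the core-to-benchmark correlation, again on synthetic placeholder scores rather than the actual model results:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic placeholder scores for the same set of models: mean accuracy over
# the 12 core concepts vs. accuracy on one public benchmark (correlated by
# construction here purely for illustration).
rng = np.random.default_rng(1)
core_mean = rng.uniform(0.3, 0.9, size=230)
benchmark = 0.5 * core_mean + rng.normal(0.0, 0.05, size=230)

r, p = pearsonr(core_mean, benchmark)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")
```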

Scaling Effect on Core Knowledge?
Not for low-level abilities! We evaluate the extent to which scaling benefits low-level abilities rooted in core knowledge. By fitting a linear regression to the performance of 230 models of varying sizes on these abilities, we estimate the scaling effect as the slope of the regression line.
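A sketch of how such a slope can be estimated; the sizes and accuracies below are synthetic, and the use of log-scaled parameter counts is an assumption made for illustration:

```python
import numpy as np
from scipy.stats import linregress

# Synthetic placeholders: parameter counts (in billions) and accuracy on one
# low-level ability for 230 models. Accuracy is regressed on log10(size).
rng = np.random.default_rng(2)
log_size = np.log10(rng.uniform(1.0, 110.0, size=230))
accuracy = 0.55 + 0.02 * log_size + rng.normal(0.0, 0.05, size=230)

fit = linregress(log_size, accuracy)  # slope = estimated scaling effect
print(f"scaling slope = {fit.slope:.3f} (p = {fit.pvalue:.3g})")
```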

Our results reveal a clear dissociation between low-level and high-level abilities regarding scaling effects. For seven out of nine low-level abilities in the Sensorimotor and Concrete Operational stages (excluding hierarchical relation and perceptual constancy), model performance improves significantly less with increasing size than it does for the higher-level Formal Operational stage. Notably, perspective-taking ability even declines with scale, likely due to a persistent egocentric bias that intensifies as models grow larger. These findings indicate that scaling primarily benefits high-level reasoning, while its impact on low-level cognitive abilities is limited or even negative. This suggests that simply increasing model size is insufficient for developing core knowledge in MLLMs.
Does Reasoning Help?
To examine whether reasoning and test-time scaling enhance performance on core cognitive abilities, we evaluate both reasoning-augmented models and their corresponding instruction-tuned counterparts.

Reasoning abilities and test-time scaling do not confer a clear advantage over instruction-tuned models. In 10 of 12 core abilities, no significant differences are observed. The two exceptions show no consistent trend: perceptual constancy, where reasoning models perform better (P = 0.0669), and perspective taking, where they perform worse (P = 0.0037). Overall, reasoning models show a modest, non-significant average improvement.
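For illustration, one way to compare matched model pairs on a single ability is a paired t-test, sketched below; the numbers and the choice of test are illustrative only:

```python
import numpy as np
from scipy.stats import ttest_rel

# Illustrative paired accuracies on one core ability: each reasoning model is
# matched with its instruction-tuned counterpart. Values are made up.
reasoning  = np.array([0.62, 0.58, 0.71, 0.55, 0.66, 0.60])
instructed = np.array([0.60, 0.57, 0.68, 0.56, 0.63, 0.59])

stat, p = ttest_rel(reasoning, instructed)
print(f"paired t-test: t = {stat:.2f}, p = {p:.3f}")
```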
Concept Hacking: A Controlled Experiment
We introduce concept hacking, which systematically manipulates task-relevant features while preserving task-irrelevant conditions so that the ground-truth label is inverted. As exemplified in the figure below, 45 samples from CoreCognition are each paired with a manipulated version that asks the identical question but has the opposite correct answer.

Given a pair of tasks, this setup yields three possible MLLM response patterns (a classification sketch follows the list):
Core Knowledge
Correct responses on both controlled and manipulated tasks indicate genuine conceptual understanding.
Shortcut-taking
Models exploiting training data similarities perform well on controlled tasks but fail when familiar patterns are paired with inverted labels.
Core Deficits
Incorrect responses to controlled tasks, regardless of manipulation performance, indicate the absence of core knowledge.
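A minimal sketch of how an answer pair maps onto these three patterns (the function and labels are illustrative, not part of a released codebase):

```python
def classify_response(controlled_correct: bool, manipulated_correct: bool) -> str:
    """Map one controlled/manipulated answer pair to a response pattern.
    The function name and labels are illustrative, not a released API."""
    if controlled_correct and manipulated_correct:
        return "core knowledge"    # consistent understanding on both versions
    if controlled_correct and not manipulated_correct:
        return "shortcut-taking"   # breaks when familiar patterns get inverted labels
    return "core deficit"          # fails the controlled task outright


# Example: correct on the controlled task but wrong on the manipulated one.
print(classify_response(True, False))  # -> shortcut-taking
```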

Citation
If you find this project useful in your research, please consider citing:
@article{li2025core,
title={Core Knowledge Deficits in Multi-Modal Language Models},
author={Li, Yijiang and Gao, Qingying and Zhao, Tianwei and Wang, Bingyang and Sun, Haoran and Lyu, Haiyun and Luo, Dezhi and Deng, Hokin},
journal={arXiv preprint arXiv:2410.10855},
year={2025}
}