Advanced Benchmarks and Evaluations for Artificial Intelligence Systems.

  • Fixed Goalposts: If we can identify the collective cognitive functions that matter in AI and apply effective benchmarks, we will have more confidence in gauging progress towards human-level AGI.

  • Safety & Alignment Focused: AI and human-level AGI can bring incredible social benefit, but like other information technologies, they also come with risks and must be released and used responsibly.

  • Efficient and Lean: We believe in holistic and highly optimized solutions, and we practice what we preach. The average webpage in 2024 is 2.2 MB. At less than 14 kB, this entire webpage is roughly 150 times smaller!
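The size comparison above can be checked with a quick calculation. This is a minimal sketch that assumes decimal units (1 MB = 1,000,000 bytes, 1 kB = 1,000 bytes) and treats 14 kB as an upper bound on this page's size; the constant and function names are illustrative only.

```python
# Figures quoted in the text above; decimal units are an assumption.
AVERAGE_PAGE_BYTES = 2.2 * 1000 * 1000  # average webpage in 2024: 2.2 MB
THIS_PAGE_BYTES = 14 * 1000             # this page: less than 14 kB (upper bound)


def size_ratio(larger_bytes: float, smaller_bytes: float) -> float:
    """Return how many times larger the first size is than the second."""
    if smaller_bytes <= 0:
        raise ValueError("smaller_bytes must be positive")
    return larger_bytes / smaller_bytes


ratio = size_ratio(AVERAGE_PAGE_BYTES, THIS_PAGE_BYTES)
print(f"The average page is at least {ratio:.0f}x larger than this one")
```

Because 14 kB is an upper bound, the true ratio is slightly higher than the computed figure, which is consistent with the "roughly 150 times" claim.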

Benchmarks and evals are the key barometer for progress towards advanced AI systems.

Standards Drive Innovation

Establishing robust benchmarks provides clear goals for researchers and developers, fostering healthy competition and driving innovation in advanced AI technologies.

Advancing AGI Safely

By establishing stable goalposts for AI capabilities, better benchmarks contribute to the safe and responsible development of artificial general intelligence (AGI), helping researchers monitor progress towards human-level cognitive abilities while mitigating risks associated with unchecked advancement.

Informing Policy

Better benchmarks help policymakers and regulators understand the capabilities and limitations of AI systems, informing the development of policies and regulations that promote safety, fairness, and accountability in AI deployment.

Trust and Adoption

Reliable benchmarks build trust in AI technologies by providing users with a clear understanding of their capabilities and performance characteristics. This promotes wider adoption of AI solutions across various domains, from healthcare to finance.

Benchmarks covering a wide range of human cognition

Why do we need holistic AI benchmarks? Evals and benchmarks provide a standardized framework to gauge the performance and capabilities of AI systems such as LLMs in various tasks like logical reasoning, math, coding, accuracy and truthfulness, and more. By comparing the results of different models, benchmarks allow us to understand their strengths, weaknesses, and potentially dangerous capabilities.

A few of the leading conventional benchmarks:

Reasoning Math Coding Quality Accuracy
MMLU
ARC-Reasoning
HellaSwag
GSM-8K
HumanEval
BigBench
TruthfulQA
CodeXGLUE
TruthfulQA
Chatbot Arena
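To illustrate how comparing results across models reveals strengths and weaknesses, here is a minimal sketch. The model names and scores are hypothetical placeholders, not real results, and the helper functions are our own illustration rather than part of any benchmark's tooling.

```python
from statistics import mean

# Hypothetical placeholder scores (fraction of tasks solved), NOT real results.
SCORES = {
    "model_a": {"MMLU": 0.70, "GSM8K": 0.55, "HumanEval": 0.40, "TruthfulQA": 0.60},
    "model_b": {"MMLU": 0.65, "GSM8K": 0.75, "HumanEval": 0.50, "TruthfulQA": 0.45},
}


def best_per_benchmark(scores: dict) -> dict:
    """Map each benchmark to the model with the highest score on it."""
    benchmarks = sorted({name for results in scores.values() for name in results})
    return {
        name: max(scores, key=lambda model: scores[model].get(name, float("-inf")))
        for name in benchmarks
    }


def profile(scores: dict, model: str) -> dict:
    """Summarize one model: strongest benchmark, weakest benchmark, mean score."""
    results = scores[model]
    return {
        "strongest": max(results, key=results.get),
        "weakest": min(results, key=results.get),
        "mean": mean(results.values()),
    }


if __name__ == "__main__":
    print(best_per_benchmark(SCORES))
    for model_name in SCORES:
        print(model_name, profile(SCORES, model_name))
```

A per-benchmark view like this shows where one model leads and another lags, which a single averaged score would hide.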

HexAI is different: we focus on more holistic, explainable evaluations of general cognitive capabilities. In other words, we're interested in a system's ability across five areas: reasoning, autonomy, learning, real-world problem solving, and safety and alignment.

HΞXR

Reasoning

  • Deductive Reasoning
  • Inductive Reasoning
  • Abductive Reasoning
  • Analogical Reasoning
  • Statistical Reasoning
  • Probabilistic Reasoning
  • Fuzzy Logic Reasoning
  • Causal Reasoning
  • Bayesian Reasoning
  • Hypothetical Reasoning
  • Counterfactual Reasoning
  • Reflective Reasoning
  • Heuristic Reasoning
  • Systems Thinking
  • Transductive Reasoning
  • Constructive Reasoning
  • Practical Reasoning
  • (12) more...

HΞXA

Autonomy

  • Executive Functioning
  • Metacognition
  • Individual Agency
  • Collective Agency
  • Moral Agency
  • Agency in Artificial Intelligence
  • Political Agency
  • Subjective Experience
  • Self-awareness
  • Intentionality
  • Temporal Awareness
  • Unity of Consciousness
  • Conscious Control
  • Higher-order Cognition
  • Theory of Mind
  • (8) more...

HΞXL

Learning

  • Associative Learning
  • Observational Learning
  • Problem-Based Learning
  • Scaffolding
  • Transfer of Learning
  • Short-Term Memory
  • Long-Term Memory
  • Episodic Memory
  • Semantic Memory
  • Procedural Memory
  • Working Memory
  • Declarative Memory
  • Autobiographical Memory
  • Flashbulb Memory
  • Prospective Memory
  • Source Memory
  • Recognition Memory
  • Recall Memory
  • Encoding
  • Storage
  • (13) more...

HΞXps

Real World Problem Solving

  • Legal Reasoning
  • Risk Assessment
  • Business Evaluation
  • Forecasting Event Outcomes
  • Standardized Tests
  • Job Performance Assessments
  • Performance Reviews
  • Problem-Solving Challenges
  • Case Studies
  • Simulations
  • Project-Based Assessments
  • Competitions
  • Performance Metrics
  • Feedback and Peer Reviews
  • Scenario Planning
  • Intuition Tests
  • Decision-Making Exercises
  • Real-world Problem Solving
  • Ethical Dilemmas
  • Adaptability Assessments

HΞXS

Safety & Alignment

  • Self-Improvement
  • Duplication
  • Dangerous Capabilities
  • Capacity for Manipulation
  • Deliberate Misinformation
  • Harm Awareness
  • Moral Reasoning
  • Appropriate Discrimination
  • Conception of Fairness
  • Awareness of Limitations
  • Irrational Bias
  • Cyber Capacity
  • Network Access
  • System Access
  • Internet Access
  • Agentic Behavior