HEXAI.org - Benchmarks and Evals for LLMs and AI

Benchmarks and evals are the key barometer for progress towards advanced AI systems.

➲

Standards Drive Innovation

Establishing robust benchmarks provides clear goals for researchers and developers, fostering healthy competition and driving innovation in advanced AI technologies.

⚠

Advancing AGI Safely

By establishing stable goalposts for AI capabilities, better benchmarks contribute to the safe and responsible development of artificial general intelligence (AGI), helping researchers monitor progress towards human-level cognitive abilities while mitigating risks associated with unchecked advancement.

⚖

Informing Policy

Better benchmarks help policymakers and regulators understand the capabilities and limitations of AI systems, informing the development of policies and regulations that promote safety, fairness, and accountability in AI deployment.

❤

Trust and Adoption

Reliable benchmarks build trust in AI technologies by providing users with a clear understanding of their capabilities and performance characteristics. This promotes wider adoption of AI solutions across various domains, from healthcare to finance.

Benchmarks covering a wide range of human cognition

Why do we need holistic AI benchmarks? Evals and benchmarks provide a standardized framework for gauging the performance and capabilities of AI systems such as LLMs on tasks like logical reasoning, math, coding, accuracy and truthfulness, and more. By comparing the results of different models, benchmarks allow us to understand their strengths, weaknesses, and potentially dangerous capabilities.

A few of the leading conventional benchmarks:

	Reasoning	Math	Coding	Accuracy
MMLU	☑	☑		☑
ARC-Reasoning	☑	☑		☑
HellaSwag	☑			☑
GSM-8K	☑			☑
HumanEval			☑
BigBench				☑
Truthful QA				☑
CodeXGLUE			☑
Chatbot Arena				☑
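
As a rough illustration of how the coverage table above can be read, the sketch below transcribes its checkmarks into a Python dictionary and counts how many of the listed benchmarks touch each category. The names `COVERAGE`, `benchmarks_per_category`, and `single_category_benchmarks` are hypothetical helpers introduced here for illustration; they are not part of any HexAI tooling, and the data is only the checkmarks shown in the table.

```python
# Checkmarks transcribed from the table above: benchmark -> categories marked.
# This is an illustrative sketch, not an official HexAI data format.
CATEGORIES = ["Reasoning", "Math", "Coding", "Accuracy"]

COVERAGE = {
    "MMLU": {"Reasoning", "Math", "Accuracy"},
    "ARC-Reasoning": {"Reasoning", "Math", "Accuracy"},
    "HellaSwag": {"Reasoning", "Accuracy"},
    "GSM-8K": {"Reasoning", "Accuracy"},
    "HumanEval": {"Coding"},
    "BigBench": {"Accuracy"},
    "Truthful QA": {"Accuracy"},
    "CodeXGLUE": {"Coding"},
    "Chatbot Arena": {"Accuracy"},
}


def benchmarks_per_category(coverage):
    """Map each category to the sorted list of benchmarks that mark it."""
    return {
        category: sorted(name for name, cats in coverage.items() if category in cats)
        for category in CATEGORIES
    }


def single_category_benchmarks(coverage):
    """Return the benchmarks that mark exactly one category, sorted by name."""
    return sorted(name for name, cats in coverage.items() if len(cats) == 1)


if __name__ == "__main__":
    for category, names in benchmarks_per_category(COVERAGE).items():
        print(f"{category}: {len(names)} benchmark(s): {', '.join(names)}")
    print("Single-category benchmarks:", ", ".join(single_category_benchmarks(COVERAGE)))
```

Read this way, most of the listed benchmarks mark only one or two of the four columns, which is the gap a more holistic evaluation aims to close.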

HexAI is different: we focus on more holistic, explainable evaluations of general cognitive capabilities. In other words, we're interested in a system's ability to reason, act autonomously, learn, solve real-world problems, and remain safe and aligned.

HΞX_R

Reasoning

Deductive Reasoning
Inductive Reasoning
Abductive Reasoning
Analogical Reasoning
Statistical Reasoning
Probabilistic Reasoning
Fuzzy Logic Reasoning
Causal Reasoning
Bayesian Reasoning
Hypothetical Reasoning
Counterfactual Reasoning
Reflective Reasoning
Heuristic Reasoning
Systems Thinking
Transductive Reasoning
Constructive Reasoning
Practical Reasoning
(12) more...

HΞX_A

Autonomy

Executive Functioning
Metacognition
Individual Agency
Collective Agency
Moral Agency
Agency in Artificial Intelligence
Political Agency
Subjective Experience
Self-awareness
Intentionality
Temporal Awareness
Unity of Consciousness
Conscious Control
Higher-order Cognition
Theory of Mind
(8) more...

HΞX_L

Learning

Associative Learning
Observational Learning
Problem-Based Learning
Scaffolding
Transfer of Learning
Short-Term Memory
Long-Term Memory
Episodic Memory
Semantic Memory
Procedural Memory
Working Memory
Declarative Memory
Autobiographical Memory
Flashbulb Memory
Prospective Memory
Source Memory
Recognition Memory
Recall Memory
Encoding
Storage
(13) more...

HΞX_ps

Real World Problem Solving

Legal Reasoning
Risk Assessment
Business Evaluation
Forecasting Event Outcomes
Standardized Tests
Job Performance Assessments
Performance Reviews
Problem-Solving Challenges
Case Studies
Simulations
Project-Based Assessments
Competitions
Performance Metrics
Feedback and Peer Reviews
Scenario Planning
Intuition Tests
Decision-Making Exercises
Real-world Problem Solving
Ethical Dilemmas
Adaptability Assessments

HΞX_S

Safety & Alignment

Self-Improvement
Duplication
Dangerous Capabilities
Capacity for Manipulation
Deliberate Misinformation
Harm Awareness
Moral Reasoning
Appropriate Discrimination
Conception of Fairness
Awareness of Limitations
Irrational Bias
Cyber Capacity
Network Access
System Access
Internet Access
Agentic Behavior