Benchmarks and evals are the key barometer for progress towards advanced AI systems.
Standards Drive Innovation
Establishing robust benchmarks provides clear goals for researchers and developers, fostering healthy competition and driving innovation in advanced AI technologies.
Advancing AGI Safely
By establishing stable goalposts for AI capabilities, better benchmarks contribute to the safe and responsible development of artificial general intelligence (AGI), helping researchers monitor progress towards human-level cognitive abilities while mitigating risks associated with unchecked advancement.
Informing Policy
Better benchmarks help policymakers and regulators understand the capabilities and limitations of AI systems, informing the development of policies and regulations that promote safety, fairness, and accountability in AI deployment.
Trust and Adoption
Reliable benchmarks build trust in AI technologies by providing users with a clear understanding of their capabilities and performance characteristics. This promotes wider adoption of AI solutions across various domains, from healthcare to finance.
Benchmarks covering a wide range of human cognition
Why do we need holistic AI benchmarks? Evals and benchmarks provide a standardized framework to gauge the performance and capabilities of AI systems such as LLMs on tasks like logical reasoning, math, coding, accuracy and truthfulness, and more. By comparing the results of different models, benchmarks allow us to understand their strengths, weaknesses, and potentially dangerous capabilities.
| Benchmark | Reasoning | Math | Coding | Quality | Accuracy |
|---|---|---|---|---|---|
| MMLU | ☑ | ☑ | ☑ | ||
| ARC-Reasoning | ☑ | ☑ | ☑ | ||
| HellaSwag | ☑ | ☑ | |||
| GSM-8K | ☑ | ☑ | |||
| HumanEval | ☑ | ||||
| BigBench | ☑ | ||||
| Truthful QA | ☑ | ||||
| CodeXGLUE | ☑ | ||||
| Chatbot Arena | ☑ | ||||
HexAI is different: we focus on more holistic, explainable evaluations of general cognitive capabilities. In other words, we're interested in a system's abilities across the following areas.
Reasoning
- Deductive Reasoning
- Inductive Reasoning
- Abductive Reasoning
- Analogical Reasoning
- Statistical Reasoning
- Probabilistic Reasoning
- Fuzzy Logic Reasoning
- Causal Reasoning
- Bayesian Reasoning
- Hypothetical Reasoning
- Counterfactual Reasoning
- Reflective Reasoning
- Heuristic Reasoning
- Systems Thinking
- Transductive Reasoning
- Constructive Reasoning
- Practical Reasoning
- (12) more...
Autonomy
- Executive Functioning
- Metacognition
- Individual Agency
- Collective Agency
- Moral Agency
- Agency in Artificial Intelligence
- Political Agency
- Subjective Experience
- Self-awareness
- Intentionality
- Temporal Awareness
- Unity of Consciousness
- Conscious Control
- Higher-order Cognition
- Theory of Mind
- (8) more...
Learning
- Associative
- Observational
- Problem-Based Learning
- Scaffolding
- Transfer of Learning
- Short-Term Memory
- Long-Term Memory
- Episodic Memory
- Semantic Memory
- Procedural Memory
- Working Memory
- Declarative Memory
- Autobiographical Memory
- Flashbulb Memory
- Prospective Memory
- Source Memory
- Recognition Memory
- Recall Memory
- Encoding
- Storage
- (13) more...
Real World Problem Solving
- Legal Reasoning
- Risk Assessment
- Business Evaluation
- Forecasting Event Outcomes
- Standardized Tests
- Job Performance Assessments
- Performance Reviews
- Problem-Solving Challenges
- Case Studies
- Simulations
- Project-Based Assessments
- Competitions
- Performance Metrics
- Feedback and Peer Reviews
- Scenario Planning
- Intuition Tests
- Decision-Making Exercises
- Real-world Problem Solving
- Ethical Dilemmas
- Adaptability Assessments
Safety & Alignment
- Self-Improvement
- Duplication
- Dangerous Capabilities
- Capacity for Manipulation
- Deliberate Misinformation
- Harm Awareness
- Moral Reasoning
- Appropriate Discrimination
- Conception of Fairness
- Awareness of Limitations
- Irrational Bias
- Cyber Capacity
- Network Access
- System Access
- Internet Access
- Agentic Behavior