List of Datasets

Name	Description	License	Reference
advglue advglue-all.json	Adversarial GLUE Benchmark (AdvGLUE) is a comprehensive robustness evaluation benchmark that focuses on the adversarial robustness evaluation of language models.	CC-BY-4.0 license	https://github.com/AI-secure/adversarial-glue
Analogical Similarity analogical-similarity.json	To measure the model's ability to discriminate between different degrees of analogical similarity in two given episodes.	Apache 2.0	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/analogical_similarity
Answercarefully Information Cantonese answercarefully-ca.json	A dataset of security-related questions and answers.	Dataset from NII-LLMC working group - subset created for AISI testing.	Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information Chinese answercarefully-cn.json	A dataset of security-related questions and answers.	Dataset from NII-LLMC working group - subset created for AISI testing	Dataset from NII-LLMC working group - subset created for AISI testing
Answercarefully Information English answercarefully-en.json	A dataset of security-related questions and answers.	Dataset from NII-LLMC working group - subset created for AISI testing	Dataset from NII-LLMC working group - subset created for AISI testing
Answercarefully Information Farsi answercarefully-fa.json	A dataset of security-related questions and answers.	Dataset from NII-LLMC working group - subset created for AISI testing.	Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information French answercarefully-fr.json	A dataset of security-related questions and answers.	Dataset from NII-LLMC working group - subset created for AISI testing.	Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information Japanese answercarefully-jp.json	A dataset of security-related questions and answers.	Dataset from NII-LLMC working group - subset created for AISI testing	Dataset from NII-LLMC working group - subset created for AISI testing
Answercarefully Information Korean answercarefully-kr.json	A dataset of security-related questions and answers.	Dataset from NII-LLMC working group - subset created for AISI testing	Dataset from NII-LLMC working group - subset created for AISI testing
Answercarefully Information Swahili answercarefully-kw.json	A dataset of security-related questions and answers.	Dataset from NII-LLMC working group - subset created for AISI testing.	Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information Malay answercarefully-my.json	A dataset of security-related questions and answers.	Dataset from NII-LLMC working group - subset created for AISI testing.	Dataset from NII-LLMC working group - subset created for AISI testing.
Answercarefully Information Telugu answercarefully-tl.json	A dataset of security-related questions and answers.	Dataset from NII-LLMC working group - subset created for AISI testing.	Dataset from NII-LLMC working group - subset created for AISI testing.
ARC-Challenge arc-challenge.json	A dataset of genuine grade-school level, multiple-choice science questions in advanced Q&A. This is the challenge set.	CC BY-SA	https://allenai.org/data/arc
ARC-Easy arc-easy.json	A dataset of genuine grade-school level, multiple-choice science questions in advanced Q&A. This is the easy set.	CC BY-SA	https://allenai.org/data/arc
auto_categorization auto-categorisation.json	To measure the model's ability to categorise elements into groups.	Apache 2.0	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/auto_categorization
BBQ-lite on age - Ambiguous Questions bbq-lite-age-ambiguous.json	This dataset is an excerpt from Bias Benchmark for QA on age, containing only the ambiguous questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on age - Disambiguated Questions bbq-lite-age-disamb.json	This dataset is an excerpt from Bias Benchmark for QA on age, containing only the disambiguated questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on disability-status - Ambiguous Questions bbq-lite-disability-status-ambiguous.json	This dataset is an excerpt from Bias Benchmark for QA on disability-status, containing only the ambiguous questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on disability-status - Disambiguated Questions bbq-lite-disability-status-disamb.json	This dataset is an excerpt from Bias Benchmark for QA on disability-status, containing only the disambiguated questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on gender - Ambiguous Questions bbq-lite-gender-ambiguous.json	This dataset is an excerpt from Bias Benchmark for QA on gender, containing only the ambiguous questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on gender - Disambiguated Questions bbq-lite-gender-disamb.json	This dataset is an excerpt from Bias Benchmark for QA on gender, containing only the disambiguated questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on nationality - Ambiguous Questions bbq-lite-nationality-ambiguous.json	This dataset is an excerpt from Bias Benchmark for QA on nationality, containing only the ambiguous questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on nationality - Disambiguated Questions bbq-lite-nationality-disamb.json	This dataset is an excerpt from Bias Benchmark for QA on nationality, containing only the disambiguated questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on physical-appearance - Ambiguous Questions bbq-lite-physical-appearance-ambiguous.json	This dataset is an excerpt from Bias Benchmark for QA on physical-appearance, containing only the ambiguous questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on physical-appearance - Disambiguated Questions bbq-lite-physical-appearance-disamb.json	This dataset is an excerpt from Bias Benchmark for QA on physical-appearance, containing only the disambiguated questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race-ethnicity - Ambiguous Questions bbq-lite-race-ethnicity-ambiguous.json	This dataset is an excerpt from Bias Benchmark for QA on race-ethnicity, containing only the ambiguous questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race-ethnicity - Disambiguated Questions bbq-lite-race-ethnicity-disamb.json	This dataset is an excerpt from Bias Benchmark for QA on race-ethnicity, containing only the disambiguated questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race by gender - Ambiguous Questions bbq-lite-race-x-gender-ambiguous.json	This dataset is an excerpt from Bias Benchmark for QA on race by gender, containing only the ambiguous questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race by gender - Disambiguated Questions bbq-lite-race-x-gender-disamb.json	This dataset is an excerpt from Bias Benchmark for QA on race by gender, containing only the disambiguated questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race-x-SES - Ambiguous Questions bbq-lite-race-x-ses-ambiguous.json	This dataset is an excerpt from Bias Benchmark for QA on race-x-SES, containing only the ambiguous questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on race-x-SES - Disambiguated Questions bbq-lite-race-x-ses-disamb.json	This dataset is an excerpt from Bias Benchmark for QA on race-x-SES, containing only the disambiguated questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on religion - Ambiguous Questions bbq-lite-religion-ambiguous.json	This dataset is an excerpt from Bias Benchmark for QA on religion, containing only the ambiguous questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on religion - Disambiguated Questions bbq-lite-religion-disamb.json	This dataset is an excerpt from Bias Benchmark for QA on religion, containing only the disambiguated questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on SES - Ambiguous Questions bbq-lite-ses-ambiguous.json	This dataset is an excerpt from Bias Benchmark for QA on SES, containing only the ambiguous questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on SES - Disambiguated Questions bbq-lite-ses-disamb.json	This dataset is an excerpt from Bias Benchmark for QA on SES, containing only the disambiguated questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on sexual-orientation - Ambiguous Questions bbq-lite-sexual-orientation-ambiguous.json	This dataset is an excerpt from Bias Benchmark for QA on sexual-orientation, containing only the ambiguous questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
BBQ-lite on sexual-orientation - Disambiguated Questions bbq-lite-sexual-orientation-disamb.json	This dataset is an excerpt from Bias Benchmark for QA on sexual-orientation, containing only the disambiguated questions.	CC-BY-4.0 license	https://arxiv.org/pdf/2110.08193v2
cause_and_effect cause-and-effect.json	This recipe measures if the model is able to distinguish and understand cause and effect. The higher the grade, the better the system can perform this capability.	Apache 2.0	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/cause_and_effect
Chinese Version - Bias Benchmark for QA cbbq-lite-age-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-age-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-disability-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-disability-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-disease-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-disease-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-educational-qualification-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-educational-qualification-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-ethnicity-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-ethnicity-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-gender-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-gender-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-household-registration-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-household-registration-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-nationality-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-nationality-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-physical-appearance-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-physical-appearance-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-race-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-race-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-region-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-region-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-religion-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-religion-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-SES-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-SES-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-sexual-orientation-ambiguous.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
Chinese Version - Bias Benchmark for QA cbbq-lite-sexual-orientation-disamb.json	This aims to measure social biases across 14 different categories in the Chinese language.	CC-BY-SA-4.0 license	https://github.com/YFHuangxxxx/CBBQ
ChallengingToxicityPrompts challenging-toxicity-prompts-variation1.json	ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim of measuring the model's ability to identify toxic sentences.	CC BY-SA 4.0	https://github.com/AI-secure/DecodingTrust
ChallengingToxicityPrompts challenging-toxicity-prompts-variation2.json	ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim of measuring the model's ability to identify toxic sentences.	CC BY-SA 4.0	https://github.com/AI-secure/DecodingTrust
ChallengingToxicityPrompts challenging-toxicity-prompts.json	ChallengingToxicityPrompts was created by the authors of DecodingTrust, with the aim of measuring the model's ability to identify toxic sentences.	CC BY-SA 4.0	https://github.com/AI-secure/DecodingTrust
Chinese Linguistics & Cognition Challenge (CLCC) clcc.json	This dataset is a subjective benchmark created by BAAI FlagEval group.	CC-BY-4.0 license	https://flageval.baai.ac.cn/
Ethics-Commonsense-Easy commonsense-morality-easy-variation1.json	This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept.	MIT License	https://github.com/hendrycks/ethics/
Ethics-Commonsense-Easy commonsense-morality-easy-variation2.json	This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept.	MIT License	https://github.com/hendrycks/ethics/
Ethics-Commonsense-Hard commonsense-morality-hard-variation1.json	This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept.	MIT License	https://github.com/hendrycks/ethics/
Ethics-Commonsense-Hard commonsense-morality-hard-variation2.json	This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept.	MIT License	https://github.com/hendrycks/ethics/
Contextual Parametric Knowledge Conflicts contextual-parametric-knowledge-conflicts.json	To measure the model's ability to make informed predictions.	Apache 2.0	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/contextual_parametric_knowledge_conflicts
coqa_conversational_question_answering coqa-conversational-qna.json	To measure the ability of machines to understand a text passage and answer a series of interconnected questions	Apache 2.0	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/coqa_conversational_question_answering
C-Values - Values of Chinese Large Language Models cvalues.json	This dataset contains prompts that can measure the alignment ability of LLMs in both safety and responsibility in the Chinese language.	Apache-2.0 license	https://github.com/X-PLUG/CValues
CyberSecEval Prompt Injection cyberseceval_promptinjection.json	Checks whether the LLM is susceptible to prompt injection.	MIT license	https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 Chinese cyberseceval-promptinjection2-cn.json	A dataset of security-related questions and answers.	MIT license	https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 English cyberseceval-promptinjection2-en.json	A dataset of security-related questions and answers.	MIT license	https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 French cyberseceval-promptinjection2-fr.json	A dataset of security-related questions and answers.	MIT license	https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 Japanese cyberseceval-promptinjection2-jp.json	A dataset of security-related questions and answers.	MIT license	https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 Korean cyberseceval-promptinjection2-kr.json	A dataset of security-related questions and answers.	MIT license	https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 Swahili cyberseceval-promptinjection2-kw.json	A dataset of security-related questions and answers.	MIT license	https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
CyberSecEval Prompt Injection 2 Malay cyberseceval-promptinjection2-my.json	A dataset of security-related questions and answers.	MIT license	https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection
Facts about Asia Pacific in True and False in 4 languages (Chinese, Malay, Tamil and English) facticity-apac-multilungual-facts.json	Contains prompts stating facts about Asia Pacific and Singapore, in True/False format.	Apache-2.0	facticity.ai
uciadult fairness-uciadult.json	The UCI Adult dataset has been widely used to assess fairness.	Creative Commons Attribution 4.0 International	https://archive.ics.uci.edu/dataset/2/adult
Gender Occupational Bias gender-occupation-text2-image-prompts.json	The gender occupational bias is a set of gender neutral text-to-image prompts that are likely to result in models favouring the generation of one gender over the other. The occupations included were proposed in the paper: https://arxiv.org/abs/2211.03759	MIT License	https://arxiv.org/abs/2211.03759
Gender Occupational Bias gender-text2-image-prompts.json	The gender occupational bias is a set of gender neutral text-to-image prompts that are likely to result in models favouring the generation of one gender over the other. The occupations included were proposed in the paper: https://arxiv.org/abs/2211.03759	MIT License	https://arxiv.org/abs/2211.03759
gre_reading_comprehension gre-reading-comprehension.json	To measure the model's ability to summarize text, infer and deduce knowledge from context, and match the context.	Apache 2.0	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/gre_reading_comprehension
GSM8K-TEST gsm8k-test.json	This dataset consists of high-quality grade school math problems created by humans. These problems take between 2 and 8 steps to solve, and usually require a sequence of elementary calculations. The dataset includes only the test set of GSM8K.	MIT License	https://github.com/openai/grade-school-math
GSM8K gsm8k.json	This dataset consists of high-quality grade school math problems created by humans. These problems take between 2 and 8 steps to solve, and usually require a sequence of elementary calculations.	MIT License	https://github.com/openai/grade-school-math
HellaSwag hellaswag.json	This dataset is used to evaluate commonsense with questions that are trivial for humans but difficult for state-of-the-art models.	MIT License	https://github.com/rowanz/hellaswag
Inappropriate Image Prompts (I2P) i2p-text2image-prompts.json	The I2P benchmark contains real user prompts for generative text2image models that are disproportionately likely to produce inappropriate images. I2P was introduced in the 2023 CVPR paper Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models. This benchmark is not specific to any approach or model, but was designed to evaluate mitigating measures against inappropriate degeneration in Stable Diffusion.	MIT License	https://huggingface.co/datasets/AIML-TUDA/i2p
In-the-wild Jailbreak Dataset jailbreak-dan.json	This dataset contains jailbreak prompts that were collected from various places (e.g., Discord).	MIT license	https://github.com/verazuo/jailbreak_llms
m3exam-indonesian m3exam-indonesian-test.json	This dataset contains local exam questions for testing Indonesian capability.	CC BY-NC-SA	https://github.com/DAMO-NLP-SG/M3Exam
m3exam-thai m3exam-thai-test.json	This dataset contains local exam questions for testing Thai capability.	CC BY-NC-SA	https://github.com/DAMO-NLP-SG/M3Exam
m3exam-vietnamese m3exam-vietnamese-test.json	This dataset contains local exam questions for testing Vietnamese capability.	CC BY-NC-SA	https://github.com/DAMO-NLP-SG/M3Exam
MedMCQA medmcqa.json	MedMCQA is a large-scale, multiple-choice question answering dataset designed to address real-world medical entrance exam questions. It contains more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects.	MIT License	https://github.com/MedMCQA/MedMCQA
MedQA (US) medqa-us.json	MedQA is a free-form multiple-choice OpenQA dataset for solving medical problems. These are collected from the professional medical board exams. We extracted the list of MCQ questions from the US in this dataset.	MIT License	https://github.com/jind11/MedQA
MLC Baseline - Child Abuse and Exploitation mlc-cae-malicious-user.json	Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
MLC Baseline - Child Abuse and Exploitation mlc-cae-typical-user.json	Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives mlc-cbr-malicious-user.json	Responses that enable, encourage, or endorse the creation of weapons of mass destruction.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives mlc-cbr-typical-user.json	Responses that enable, encourage, or endorse the creation of weapons of mass destruction.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
MLC Baseline - Hate mlc-hat-malicious-user.json	Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
MLC Baseline - Hate mlc-hat-typical-user.json	Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
ML Commons AILuminate - Intellectual Property Violation - Cantonese mlc-ipv-ca.json	A dataset of security-related questions and answers.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Chinese mlc-ipv-cn.json	A dataset of security-related questions and answers.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - English mlc-ipv-en.json	Test whether responses violate the intellectual property rights of any party.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Farsi mlc-ipv-fa.json	A dataset of security-related questions and answers.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - French mlc-ipv-fr.json	A dataset of security-related questions and answers.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Japanese mlc-ipv-jp.json	A dataset of security-related questions and answers.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Korean mlc-ipv-kr.json	A dataset of security-related questions and answers.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Swahili mlc-ipv-kw.json	A dataset of security-related questions and answers.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Malay mlc-ipv-my.json	A dataset of security-related questions and answers.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Intellectual Property Violation - Telugu mlc-ipv-tl.json	A dataset of security-related questions and answers.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Chinese mlc-ncr-cn.json	A dataset of security-related questions and answers.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - English mlc-ncr-en.json	Test whether responses enable, encourage, or endorse the commission of non-violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Farsi mlc-ncr-fa.json	Test whether responses enable, encourage, or endorse the commission of non-violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - French mlc-ncr-fr.json	Test whether responses enable, encourage, or endorse the commission of non-violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Japanese mlc-ncr-jp.json	Test whether responses enable, encourage, or endorse the commission of non-violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Korean mlc-ncr-kr.json	Test whether responses enable, encourage, or endorse the commission of non-violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Swahili mlc-ncr-kw.json	Test whether responses enable, encourage, or endorse the commission of non-violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Malay mlc-ncr-my.json	Test whether responses enable, encourage, or endorse the commission of non-violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Non-Violent Crimes - Telugu mlc-ncr-tl.json	Test whether responses enable, encourage, or endorse the commission of non-violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
MLC Baseline - Non-Violent Crimes mlc-nvc-malicious-user.json	Responses that enable, encourage, or endorse the commission of non-violent crimes.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
MLC Baseline - Non-Violent Crimes mlc-nvc-typical-user.json	Responses that enable, encourage, or endorse the commission of non-violent crimes.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
ML Commons AILuminate - Privacy - Chinese mlc-prv-cn.json	Test whether responses contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - English mlc-prv-en.json	Test whether responses contain sensitive, nonpublic personal information that could undermine someone’s physical, digital, or financial security.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Farsi mlc-prv-fr.json	Test whether responses contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Japanese mlc-prv-jp.json	Test whether responses contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Korean mlc-prv-kr.json	Test whether responses contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Swahili mlc-prv-kw.json	Test whether responses contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Malay mlc-prv-my.json	Test whether responses contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Privacy - Telugu mlc-prv-tl.json	Test whether responses contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
MLC Baseline - Sex-Related Crimes mlc-scr-malicious-user.json	Responses that enable, encourage, or endorse the commission of sex-related crimes.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
MLC Baseline - Sex-Related Crimes mlc-scr-typical-user.json	Responses that enable, encourage, or endorse the commission of sex-related crimes.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
MLC Baseline - Suicide & Self-Harm mlc-ssh-typical-user.json	Responses that enable, encourage, or endorse acts of intentional self-harm.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
MLC Baseline - Suicide & Self-Harm mlc-ssh-vulnerable-user.json	Responses that enable, encourage, or endorse acts of intentional self-harm.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
ML Commons AILuminate - Violent Crimes - Chinese mlc-vcr-cn.json	Test whether responses enable, encourage, or endorse the commission of violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - English mlc-vcr-en.json	Test whether responses enable, encourage, or endorse the commission of violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - French mlc-vcr-fr.json	Test whether responses enable, encourage, or endorse the commission of violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - Japanese mlc-vcr-jp.json	Test whether responses enable, encourage, or endorse the commission of violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - Korean mlc-vcr-kr.json	Test whether responses enable, encourage, or endorse the commission of violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - Swahili mlc-vcr-kw.json	Test whether responses enable, encourage, or endorse the commission of violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
MLC Baseline - Violent Crimes mlc-vcr-malicious-user.json	Responses that enable, encourage, or endorse the commission of violent crimes.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
ML Commons AILuminate - Violent Crimes - Malay mlc-vcr-my.json	Test whether responses enable, encourage, or endorse the commission of violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
ML Commons AILuminate - Violent Crimes - Telugu mlc-vcr-tl.json	Test whether responses enable, encourage, or endorse the commission of violent crimes.	CC-BY-4.0 license	https://mlcommons.org/ailuminate/
MLC Baseline - Violent Crimes mlc-vcr-typical-user.json	Responses that enable, encourage, or endorse the commission of violent crimes.	CC-BY-4.0 license	https://arxiv.org/html/2404.12241v1
MMLU mmlu-all.json	This dataset covers 57 tasks including elementary mathematics, US history, computer science, law, and more.	MIT License	https://github.com/hendrycks/test
MMLU Anatomy mmlu-anatomy.json	MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Anatomy.	MIT License	https://github.com/hendrycks/test
MMLU Clinical Knowledge mmlu-clinicalknowledge.json	MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Clinical Knowledge.	MIT License	https://github.com/hendrycks/test
MMLU College Biology mmlu-collegebiology.json	MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU College Biology.	MIT License	https://github.com/hendrycks/test
MMLU College Medicine mmlu-collegemedicine.json	MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU College Medicine.	MIT License	https://github.com/hendrycks/test
MMLU Medical Genetics mmlu-medicalgenetics.json	MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Medical Genetics.	MIT License	https://github.com/hendrycks/test
MMLU Professional Medicine mmlu-professionalmedicine.json	MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Professional Medicine.	MIT License	https://github.com/hendrycks/test
List of Occupations occupation.json	This dataset contains a list of occupations collected from various websites (e.g., MoM of Singapore). The dataset can be used to test job-related applications.	MIT License	IMDA
enronemail privacy-enronemails.json	The Enron email dataset is aimed at evaluating if there is leakage of private information. The dataset was used in DecodingTrust as part of its privacy evaluation.	Apache-2.0	https://github.com/ParakweetLabs/EmailIntentDataSet
Jailbreak Prompts prompt_injection_jailbreak.json	This dataset evaluates whether a model/app will detect attempted jailbreaks and generate unsafe content. These jailbreak techniques are collected from X (e.g., Pliny) and public repositories (e.g., in-the-wild jailbreak prompts).	Apache 2.0	OWASP
PubMedQA pubmedqa.json	PubMedQA is a large-scale, multiple-choice question answering dataset designed to address real-world medical entrance exam questions. It contains more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects.	MIT License	https://github.com/MedMCQA/MedMCQA
mock-dataset rag-sample-dataset.json	This is a sample of a dataset to be used for RAG Evaluation.
RealToxicityPrompts real-toxicity-prompts.json	RealToxicityPrompts aims to measure the model's ability to identify toxic sentences.	Apache-2.0	https://github.com/allenai/real-toxicity-prompts/
RealtimeQA realtimeqa-past.json	RealTime QA is a dynamic question answering (QA) platform that inquires about the present.		https://github.com/realtimeqa/realtimeqa_public
sg-legal-glossary sg-legal-glossary.json	A list of Singapore legal terms extracted from SICC and Judiciary websites.		https://www.sicc.gov.sg/glossary-of-legal-terms
sg-university-tutorial-questions-legal sg-university-tutorial-questions-legal.json	Contains tutorial questions and answers from Singapore's universities to test the model's ability to understand the legal context in Singapore.
Facts about Singapore in True and False singapore-facts-tnf.json	Contains prompts with facts about Singapore, in True/False format.	Apache-2.0	IMDA
Food in Singapore singapore-food-tnf.json	Contains prompts that test the model's understanding of food in Singapore, in True/False format.	Apache-2.0	IMDA
Iconic Places in Singapore singapore-iconic-places.json	Contains questions about Singapore's iconic places.	Apache-2.0	IMDA
Places in Singapore singapore-places-tnf.json	Contains prompts that test the model's understanding of places in Singapore, in True/False format.	Apache-2.0	IMDA
Singapore Political History singapore-political-history.json	Contains questions about key events in Singapore's political history.	Apache-2.0	IMDA
Singapore Public Housing singapore-public-housing.json	Contains questions about Singapore's public housing system.	Apache-2.0	IMDA
Safety Benchmark (Singapore Context) singapore-safety-questions.json	Contains prompts that test safety in the Singapore context.	Apache-2.0	IMDA
Singapore Transport System singapore-transport-system.json	Contains questions about Singapore's transport system.	Apache-2.0	IMDA
squad-shifts-tnf squad-shifts-tnf.json	Zero-shot reading comprehension on paragraphs and questions from squadshifts	Apache 2.0	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/squad_shifts
squad-v2 squad-v2.json	Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.	CC BY-SA 4.0	rajpurkar/squad_v2 · Datasets at Hugging Face
tamil-thirukural tamil-kural-classification.json	This dataset is used to test comprehension of the Thirukkural. The Thirukkural is a classic work of Tamil literature composed by the ancient Tamil poet Thiruvalluvar. It consists of 1330 couplets (kurals) that are grouped into 133 chapters, each containing 10 couplets.	Creative Commons Attribution 4.0 International	https://github.com/vijayanandrp/Thirukkural-Tamil-Dataset
tamil-news-classification tamil-tamilnews-classification.json	This dataset is used to classify a static set of Tamil news. The task is to classify each news item into its respective category. The dataset has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world".	GNU General Public License v3.0	https://github.com/vanangamudi/tamil-news-classification/tree/master/dataset/news
tanglish-tweets-SA tamil-tanglish-tweets.json	Code-mixed Tamil and English tweets curated for the sentiment analysis task.	CC0: Public Domain	https://www.kaggle.com/datasets/vyombhatia/tanglish-comments-for-sentiment-ananlysis/data
TruthfulQA (MCQ Version) truthfulqa-mcq.json	TruthfulQA aims to measure the truthfulness of a model. This dataset uses a multiple choice format.	Apache-2.0	https://github.com/sylinrl/TruthfulQA
truthfulqa truthfulqa-multiple-open-ended.json	TruthfulQA aims to measure the truthfulness of a model.	Apache-2.0	https://github.com/sylinrl/TruthfulQA
truthfulqa truthfulqa-open-ended.json	TruthfulQA aims to measure the truthfulness of a model.	Apache-2.0	https://github.com/sylinrl/TruthfulQA
uciadult uciadult.json	The UCI adult dataset, created in 1996, is used to train models to predict whether a person's income will exceed $50K/yr based on census data. Also known as "Census Income" dataset.	Creative Commons Attribution 4.0 International	https://archive.ics.uci.edu/dataset/2/adult
winobias-variation1 winobias-type1.json	This dataset tests for gender bias using professions drawn from the Labor Force Statistics (https://www.bls.gov/cps/cpsaat11.htm), which themselves contain some gender bias.	MIT License	https://github.com/uclanlp/corefBias/tree/master/WinoBias/wino
Winogrande winogrande.json	This dataset is used for commonsense reasoning. It consists of expert-crafted pronoun resolution problems designed to be unsolvable for statistical models.	Apache-2.0	https://github.com/allenai/winogrande