List of Datasets
Name | Description | License | Reference |
---|---|---|---|
advglue advglue-all.json | Adversarial GLUE (AdvGLUE) is a comprehensive benchmark for evaluating the adversarial robustness of language models. | CC-BY-4.0 license | https://github.com/AI-secure/adversarial-glue |
Analogical Similarity analogical-similarity.json | Measures the model's ability to discriminate between different degrees of analogical similarity in two given episodes. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/analogical_similarity |
Answercarefully Information Cantonese answercarefully-ca.json | A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. |
Answercarefully Information Chinese answercarefully-cn.json | A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. |
Answercarefully Information English answercarefully-en.json | A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. |
Answercarefully Information Farsi answercarefully-fa.json | A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. |
Answercarefully Information French answercarefully-fr.json | A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. |
Answercarefully Information Japanese answercarefully-jp.json | A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. |
Answercarefully Information Korean answercarefully-kr.json | A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. |
Answercarefully Information Swahili answercarefully-kw.json | A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. |
Answercarefully Information Malay answercarefully-my.json | A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. |
Answercarefully Information Telugu answercarefully-tl.json | A dataset of security-related questions and answers. | Dataset from NII-LLMC working group - subset created for AISI testing. | Dataset from NII-LLMC working group - subset created for AISI testing. |
ARC-Challenge arc-challenge.json | A dataset of genuine grade-school level, multiple-choice science questions in advanced Q&A. This is the challenge set. | CC BY-SA | https://allenai.org/data/arc |
ARC-Easy arc-easy.json | A dataset of genuine grade-school level, multiple-choice science questions in advanced Q&A. This is the easy set. | CC BY-SA | https://allenai.org/data/arc |
auto_categorization auto-categorisation.json | Measures the model's ability to categorise elements into groups. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/auto_categorization |
BBQ-lite on age - Ambiguous Questions bbq-lite-age-ambiguous.json | This dataset is an excerpt from the Bias Benchmark for QA on age, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on age - Disambiguated Questions bbq-lite-age-disamb.json | This dataset is an excerpt from the Bias Benchmark for QA on age, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on disability-status - Ambiguous Questions bbq-lite-disability-status-ambiguous.json | This dataset is an excerpt from the Bias Benchmark for QA on disability-status, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on disability-status - Disambiguated Questions bbq-lite-disability-status-disamb.json | This dataset is an excerpt from the Bias Benchmark for QA on disability-status, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on gender - Ambiguous Questions bbq-lite-gender-ambiguous.json | This dataset is an excerpt from the Bias Benchmark for QA on gender, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on gender - Disambiguated Questions bbq-lite-gender-disamb.json | This dataset is an excerpt from the Bias Benchmark for QA on gender, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on nationality - Ambiguous Questions bbq-lite-nationality-ambiguous.json | This dataset is an excerpt from the Bias Benchmark for QA on nationality, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on nationality - Disambiguated Questions bbq-lite-nationality-disamb.json | This dataset is an excerpt from the Bias Benchmark for QA on nationality, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on physical-appearance - Ambiguous Questions bbq-lite-physical-appearance-ambiguous.json | This dataset is an excerpt from the Bias Benchmark for QA on physical-appearance, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on physical-appearance - Disambiguated Questions bbq-lite-physical-appearance-disamb.json | This dataset is an excerpt from the Bias Benchmark for QA on physical-appearance, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on race-ethnicity - Ambiguous Questions bbq-lite-race-ethnicity-ambiguous.json | This dataset is an excerpt from the Bias Benchmark for QA on race-ethnicity, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on race-ethnicity - Disambiguated Questions bbq-lite-race-ethnicity-disamb.json | This dataset is an excerpt from the Bias Benchmark for QA on race-ethnicity, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on race by gender - Ambiguous Questions bbq-lite-race-x-gender-ambiguous.json | This dataset is an excerpt from the Bias Benchmark for QA on race by gender, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on race by gender - Disambiguated Questions bbq-lite-race-x-gender-disamb.json | This dataset is an excerpt from the Bias Benchmark for QA on race by gender, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on race-x-SES - Ambiguous Questions bbq-lite-race-x-ses-ambiguous.json | This dataset is an excerpt from the Bias Benchmark for QA on race-x-SES, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on race-x-SES - Disambiguated Questions bbq-lite-race-x-ses-disamb.json | This dataset is an excerpt from the Bias Benchmark for QA on race-x-SES, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on religion - Ambiguous Questions bbq-lite-religion-ambiguous.json | This dataset is an excerpt from the Bias Benchmark for QA on religion, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on religion - Disambiguated Questions bbq-lite-religion-disamb.json | This dataset is an excerpt from the Bias Benchmark for QA on religion, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on SES - Ambiguous Questions bbq-lite-ses-ambiguous.json | This dataset is an excerpt from the Bias Benchmark for QA on SES, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on SES - Disambiguated Questions bbq-lite-ses-disamb.json | This dataset is an excerpt from the Bias Benchmark for QA on SES, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on sexual-orientation - Ambiguous Questions bbq-lite-sexual-orientation-ambiguous.json | This dataset is an excerpt from the Bias Benchmark for QA on sexual-orientation, containing only the ambiguous questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
BBQ-lite on sexual-orientation - Disambiguated Questions bbq-lite-sexual-orientation-disamb.json | This dataset is an excerpt from the Bias Benchmark for QA on sexual-orientation, containing only the disambiguated questions. | CC-BY-4.0 license | https://arxiv.org/pdf/2110.08193v2 |
cause_and_effect cause-and-effect.json | This recipe measures whether the model is able to distinguish and understand cause and effect. The higher the grade, the better the system performs this capability. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/cause_and_effect |
Chinese Version - Bias Benchmark for QA cbbq-lite-age-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-age-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-disability-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-disability-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-disease-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-disease-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-educational-qualification-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-educational-qualification-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-ethnicity-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-ethnicity-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-gender-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-gender-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-household-registration-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-household-registration-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-nationality-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-nationality-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-physical-appearance-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-physical-appearance-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-race-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-race-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-region-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-region-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-religion-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-religion-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-SES-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-SES-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-sexual-orientation-ambiguous.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
Chinese Version - Bias Benchmark for QA cbbq-lite-sexual-orientation-disamb.json | This dataset aims to measure social biases across 9 different categories in the Chinese language. | CC-BY-SA-4.0 license | https://github.com/YFHuangxxxx/CBBQ |
ChallengingToxicityPrompts challenging-toxicity-prompts-variation1.json | ChallengingToxicityPrompts was created by the authors of DecodingTrust to measure the model's ability to identify toxic sentences. | CC BY-SA 4.0 | https://github.com/AI-secure/DecodingTrust |
ChallengingToxicityPrompts challenging-toxicity-prompts-variation2.json | ChallengingToxicityPrompts was created by the authors of DecodingTrust to measure the model's ability to identify toxic sentences. | CC BY-SA 4.0 | https://github.com/AI-secure/DecodingTrust |
ChallengingToxicityPrompts challenging-toxicity-prompts.json | ChallengingToxicityPrompts was created by the authors of DecodingTrust to measure the model's ability to identify toxic sentences. | CC BY-SA 4.0 | https://github.com/AI-secure/DecodingTrust |
Chinese Linguistics & Cognition Challenge (CLCC) clcc.json | This dataset is a subjective benchmark created by the BAAI FlagEval group. | CC-BY-4.0 license | https://flageval.baai.ac.cn/ |
Ethics-Commonsense-Easy commonsense-morality-easy-variation1.json | This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. | MIT License | https://github.com/hendrycks/ethics/ |
Ethics-Commonsense-Easy commonsense-morality-easy-variation2.json | This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. | MIT License | https://github.com/hendrycks/ethics/ |
Ethics-Commonsense-Hard commonsense-morality-hard-variation1.json | This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. | MIT License | https://github.com/hendrycks/ethics/ |
Ethics-Commonsense-Hard commonsense-morality-hard-variation2.json | This dataset contains prompts that embed the body of moral standards and principles that most people intuitively accept. | MIT License | https://github.com/hendrycks/ethics/ |
Contextual Parametric Knowledge Conflicts contextual-parametric-knowledge-conflicts.json | Measures the model's ability to make informed predictions. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/contextual_parametric_knowledge_conflicts |
coqa_conversational_question_answering coqa-conversational-qna.json | Measures the ability of machines to understand a text passage and answer a series of interconnected questions. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/coqa_conversational_question_answering |
C-Values - Values of Chinese Large Language Models cvalues.json | This dataset contains prompts that measure the alignment of LLMs on both safety and responsibility in the Chinese language. | Apache-2.0 license | https://github.com/X-PLUG/CValues |
CyberSecEval Prompt Injection cyberseceval_promptinjection.json | Checks whether an LLM is susceptible to prompt injection. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection |
CyberSecEval Prompt Injection 2 Chinese cyberseceval-promptinjection2-cn.json | A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection |
CyberSecEval Prompt Injection 2 English cyberseceval-promptinjection2-en.json | A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection |
CyberSecEval Prompt Injection 2 French cyberseceval-promptinjection2-fr.json | A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection |
CyberSecEval Prompt Injection 2 Japanese cyberseceval-promptinjection2-jp.json | A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection |
CyberSecEval Prompt Injection 2 Korean cyberseceval-promptinjection2-kr.json | A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection |
CyberSecEval Prompt Injection 2 Swahili cyberseceval-promptinjection2-kw.json | A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection |
CyberSecEval Prompt Injection 2 Malay cyberseceval-promptinjection2-my.json | A dataset of security-related questions and answers. | MIT license | https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks/datasets/prompt_injection |
Facts about Asia Pacific in True and False in 4 languages (Chinese, Malay, Tamil and English) facticity-apac-multilungual-facts.json | Contains prompts with facts about Asia Pacific and Singapore, in True/False format. | Apache-2.0 | facticity.ai |
uciadult fairness-uciadult.json | The UCI Adult dataset has been widely used to assess fairness. | Creative Commons Attribution 4.0 International | https://archive.ics.uci.edu/dataset/2/adult |
Gender Occupational Bias gender-occupation-text2-image-prompts.json | The gender occupational bias dataset is a set of gender-neutral text-to-image prompts that are likely to result in models favouring the generation of one gender over the other. The occupations included were proposed in the paper: https://arxiv.org/abs/2211.03759 | MIT License | https://arxiv.org/abs/2211.03759 |
Gender Occupational Bias gender-text2-image-prompts.json | The gender occupational bias dataset is a set of gender-neutral text-to-image prompts that are likely to result in models favouring the generation of one gender over the other. The occupations included were proposed in the paper: https://arxiv.org/abs/2211.03759 | MIT License | https://arxiv.org/abs/2211.03759 |
gre_reading_comprehension gre-reading-comprehension.json | Measures the model's ability to summarize text, infer and deduce knowledge from context, and match the context. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/gre_reading_comprehension |
GSM8K-TEST gsm8k-test.json | This dataset consists of high-quality grade school math problems created by humans. These problems take between 2 and 8 steps to solve and usually require a sequence of elementary calculations. The dataset includes only the test set of GSM8K. | MIT License | https://github.com/openai/grade-school-math |
GSM8K gsm8k.json | This dataset consists of high-quality grade school math problems created by humans. These problems take between 2 and 8 steps to solve and usually require a sequence of elementary calculations. | MIT License | https://github.com/openai/grade-school-math |
HellaSwag hellaswag.json | This dataset is used to evaluate commonsense reasoning with questions that are trivial for humans but difficult for state-of-the-art models. | MIT License | https://github.com/rowanz/hellaswag |
Inappropriate Image Prompts (I2P) i2p-text2image-prompts.json | The I2P benchmark contains real user prompts for generative text-to-image models that are disproportionately likely to produce inappropriate images. I2P was introduced in the 2023 CVPR paper "Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models". This benchmark is not specific to any approach or model, but was designed to evaluate mitigating measures against inappropriate degeneration in Stable Diffusion. | MIT License | https://huggingface.co/datasets/AIML-TUDA/i2p |
In-the-wild Jailbreak Dataset jailbreak-dan.json | This dataset contains jailbreak prompts that were collected from various places (e.g., Discord). | MIT license | https://github.com/verazuo/jailbreak_llms |
m3exam-indonesian m3exam-indonesian-test.json | This dataset contains local exam questions for testing Indonesian capability. | CC BY-NC-SA | https://github.com/DAMO-NLP-SG/M3Exam |
m3exam-thai m3exam-thai-test.json | This dataset contains local exam questions for testing Thai capability. | CC BY-NC-SA | https://github.com/DAMO-NLP-SG/M3Exam |
m3exam-vietnamese m3exam-vietnamese-test.json | This dataset contains local exam questions for testing Vietnamese capability. | CC BY-NC-SA | https://github.com/DAMO-NLP-SG/M3Exam |
MedMCQA medmcqa.json | MedMCQA is a large-scale, multiple-choice question answering dataset designed to address real-world medical entrance exam questions. It contains more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects. | MIT License | https://github.com/MedMCQA/MedMCQA |
MedQA (US) medqa-us.json | MedQA is a free-form multiple-choice OpenQA dataset for solving medical problems, collected from professional medical board exams. This dataset contains the MCQ questions from the US exams. | MIT License | https://github.com/jind11/MedQA |
MLC Baseline - Child Abuse and Exploitation mlc-cae-malicious-user.json | Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
MLC Baseline - Child Abuse and Exploitation mlc-cae-typical-user.json | Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives mlc-cbr-malicious-user.json | Responses that enable, encourage, or endorse the creation of weapons of mass destruction. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives mlc-cbr-typical-user.json | Responses that enable, encourage, or endorse the creation of weapons of mass destruction. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
MLC Baseline - Hate mlc-hat-malicious-user.json | Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
MLC Baseline - Hate mlc-hat-typical-user.json | Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
ML Commons AILuminate - Intellectual Property Violation - Cantonese mlc-ipv-ca.json | A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Intellectual Property Violation - Chinese mlc-ipv-cn.json | A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Intellectual Property Violation - English mlc-ipv-en.json | Test whether responses violate the intellectual property rights of any party. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Intellectual Property Violation - Farsi mlc-ipv-fa.json | A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Intellectual Property Violation - French mlc-ipv-fr.json | A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Intellectual Property Violation - Japanese mlc-ipv-jp.json | A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Intellectual Property Violation - Korean mlc-ipv-kr.json | A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Intellectual Property Violation - Swahili mlc-ipv-kw.json | A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Intellectual Property Violation - Malay mlc-ipv-my.json | A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Intellectual Property Violation - Telugu mlc-ipv-tl.json | A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Non-Violent Crimes - Chinese mlc-ncr-cn.json | A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Non-Violent Crimes - English mlc-ncr-en.json | Test whether responses enable, encourage, or endorse the commission of non-violent crimes. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Non-Violent Crimes - Farsi mlc-ncr-fa.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Non-Violent Crimes - French mlc-ncr-fr.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Non-Violent Crimes - Japanese mlc-ncr-jp.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Non-Violent Crimes - Korean mlc-ncr-kr.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Non-Violent Crimes - Swahili mlc-ncr-kw.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Non-Violent Crimes - Malay mlc-ncr-my.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Non-Violent Crimes - Telugu mlc-ncr-tl.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of non-violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
MLC Baseline - Non-Violent Crimes mlc-nvc-malicious-user.json | Responses that enable, encourage, or endorse the commission of non-violent crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
MLC Baseline - Non-Violent Crimes mlc-nvc-typical-user.json | Responses that enable, encourage, or endorse the commission of non-violent crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
ML Commons AILuminate - Privacy - Chinese mlc-prv-cn.json | This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Privacy - English mlc-prv-en.json | Test whether responses contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Privacy - Farsi mlc-prv-fr.json | This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Privacy - Japanese mlc-prv-jp.json | This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Privacy - Korean mlc-prv-kr.json | This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Privacy - Swahili mlc-prv-kw.json | This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Privacy - Malay mlc-prv-my.json | This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Privacy - Telugu mlc-prv-tl.json | This recipe measures the model's responses in dealing with prompts that seek sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
MLC Baseline - Sex-Related Crimes mlc-scr-malicious-user.json | Responses that enable, encourage, or endorse the commission of sex-related crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
MLC Baseline - Sex-Related Crimes mlc-scr-typical-user.json | Responses that enable, encourage, or endorse the commission of sex-related crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
MLC Baseline - Suicide & Self-Harm mlc-ssh-typical-user.json | Responses that enable, encourage, or endorse acts of intentional self-harm. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
MLC Baseline - Suicide & Self-Harm mlc-ssh-vulnerable-user.json | Responses that enable, encourage, or endorse acts of intentional self-harm. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
ML Commons AILuminate - Violent Crimes - Chinese mlc-vcr-cn.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Violent Crimes - English mlc-vcr-en.json | Test whether responses enable, encourage, or endorse the commission of violent crimes. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Violent Crimes - French mlc-vcr-fr.json | A dataset of security-related questions and answers. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Violent Crimes - Japanese mlc-vcr-jp.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Violent Crimes - Korean mlc-vcr-kr.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Violent Crimes - Swahili mlc-vcr-kw.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
MLC Baseline - Violent Crimes mlc-vcr-malicious-user.json | Responses that enable, encourage, or endorse the commission of violent crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
ML Commons AILuminate - Violent Crimes - Malay mlc-vcr-my.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
ML Commons AILuminate - Violent Crimes - Telugu mlc-vcr-tl.json | This recipe measures the model's responses in dealing with prompts that enable, encourage, or endorse the commission of violent crimes. A high grade indicates that the system under test handles the response well. | CC-BY-4.0 license | https://mlcommons.org/ailuminate/ |
MLC Baseline - Violent Crimes mlc-vcr-typical-user.json | Responses that enable, encourage, or endorse the commission of violent crimes. | CC-BY-4.0 license | https://arxiv.org/html/2404.12241v1 |
MMLU mmlu-all.json | This dataset covers 57 tasks including elementary mathematics, US history, computer science, law, and more. | MIT license | https://github.com/hendrycks/test |
MMLU Anatomy mmlu-anatomy.json | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Anatomy. | MIT License | https://github.com/hendrycks/test |
MMLU Clinical Knowledge mmlu-clinicalknowledge.json | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Clinical Knowledge. | MIT License | https://github.com/hendrycks/test |
MMLU College Biology mmlu-collegebiology.json | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU College Biology. | MIT License | https://github.com/hendrycks/test |
MMLU College Medicine mmlu-collegemedicine.json | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU College Medicine. | MIT License | https://github.com/hendrycks/test |
MMLU Medical Genetics mmlu-medicalgenetics.json | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Medical Genetics. | MIT License | https://github.com/hendrycks/test |
MMLU Professional Medicine mmlu-professionalmedicine.json | MMLU measures knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This particular dataset contains questions on MMLU Professional Medicine. | MIT License | https://github.com/hendrycks/test |
List of Occupations occupation.json | This dataset contains a list of occupations collected from various websites (e.g., MOM of Singapore). The dataset can be used to test job-related applications. | MIT License | IMDA |
enronemail privacy-enronemails.json | The Enron email dataset is aimed at evaluating whether there is leakage of private information. The dataset was used in DecodingTrust as part of its privacy evaluation. | Apache-2.0 | https://github.com/ParakweetLabs/EmailIntentDataSet |
Jailbreak Prompts prompt_injection_jailbreak.json | This dataset evaluates whether a model/app will detect attempted jailbreaks and generate unsafe content. These jailbreak techniques are collected from X (e.g., Pliny) and public repositories (e.g., in-the-wild jailbreak prompts). | Apache 2.0 | OWASP |
PubMedQA pubmedqa.json | PubMedQA is a biomedical question answering dataset collected from PubMed abstracts. The task is to answer research questions with yes/no/maybe using the corresponding abstracts. | MIT License | https://github.com/pubmedqa/pubmedqa |
mock-dataset rag-sample-dataset.json | This is a sample dataset to be used for RAG evaluation. | | |
RealToxicityPrompts real-toxicity-prompts.json | RealToxicityPrompts aims to measure the model's ability to identify toxic sentences. | Apache-2.0 | https://github.com/allenai/real-toxicity-prompts/ |
RealtimeQA realtimeqa-past.json | RealTime QA is a dynamic question answering (QA) platform that inquires about the present. | | https://github.com/realtimeqa/realtimeqa_public |
sg-legal-glossary sg-legal-glossary.json | A list of Singapore legal terms extracted from the SICC and Judiciary websites. | | https://www.sicc.gov.sg/glossary-of-legal-terms |
sg-university-tutorial-questions-legal sg-university-tutorial-questions-legal.json | Contains tutorial questions and answers from Singapore's universities to test the model's ability to understand the legal context in Singapore. | | |
Facts about Singapore in True and False singapore-facts-tnf.json | Contains prompts with facts about Singapore, in True/False format. | Apache-2.0 | IMDA |
Food in Singapore singapore-food-tnf.json | Contains prompts that test the model's understanding of food in Singapore, in True/False format. | Apache-2.0 | IMDA |
Iconic Places in Singapore singapore-iconic-places.json | Contains questions about Singapore's iconic places. | Apache-2.0 | IMDA |
Places in Singapore singapore-places-tnf.json | Contains prompts that test the model's understanding of places in Singapore, in True/False format. | Apache-2.0 | IMDA |
Singapore Political History singapore-political-history.json | Contains questions about key events in Singapore's political history. | Apache-2.0 | IMDA |
Singapore Public Housing singapore-public-housing.json | Contains questions about Singapore's public housing system. | Apache-2.0 | IMDA |
Safety Benchmark (Singapore Context) singapore-safety-questions.json | Contains prompts that test safety in the Singapore context. | Apache-2.0 | IMDA |
Singapore Transport System singapore-transport-system.json | Contains questions about Singapore's transport system. | Apache-2.0 | IMDA |
squad-shifts-tnf squad-shifts-tnf.json | Zero-shot reading comprehension on paragraphs and questions from SQuADShifts. | Apache 2.0 | https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/squad_shifts |
squad-v2 squad-v2.json | Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. | CC BY-SA 4.0 | https://huggingface.co/datasets/rajpurkar/squad_v2 |
tamil-thirukural tamil-kural-classification.json | This dataset is used to test comprehension abilities for the Thirukkural. The Thirukkural is a classic Tamil literary work composed by the ancient Tamil poet Thiruvalluvar. It consists of 1330 couplets (kurals) grouped into 133 chapters, each containing 10 couplets. | Creative Commons Attribution 4.0 International | https://github.com/vijayanandrp/Thirukkural-Tamil-Dataset |
tamil-news-classification tamil-tamilnews-classification.json | This dataset is used to classify a static set of Tamil news. The task is to classify each news item into its respective category. The dataset has 6 news categories: "tamilnadu", "india", "cinema", "sports", "politics", "world". | GNU General Public License v3.0 | https://github.com/vanangamudi/tamil-news-classification/tree/master/dataset/news |
tanglish-tweets-SA tamil-tanglish-tweets.json | Code-mixed Tamil and English tweets curated for the sentiment analysis task. | CC0: Public Domain | https://www.kaggle.com/datasets/vyombhatia/tanglish-comments-for-sentiment-ananlysis/data |
TruthfulQA (MCQ Version) truthfulqa-mcq.json | TruthfulQA aims to measure the truthfulness of a model. This dataset uses a multiple-choice format. | Apache-2.0 | https://github.com/sylinrl/TruthfulQA |
truthfulqa truthfulqa-multiple-open-ended.json | TruthfulQA aims to measure the truthfulness of a model. | Apache-2.0 | https://github.com/sylinrl/TruthfulQA |
truthfulqa truthfulqa-open-ended.json | TruthfulQA aims to measure the truthfulness of a model. | Apache-2.0 | https://github.com/sylinrl/TruthfulQA |
uciadult uciadult.json | The UCI Adult dataset, created in 1996, is used to train models to predict whether a person's income will exceed $50K/yr based on census data. Also known as the "Census Income" dataset. | Creative Commons Attribution 4.0 International | https://archive.ics.uci.edu/dataset/2/adult |
winobias-variation1 winobias-type1.json | This dataset contains gender-bias examples based on professions from the Labor Force Statistics (https://www.bls.gov/cps/cpsaat11.htm), which exhibit some gender imbalance. | MIT License | https://github.com/uclanlp/corefBias/tree/master/WinoBias/wino |
Winogrande winogrande.json | This dataset is used for commonsense reasoning; it contains expert-crafted pronoun resolution problems designed to be unsolvable for statistical models. | Apache-2.0 | https://github.com/allenai/winogrande |
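
Each entry above is a JSON file, but this list does not specify their internal schema. As a minimal sketch (assuming only that the files are well-formed JSON and that the path `datasets/arc-easy.json` is where your copy lives), the snippet below loads one file and reports its top-level structure without assuming any particular field names.

```python
import json
from pathlib import Path

# Hypothetical path to one of the dataset files listed above; adjust to your layout.
DATASET_PATH = Path("datasets/arc-easy.json")


def summarise_dataset(path: Path) -> None:
    """Load a dataset JSON file and print a rough summary of its contents.

    The schema is not documented in this list, so only the top-level
    structure is reported rather than specific field names.
    """
    with path.open(encoding="utf-8") as f:
        data = json.load(f)

    if isinstance(data, dict):
        print(f"{path.name}: top-level keys -> {sorted(data.keys())}")
        for key, value in data.items():
            if isinstance(value, list):
                print(f"  '{key}' holds {len(value)} items")
    elif isinstance(data, list):
        print(f"{path.name}: list of {len(data)} items")


if __name__ == "__main__":
    summarise_dataset(DATASET_PATH)
```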