
FlagEval
FlagEval (Tiancheng) is a large model evaluation platform launched by the Beijing Academy of Artificial Intelligence (BAAI).
HELM
The large model evaluation system launched by Stanford University.
HELM, short for Holistic Evaluation of Language Models, is a large model evaluation system launched by Stanford University. Its evaluation method consists of three main modules: scenarios, adaptations, and metrics. Each evaluation run specifies a scenario, an adaptation (the prompt used to adapt the model to that scenario), and one or more metrics. The evaluation focuses mainly on English and covers seven metrics: accuracy, uncertainty/calibration, robustness, fairness, bias, toxicity, and inference efficiency. Tasks include question answering, information retrieval, summarization, and text classification, among others.
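As a rough illustration of how a scenario, an adaptation, and one or more metrics combine into a single evaluation run, the sketch below models the three modules as small Python classes. All names here (`Scenario`, `Adaptation`, `EvalRun`, `exact_match_accuracy`, `toy_model`) are hypothetical and chosen for illustration only; this is not the HELM codebase or its API, and exact-match accuracy stands in for just one of the seven metrics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch: none of these names come from HELM itself.


@dataclass
class Scenario:
    """What is being evaluated: a named set of inputs with reference answers."""
    name: str
    examples: List[Dict[str, str]]  # each dict has "input" and "reference"


@dataclass
class Adaptation:
    """How the model is prompted for the scenario."""
    prompt_template: str  # must contain the placeholder "{input}"

    def render(self, text: str) -> str:
        return self.prompt_template.format(input=text)


# A metric maps (predictions, references) to a single score.
Metric = Callable[[List[str], List[str]], float]


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions that equal their reference after stripping whitespace."""
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


@dataclass
class EvalRun:
    """One evaluation run: a scenario, an adaptation, and one or more metrics."""
    scenario: Scenario
    adaptation: Adaptation
    metrics: Dict[str, Metric]

    def execute(self, model: Callable[[str], str]) -> Dict[str, float]:
        if not self.metrics:
            raise ValueError("at least one metric is required")
        prompts = [self.adaptation.render(ex["input"]) for ex in self.scenario.examples]
        predictions = [model(p) for p in prompts]
        references = [ex["reference"] for ex in self.scenario.examples]
        return {name: fn(predictions, references) for name, fn in self.metrics.items()}


def toy_model(prompt: str) -> str:
    """Stand-in for a language model: answers one question right, one wrong."""
    return "4" if "2+2" in prompt else "London"


scenario = Scenario(
    name="toy_qa",
    examples=[
        {"input": "What is 2+2?", "reference": "4"},
        {"input": "What is the capital of France?", "reference": "Paris"},
    ],
)
run = EvalRun(
    scenario=scenario,
    adaptation=Adaptation("Question: {input}\nAnswer:"),
    metrics={"accuracy": exact_match_accuracy},
)
scores = run.execute(toy_model)
print(scores)  # {'accuracy': 0.5}
```

Keeping the three pieces separate means the same scenario can be rerun with a different prompt or an extra metric without touching the other two, which is the main appeal of this kind of modular design.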