Performance rankings of large language models without strong LLM reference or human judgments / gold references

Current benchmarks for measuring the performance of large language models (LLMs) require either human judgments (i.e., gold-standard answers) or rely on strong LLMs as raters. The "rating network" method presented in the following blog post works without either requirement.

https://lardel.li/2023/09/performance-rankings-of-large-language-models-without-strong-llm-reference-or-human-judgments-gold-references.html
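The peer-rating idea could be sketched roughly as follows. This is a hypothetical illustration, not the method from the linked post: the ratings and the aggregation rule (averaging the scores each model receives from its peers) are assumptions made for demonstration only.

```python
# Hypothetical sketch of a "rating network": each of N models rates every
# other model's answers, and models are ranked by the average rating they
# receive. No gold-standard answers and no single strong judge are needed.
# All numbers below are invented for illustration.

# ratings[i][j] = score model i gives model j's answer (0-10);
# self-ratings on the diagonal are ignored.
ratings = [
    [0, 7, 4, 6],
    [8, 0, 3, 7],
    [9, 8, 0, 6],
    [7, 6, 4, 0],
]

n = len(ratings)
# Average rating each model receives from its n-1 peers.
received = [sum(ratings[i][j] for i in range(n) if i != j) / (n - 1)
            for j in range(n)]

# Rank models from best to worst by average received rating.
ranking = sorted(range(n), key=lambda j: -received[j])
for rank, j in enumerate(ranking, 1):
    print(f"{rank}. model {j}: {received[j]:.2f}")
```

A more robust aggregation (e.g. weighting raters by their own ranking, as in eigenvector-style centrality) would follow the same structure; the averaging above is the simplest possible choice.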