Current benchmarks for measuring the performance of large language models (LLMs) either require human feedback (i.e., gold-standard answers) or rely on strong LLMs for rating. The "rating network" method presented in the following blog post works without either requirement.
https://lardel.li/2023/09/performance-rankings-of-large-language-models-without-strong-llm-reference-or-human-judgments-gold-references.html