Performance rankings of large language models without a strong LLM reference, human judgments, or gold references
Current benchmarks for measuring the performance of large language models (LLMs) either require human feedback (i.e., gold-standard answers) or rely on strong LLMs for rating. The "rating network" method …