This demo uses an LLM-based (GPT-4) evaluator to grade open-ended outputs from your models.
Please upload a JSON file of your model's results in the form {v1_0: ..., v1_1: ..., },
like this json file.
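If your model outputs are held in a Python list, a file of this shape could be written roughly as follows. This is only a sketch under assumptions: the helper names `build_results` and `save_results`, the output path `results.json`, and the sample answers are all hypothetical, and it assumes that each value is the model's answer as a plain string and that keys continue in order (v1_0, v1_1, v1_2, and so on). Check the example json file for the exact format the grader expects.

```python
import json


def build_results(answers):
    """Map answers to keys v1_0, v1_1, ... in list order.

    Assumption: each value is the model's answer as a plain string.
    """
    return {f"v1_{i}": str(answer) for i, answer in enumerate(answers)}


def save_results(answers, path="results.json"):
    """Write the results dictionary to a JSON file and return it."""
    results = build_results(answers)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return results


if __name__ == "__main__":
    # Hypothetical model outputs, for illustration only.
    example_answers = ["The cat is sitting on the sofa.", "x = 3"]
    save_results(example_answers)
```

The file written by `save_results` is the one to upload.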
Grading may take about 5 minutes. Since we only support a single queue, it may take longer if you have to wait for other users' grading to finish.
The grading results will be downloaded as a zip file.