To evaluate the MISeD data, we compare it with a dataset collected using the traditional WOZ approach. A “user” annotator was given the general context for a meeting and asked questions about it, while an “agent” annotator used the full transcript to provide answers and supporting attribution. This WOZ test set contains 70 dialogs (700 query-response pairs). It serves as an unbiased test set, revealing model performance on fully human-generated data. We found that WOZ annotation took roughly 1.5 times longer than MISeD annotation.
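For concreteness, the sketch below shows one way such query-response pairs with supporting attribution could be represented; the field names (`meeting_id`, `query`, `response`, `attribution`) are illustrative placeholders rather than the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout for a single dialog turn; the real MISeD/WOZ
# field names may differ from these illustrative ones.
@dataclass
class DialogTurn:
    query: str                 # question from the "user" annotator
    response: str              # answer from the "agent" annotator
    attribution: List[str] = field(default_factory=list)  # supporting transcript spans

@dataclass
class Dialog:
    meeting_id: str
    turns: List[DialogTurn] = field(default_factory=list)

# Example: one turn of a WOZ dialog (each of the 70 dialogs contributes
# multiple query-response pairs, 700 in total).
example = Dialog(
    meeting_id="meeting_001",
    turns=[DialogTurn(
        query="What was decided about the budget?",
        response="The team agreed to defer the decision to next week.",
        attribution=["Speaker A: let's revisit the budget next week ..."],
    )],
)
```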
We compared the performance of three model types: an encoder-decoder model (LongT5 XL) fine-tuned on MISeD for long contexts (16k tokens); LLMs (Gemini Pro and Gemini Ultra) prompted with the transcript and query (28k tokens); and an LLM (Gemini Pro) fine-tuned on MISeD, using the same prompt and context length as above.
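As a rough illustration of the prompted setup, the sketch below assembles a transcript-plus-query prompt and truncates the transcript to fit a token budget. The prompt template and the whitespace-based token count are simplifying assumptions, not the exact configuration used with the Gemini models.

```python
def build_agent_prompt(transcript: str, query: str, max_tokens: int = 28_000) -> str:
    """Assemble a transcript+query prompt within a token budget.

    Whitespace splitting is a rough stand-in for the model's real tokenizer,
    and the template text here is purely illustrative.
    """
    header = "You are a meeting assistant. Answer the question using the transcript.\n"
    question = f"\nQuestion: {query}\nAnswer:"
    budget = max_tokens - len((header + question).split())
    truncated = " ".join(transcript.split()[:budget])
    return f"{header}Transcript:\n{truncated}{question}"
```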
We fine-tuned the agent models on the MISeD training set (2,922 training examples). Automatic evaluation was computed on the full test sets (628 MISeD queries, 700 WOZ queries), while manual evaluation was run on a random subset of 100 queries from each test set.
We evaluate the agent models along two dimensions: the quality of the generated responses and the accuracy of the provided attributions, using both automatic and human evaluation. Our evaluation methodologies are described in our paper; an illustrative sketch of how such automatic scoring can be computed is shown below, followed by the results:
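As an illustration only (the exact metrics are defined in the paper), a minimal automatic-scoring sketch might pair a ROUGE-L measure of response quality with span-level F1 for attribution accuracy; both metric choices and the function names below are assumptions.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def score_response(reference: str, prediction: str) -> float:
    """Response quality as ROUGE-L F1 against the reference answer (illustrative choice)."""
    return scorer.score(reference, prediction)["rougeL"].fmeasure

def score_attribution(gold_spans: set, predicted_spans: set) -> float:
    """Attribution accuracy as F1 over attributed transcript-span identifiers (illustrative choice)."""
    if not gold_spans or not predicted_spans:
        return 0.0
    true_positives = len(gold_spans & predicted_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)

# Example usage on a single query-response pair:
print(score_response("The budget decision was deferred to next week.",
                     "They postponed the budget decision until next week."))
print(score_attribution({"turn_12", "turn_13"}, {"turn_12"}))
```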