Guidelines for Annotating Data

This is a set of best practices and tips on how to create question-answer pairs.

How To Write Questions?

  • Formulate fact-seeking questions that can be answered with an entity (person, organization, location, and so on) or an explanation.
  • Avoid questions that are ambiguous, incomprehensible, or based on false presuppositions.
  • Avoid opinion-seeking questions that do not request factual information.
  • Ensure that answers to your questions are within the data that you're annotating. Do not formulate questions that can only be answered with additional knowledge or your interpretation.
  • Avoid creating questions that use exactly the same wording as the answers. Reformulate your questions by using synonyms and reordering whenever possible.
  • Create precise, natural questions, the kind you would ask another person if you wanted an answer.
  • Spread your questions across the whole document, and prioritize the important information.
  • Create as many questions as you can. For model evaluation, between 200 and 500 questions is generally a good target. For training, the more the better.
  • Ensure that your questions can be understood without reading the corresponding text passages.
  • Avoid creating closed questions that can only be answered with "Yes" or "No".
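
Two of the rules above (no closed questions, no copying the source wording) can be roughly pre-checked before manual review. The following is only a minimal sketch under stated assumptions, not part of any annotation tool: the function names, the list of auxiliary verbs, and the 0.8 overlap threshold are hypothetical choices made for illustration.

```python
import re

# Assumption: auxiliary verbs that typically open a yes/no question.
CLOSED_STARTERS = {
    "is", "are", "was", "were", "do", "does", "did",
    "can", "could", "will", "would", "should", "has", "have", "had",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_closed_question(question):
    """Return True if the question opens with an auxiliary verb (likely yes/no)."""
    tokens = tokenize(question)
    return bool(tokens) and tokens[0] in CLOSED_STARTERS


def wording_overlap(question, source_sentence):
    """Fraction of distinct question words that also occur in the source sentence."""
    question_words = set(tokenize(question))
    if not question_words:
        return 0.0
    shared = question_words & set(tokenize(source_sentence))
    return len(shared) / len(question_words)


def lint_question(question, source_sentence, max_overlap=0.8):
    """Return a list of warnings; max_overlap is an assumed threshold."""
    warnings = []
    if is_closed_question(question):
        warnings.append("closed question")
    if wording_overlap(question, source_sentence) > max_overlap:
        warnings.append("question reuses source wording")
    return warnings
```

A warning from this check is a prompt to look again, not a verdict. A heuristic like this cannot judge ambiguity, false presuppositions, or whether a question is opinion-seeking, so those rules still need a human reader.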

How To Choose Answers?

  • Always mark whole words as answer spans.
  • For short answers (such as numbers or a few words):
    • Do not include punctuation in your answers.
    • The answer should be as short and as close to spoken human language as possible.
  • For long answers (such as lists of possibilities or multiple sentences):
    • Highlight whole sentences together with punctuation.
    • If the answer is a text passage, don't make it longer than 8 to 10 sentences.
  • If more than one answer is possible, you can add your question multiple times, each time selecting a different answer.
  • If your data contains passages that cannot be labeled (for example, bare lists, garbage text, or data that wasn't properly converted), don't annotate them.
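
The span rules above can also be sketched as a small validator. This is an illustrative example only: it assumes answers are stored as character offsets (start, end) into the passage text, and the five-word cutoff between short and long answers is a hypothetical threshold, since the guidelines do not define one. The sentence count is a rough approximation based on terminal punctuation.

```python
import re
import string


def is_whole_word_span(context, start, end):
    """True if context[start:end] begins and ends on word boundaries."""
    if not (0 <= start < end <= len(context)):
        return False
    span = context[start:end]
    # Reject spans with leading or trailing whitespace.
    if span != span.strip():
        return False
    starts_mid_word = start > 0 and context[start - 1].isalnum() and span[0].isalnum()
    ends_mid_word = end < len(context) and context[end].isalnum() and span[-1].isalnum()
    return not (starts_mid_word or ends_mid_word)


def count_sentences(text):
    """Rough sentence count based on terminal punctuation (an approximation)."""
    return len(re.findall(r"[.!?]+(?:\s|$)", text))


def check_answer(context, start, end, short_max_words=5, long_max_sentences=10):
    """Return a list of guideline violations; both thresholds are assumptions."""
    if not is_whole_word_span(context, start, end):
        return ["span does not cover whole words"]
    span = context[start:end]
    problems = []
    if len(span.split()) <= short_max_words:
        # Short answer: no punctuation at either edge of the span.
        if span[0] in string.punctuation or span[-1] in string.punctuation:
            problems.append("short answer includes punctuation")
    elif count_sentences(span) > long_max_sentences:
        problems.append("long answer exceeds sentence limit")
    return problems
```

For example, in the passage "The Eiffel Tower was completed in 1889. It is 330 metres tall.", the span covering "1889" passes, while the span covering "1889." is flagged for trailing punctuation. Because the cutoff is based on word count, a very short full sentence would also be flagged, so treat the result as a hint for the annotator.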