Guidelines for Annotating Data
This is a set of best practices and tips on how to create question-answer pairs.
How To Write Questions?
- Formulate fact-seeking questions that can be answered with an entity (person, organization, location, and so on) or explanation.
- Avoid questions that are ambiguous, incomprehensible, or based on false presuppositions.
- Avoid opinion-seeking questions that do not request factual information.
- Ensure that answers to your questions are within the data that you're annotating. Do not formulate questions that can only be answered with additional knowledge or your interpretation.
- Avoid creating questions that use exactly the same wording as the answers. Reformulate your questions by using synonyms and reordering whenever possible.
- Create precise, natural questions, phrased the way you would ask another person.
- Create questions that cover the whole document, and prioritize the important information.
- The more questions you create, the better. Generally, for model evaluation, it's good to have between 200 and 500 questions. For training, the more the better.
- Ensure that your questions are understandable without the need to read the corresponding text passages.
- Avoid creating closed questions that can only be answered with "Yes" or "No".
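Two of the guidelines above (avoiding closed questions and avoiding wording copied from the text) can be roughly pre-checked in code before a human review. The following is a minimal sketch for English questions, not part of any annotation tool: the function name `question_warnings`, the `CLOSED_STARTERS` word list, and the `max_overlap` threshold of 0.8 are all assumptions made for this illustration.

```python
import re

# Assumed list of words that typically open a yes/no question in English.
CLOSED_STARTERS = {
    "is", "are", "was", "were", "do", "does", "did", "can", "could",
    "will", "would", "has", "have", "had", "should",
}


def _words(text):
    """Return lowercase word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def question_warnings(question, answer_sentence, max_overlap=0.8):
    """Return guideline warnings for one question.

    answer_sentence is the sentence in the document that contains the answer.
    max_overlap is an assumed threshold, not a value from the guidelines.
    """
    warnings = []
    q_words = _words(question)
    sentence_words = set(_words(answer_sentence))

    # A question that opens with an auxiliary verb is usually a yes/no question.
    if q_words and q_words[0] in CLOSED_STARTERS:
        warnings.append("closed question (yes/no)")

    # Share of question words that are copied from the answer sentence.
    if q_words:
        shared = sum(1 for word in q_words if word in sentence_words)
        if shared / len(q_words) >= max_overlap:
            warnings.append("question reuses the passage wording")

    return warnings
```

A check like this only flags candidates for a second look. Whether a question is ambiguous, opinion-seeking, or understandable on its own still needs human judgment.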
How To Choose Answers?
- Always mark whole words as answer spans.
- For short answers (such as numbers or a few words):
  - Do not include punctuation in your answers.
  - The answer should be as short and as close to spoken human language as possible.
- For long answers (such as lists of possibilities or multiple sentences):
  - Highlight whole sentences together with punctuation.
  - If the answer is a text passage, don't make it longer than 8 to 10 sentences.
- If more than one answer is possible, you can add your question multiple times, each time selecting a different answer.
- If your data contains passages that cannot be labeled (for example, bare lists, garbage text, or data that wasn't properly converted), don't annotate them.
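The span rules above can be illustrated with character offsets into the document text. The sketch below is an assumption about how such checks might look, not the behavior of a specific annotation tool: the helper names `is_whole_word_span`, `trim_short_answer`, and `expand_answers` are hypothetical, and spans are assumed to be `(start, end)` offsets with an exclusive end.

```python
import string


def is_whole_word_span(context, start, end):
    """Return True if context[start:end] does not cut through a word."""
    if not 0 <= start < end <= len(context):
        return False
    # A span cuts a word when alphanumeric characters sit on both sides of an edge.
    cuts_left = start > 0 and context[start - 1].isalnum() and context[start].isalnum()
    cuts_right = (
        end < len(context) and context[end - 1].isalnum() and context[end].isalnum()
    )
    return not (cuts_left or cuts_right)


def trim_short_answer(context, start, end):
    """Shrink a short-answer span so it has no surrounding punctuation or spaces."""
    strip_chars = string.punctuation + string.whitespace
    while start < end and context[start] in strip_chars:
        start += 1
    while end > start and context[end - 1] in strip_chars:
        end -= 1
    return start, end


def expand_answers(question, spans):
    """Repeat the question once per possible answer span."""
    return [(question, start, end) for start, end in spans]


context = "The tower, built in 1889, is 330 metres tall."
start = context.index("1889")
print(is_whole_word_span(context, start + 1, start + 4))  # span "889" cuts a word
print(trim_short_answer(context, start, start + 5))       # span "1889," loses the comma
```

Note that `trim_short_answer` strips every leading and trailing ASCII punctuation character, including symbols such as `$` or `%` that may belong to the answer, so its result should be reviewed rather than applied blindly. It is also meant for short answers only, because long answers keep their sentence punctuation.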