AnswerCorrectness

Evaluates the correctness of answers in a dataset with respect to the provided ground truths and visualizes the results in a histogram.

Answer Correctness gauges the accuracy of the generated answer when compared to the ground truth. The evaluation relies on both the ground truth and the answer, and scores range from 0 to 1: a higher score indicates closer alignment between the generated answer and the ground truth, signifying better correctness.

Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, and factual similarity. These aspects are combined using a weighted scheme to produce the answer correctness score. Users can also supply a threshold value to round the resulting score to a binary value (0 or 1).
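The weighted scheme described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name, the default weights of 0.75 (factual) and 0.25 (semantic), and the `>=` threshold comparison are assumptions for demonstration.

```python
def answer_correctness(factual_score: float, semantic_score: float,
                       weights=(0.75, 0.25), threshold=None) -> float:
    """Combine factual and semantic similarity into one score in [0, 1].

    If a threshold is given, the weighted score is rounded to a
    binary value: 1.0 at or above the threshold, 0.0 below it.
    """
    w_factual, w_semantic = weights
    score = w_factual * factual_score + w_semantic * semantic_score
    if threshold is not None:
        return 1.0 if score >= threshold else 0.0
    return score


# Continuous score: 0.75 * 0.8 + 0.25 * 0.6 = 0.75
print(answer_correctness(0.8, 0.6))
# With a threshold, the same inputs round up to 1.0
print(answer_correctness(0.8, 0.6, threshold=0.5))
```

The threshold is useful when a downstream check only needs a pass/fail decision per row rather than a graded score.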

Factual correctness quantifies the factual overlap between the generated answer and the ground truth answer. This is done using the concepts of:

  • TP (True Positive): Facts or statements that are present in both the ground truth and the generated answer.
  • FP (False Positive): Facts or statements that are present in the generated answer but not in the ground truth.
  • FN (False Negative): Facts or statements that are present in the ground truth but not in the generated answer.
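From these counts, factual overlap is commonly summarized as an F1-style score over the classified statements. The sketch below shows that standard computation; treat it as an illustration of the idea rather than the exact formula this metric uses.

```python
def factual_f1(tp: int, fp: int, fn: int) -> float:
    """F1-style factual overlap from statement counts.

    tp: statements present in both answer and ground truth
    fp: statements only in the generated answer
    fn: statements only in the ground truth
    """
    if tp == 0:
        # No overlapping facts at all.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 3 shared facts, 1 hallucinated, 2 missing:
# precision = 0.75, recall = 0.6, F1 ≈ 0.667
print(factual_f1(3, 1, 2))
```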

Configuring Columns

This metric requires specific columns to be present in the dataset:

  • question (str): The text prompt or query that was input into the model.
  • answer (str): The text response generated by the model.
  • ground_truth (str): The ground truth answer that the generated answer is compared against.

If the above data is not stored under these column names, you can specify different column names for these fields using the parameters question_column, answer_column, and ground_truth_column.

For example, if your dataset has this data stored in different columns, you can pass the following parameters:

params = {
    "question_column": "input_text",
    "answer_column": "output_text",
    "ground_truth_column": "human_answer",
}

If the answer and ground truth are stored as a dictionary in another column, specify the column and key like this:

pred_col = dataset.prediction_column(model)
params = {
    "answer_column": f"{pred_col}.generated_answer",
    "ground_truth_column": f"{pred_col}.ground_truth",
}

For more complex data structures, you can use a function to extract the answers:

pred_col = dataset.prediction_column(model)
params = {
    "answer_column": lambda row: "\n\n".join(row[pred_col]["messages"]),
    "ground_truth_column": lambda row: row[pred_col]["ground_truth"],
}