ynlu.sdk.evaluation package

Submodules

ynlu.sdk.evaluation.entity_confusion_matrix module

ynlu.sdk.evaluation.entity_confusion_matrix.entity_confusion_matrix(utterances: List[str], entity_predictions: List[List[dict]], y_trues: List[List[str]]) → Tuple[numpy.ndarray, List[str]][source]

Confusion Matrix for Evaluating Entity Predictions

Preprocess a list of raw entity predictions and compute the confusion matrix by calling sklearn.metrics.confusion_matrix.

Args:
utterances (a list of strings):
The original inputs that produced the entity_predictions below when calling model.predict().
entity_predictions (a list of entity_predictions):
The entity part of the output when calling model.predict() with the above utterances as input.
y_trues (a list of y_true):
A list of true entity label lists, one per utterance.
Returns:
confusion_matrix (numpy 2D array):
By definition a confusion matrix \(C\) is such that \(C_{i, j}\) is equal to the number of observations known to be in group \(i\) but predicted to be in group \(j\). If the confusion matrix is diagonal, the predictions and true labels match perfectly.
unique_entities (list of strings):
It records which entity label corresponds to each group \(i\), \(j\) of the confusion matrix \(C\).
Examples:
>>> from ynlu.sdk.evaluation import entity_confusion_matrix
>>> confusion_matrix, unique_entities = entity_confusion_matrix(
        utterances=["I like apple."],
        entity_predictions=[
            [
                {"entity": "DONT_CARE", "value": "I like ", "score": 0.9},
                {"entity": "fruit", "value": "apple", "score": 0.8},
                {"entity": "drink", "value": ".", "score": 0.3},
            ],
        ],
        y_trues=[
            [
                "DONT_CARE", "DONT_CARE", "DONT_CARE", "DONT_CARE",
                "DONT_CARE", "DONT_CARE", "DONT_CARE", "fruit",
                "fruit", "fruit", "fruit", "fruit", "DONT_CARE",
            ],
        ],
    )
>>> print(unique_entities)
["DONT_CARE", "fruit", "drink"]
>>> print(confusion_matrix)
[[7 0 1]
 [0 5 0]
 [0 0 0]]
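
The confusion matrix above can be reproduced, roughly, by converting each prediction to character-level labels and delegating to scikit-learn, as the description says. A minimal sketch, assuming the character-level preprocessing from ynlu.sdk.evaluation.utils (the real implementation and its label ordering may differ):

from sklearn.metrics import confusion_matrix as sk_confusion_matrix

from ynlu.sdk.evaluation.utils import preprocess_entity_prediction


def naive_entity_confusion_matrix(utterances, entity_predictions, y_trues):
    y_pred_flat, y_true_flat = [], []
    for utterance, entity_prediction, y_true in zip(
            utterances, entity_predictions, y_trues):
        # Expand each predicted segment into one entity label per character.
        y_pred_flat.extend(
            preprocess_entity_prediction(utterance, entity_prediction))
        y_true_flat.extend(y_true)
    unique_entities = sorted(set(y_true_flat) | set(y_pred_flat))
    matrix = sk_confusion_matrix(y_true_flat, y_pred_flat, labels=unique_entities)
    return matrix, unique_entities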

ynlu.sdk.evaluation.entity_overlapping_score module

ynlu.sdk.evaluation.entity_overlapping_score.entity_overlapping_score(utterances: List[str], entity_predictions: List[List[dict]], y_trues: List[List[str]], wrong_penalty_rate: float = 2.0) → float[source]

Averaged Overlapping Score of all Utterances

Please take a look at the function single__entity_overlapping_score first. This function is simply a batch version of it: it sends all data to that function, then collects and averages the outputs.
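
A minimal sketch of this batch behaviour, written against the single-utterance function documented below (the actual implementation may differ):

from ynlu.sdk.evaluation import single__entity_overlapping_score


def naive_entity_overlapping_score(
        utterances, entity_predictions, y_trues, wrong_penalty_rate=2.0):
    # Score each utterance individually, then average over the batch.
    scores = [
        single__entity_overlapping_score(
            utterance, entity_prediction, y_true, wrong_penalty_rate)
        for utterance, entity_prediction, y_true in zip(
            utterances, entity_predictions, y_trues)
    ]
    return sum(scores) / len(scores)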

ynlu.sdk.evaluation.entity_overlapping_score.single__entity_overlapping_score(utterance: str, entity_prediction: List[dict], y_true: List[str], wrong_penalty_rate: float = 2.0) → float[source]

Overlapping Score of a Single Utterance

Examine the true and predicted entities at the character level, then compute a score that represents the overlap between them.

Args:
utterance (a string):
The input when calling model.predict().
entity_prediction (a list of dictionaries):
The entity part of the output when calling model.predict() with the above utterance as input.
y_true (a list of strings):
The true entity list of that utterance.
wrong_penalty_rate (float, default is 2.0):
The penalty applied to characters whose entity is predicted incorrectly.
Returns:
overlapping score (float):
If wrong_penalty_rate is 2.0, the overlapping score ranges from -1 to 1. A score of -1 means every entity in the utterance is mismatched; a score of 1 means the entities are perfectly matched.
Examples:
>>> from ynlu.sdk.evaluation import single__entity_overlapping_score
>>> overlapping_score = single__entity_overlapping_score(
        utterance="I like apple.",
        entity_prediction=[
            {"entity": "DONT_CARE", "value": "I like ", "score": 0.9},
            {"entity": "fruit", "value": "apple", "score": 0.8},
            {"entity": "drink", "value": ".", "score": 0.3},
        ],
        y_true=[
            "DONT_CARE", "DONT_CARE", "DONT_CARE", "DONT_CARE",
            "DONT_CARE", "DONT_CARE", "DONT_CARE", "fruit",
            "fruit", "fruit", "fruit", "fruit", "DONT_CARE",
        ],
    )
>>> print(overlapping_score)
12 / 13
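
Character-level y_true lists such as the one above can be produced from an annotated utterance with preprocess_annotated_utterance from the utils module documented below, for example:

>>> from ynlu.sdk.evaluation.utils import preprocess_annotated_utterance
>>> y_true = preprocess_annotated_utterance(
        annotated_utterance="I like <fruit>apple</fruit>.",
    )
>>> print(y_true)
["DONT_CARE", "DONT_CARE", "DONT_CARE", "DONT_CARE",
 "DONT_CARE", "DONT_CARE", "DONT_CARE", "fruit",
 "fruit", "fruit", "fruit", "fruit", "DONT_CARE"]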

ynlu.sdk.evaluation.exceptions module

exception ynlu.sdk.evaluation.exceptions.BaseYNLUSDKEvaluationException[source]

Bases: ynlu.sdk.exceptions.BaseYNLUSDKException
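
This exception has no additional documented behaviour of its own; a minimal, hypothetical usage sketch (run_evaluation is a placeholder for whatever evaluation call you make, not an SDK function):

>>> from ynlu.sdk.evaluation.exceptions import BaseYNLUSDKEvaluationException
>>> try:
...     result = run_evaluation()  # hypothetical helper, for illustration only
... except BaseYNLUSDKEvaluationException as error:
...     print("evaluation failed:", error)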

ynlu.sdk.evaluation.intent_accuracy_score_with_threshold module

ynlu.sdk.evaluation.intent_accuracy_score_with_threshold.intent_accuracy_score_with_threshold(intent_predictions: List[List[Dict[str, str]]], y_trues: List[str], threshold: float = 0.5, normalize: bool = True, sample_weight: List[float] = None) → float[source]

Top1 accuracy classification score subject to score > threshold

Only the top-1 predicted intent (by score) is considered when computing accuracy. Moreover, the score of the predicted label must exceed the threshold; otherwise the prediction is replaced with a UNK token before the accuracy score is computed. That is to say, a sample counts as correctly classified only when two requirements are satisfied:

  1. the top-1 predicted label is the same as the true label, and
  2. the score of the predicted label is larger than the threshold.
Args:
intent_predictions (list of list of dicts):
A list of intent_prediction, each of which contains all possible intents sorted by score.
y_trues (list of strings):
A list of ground truth (correct) intents.
threshold (float):
A threshold below which the top-1 predicted intent is treated as invalid (replaced with the UNK token).
normalize (bool):
If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.
sample_weight (list of float):
Sample weights.
Returns:
float:
The fraction of correctly classified samples if normalize == True; otherwise, the number of correctly classified samples.
Examples:
>>> from ynlu.sdk.evaluation import intent_accuracy_score_with_threshold
>>> intent_accuracy_score_with_threshold(
        intent_predictions=[
            [{"intent": "a", "score": 0.7}],
            [{"intent": "b", "score": 0.3}],
        ],
        y_trues=["a", "b"],
    )
0.5
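
A rough sketch of the thresholding logic described above (not the library's exact implementation): mask low-confidence top-1 predictions with an UNK token, then hand the labels to sklearn.metrics.accuracy_score. The precision and recall variants below follow the same masking step, with sklearn's precision_score and recall_score in place of accuracy_score.

from sklearn.metrics import accuracy_score


def naive_intent_accuracy_score_with_threshold(
        intent_predictions, y_trues, threshold=0.5,
        normalize=True, sample_weight=None):
    # Keep the top-1 intent only if its score exceeds the threshold;
    # otherwise replace it with an unknown token that can never match.
    top1_labels = [
        prediction[0]["intent"] if prediction[0]["score"] > threshold else "UNK"
        for prediction in intent_predictions
    ]
    return accuracy_score(
        y_trues, top1_labels, normalize=normalize, sample_weight=sample_weight)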

ynlu.sdk.evaluation.intent_precision_recall module

ynlu.sdk.evaluation.intent_precision_score_with_threshold module

ynlu.sdk.evaluation.intent_precision_score_with_threshold.intent_precision_score_with_threshold(intent_predictions: List[List[Dict[str, str]]], y_trues: List[str], threshold: float = 0.5, average: str = 'weighted', sample_weight: List[float] = None) → float[source]

Top1 precision classification score subject to score > threshold

Only the top-1 predicted intent (by score) is considered when computing precision. Moreover, the score of the predicted label must exceed the threshold; otherwise the prediction is replaced with a UNK token before the precision score is computed. That is to say, a sample counts as correctly classified only when two requirements are satisfied:

  1. the top-1 predicted label is the same as the true label, and
  2. the score of the predicted label is larger than the threshold.
Args:
intent_predictions (list of list of dicts):
A list of intent_prediction, each of which contains all possible intents sorted by score.
y_trues (list of strings):
A list of ground truth (correct) intents.
threshold (float):
A threshold below which the top-1 predicted intent is treated as invalid (replaced with the UNK token).
average (string):
Options are [None, 'binary', 'micro', 'macro', 'samples', 'weighted' (default)]. Please see http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html for more details.
sample_weight (list of float):
Sample weights.
Returns:
float (if average is not None):
Precision of the positive class in binary classification, or the weighted average of the precision of each class for the multiclass task.
array of float (shape = [n_unique_labels]) (if average is None):
The precision of each class.
Examples:
>>> from ynlu.sdk.evaluation import intent_precision_score_with_threshold
>>> intent_precision_score_with_threshold(
        intent_predictions=[
            [{"intent": "a", "score": 0.7}],
            [{"intent": "b", "score": 0.3}],
            [{"intent": "b", "score": 0.8}],
        ],
        y_trues=["a", "b", "a"],
    )
0.333333

ynlu.sdk.evaluation.intent_recall_score_with_threshold module

ynlu.sdk.evaluation.intent_recall_score_with_threshold.intent_recall_score_with_threshold(intent_predictions: List[List[Dict[str, str]]], y_trues: List[str], threshold: float = 0.5, average: str = 'weighted', sample_weight: List[float] = None) → float[source]

Top1 recall classification score subject to score > threshold

Only the top-1 predicted intent (by score) is considered when computing recall. Moreover, the score of the predicted label must exceed the threshold; otherwise the prediction is replaced with a UNK token before the recall score is computed. That is to say, a sample counts as correctly classified only when two requirements are satisfied:

  1. the top-1 predicted label is the same as the true label, and
  2. the score of the predicted label is larger than the threshold.
Args:
intent_predictions (list of list of dicts):
A list of intent_prediction, each of which contains all possible intents sorted by score.
y_trues (list of strings):
A list of ground truth (correct) intents.
threshold (float):
A threshold below which the top-1 predicted intent is treated as invalid (replaced with the UNK token).
average (string):
Options are [None, 'binary', 'micro', 'macro', 'samples', 'weighted' (default)]. Please see http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html for more details.
sample_weight (list of float):
Sample weights.
Returns:
float (if average is not None):
Recall of the positive class in binary classification, or the weighted average of the recall of each class for the multiclass task.
array of float (shape = [n_unique_labels]) (if average is None):
The recall of each class.
Examples:
>>> from ynlu.sdk.evaluation import intent_recall_score_with_threshold
>>> intent_recall_score_with_threshold(
        intent_predictions=[
            [{"intent": "a", "score": 0.7}],
            [{"intent": "b", "score": 0.3}],
            [{"intent": "b", "score": 0.8}],
        ],
        y_trues=["a", "b", "a"],
    )
0.3333

ynlu.sdk.evaluation.intent_topk_accuracy_score module

ynlu.sdk.evaluation.intent_topk_accuracy_score.intent_topk_accuracy_score(intent_predictions: List[List[Dict[str, str]]], y_trues: List[List[str]], k: int = 1) → float[source]

Compute the Accuracy of all utterances with multi-intents

Please take a look at the function single__intent_topk_accuracy_score first. This function is simply a batch version of it: it sends all data to that function, then collects and averages the outputs.

\[\text{Accuracy of all utterances}=\frac{1}{n}\sum_{i=1}^{n}\frac{|\text{pred}_i \cap \text{true}_i|}{|\text{pred}_i \cup \text{true}_i|}\]
ynlu.sdk.evaluation.intent_topk_accuracy_score.single__intent_topk_accuracy_score(intent_prediction: List[Dict[str, str]], y_true: List[str], k: int = 1) → float[source]

Compute the Accuracy of a single utterance with multi-intents

Accuracy of a single utterance is defined as the proportion of correctly predicted labels to the total number of labels, predicted and true (i.e. the size of their union). It can be formulated as

\[\text{Accuracy of single utterance}=\frac{|\text{pred}_i \cap \text{true}_i|}{|\text{pred}_i \cup \text{true}_i|}\]
Args:
intent_prediction (a list of dictionaries):
A sorted intent prediction (by score) of a single utterance.
y_true (a list of strings):
The corresponding true intents of that utterance. Note that there can be more than one intent.
k (an integer):
The number of top-scoring predicted intents to take when computing accuracy.
Returns:
accuracy score (a float):
Accuracy of a single utterance given the top k predictions.
Examples:
>>> intent_prediction, _ = model.predict("I like apple.")
>>> print(intent_prediction)
[
    {"intent": "blabla", "score": 0.7},
    {"intent": "ohoh", "score": 0.2},
    {"intent": "preference", "score": 0.1},
]
>>> accuracy = single__intent_topk_accuracy_score(
    intent_prediction=intent_prediction,
    y_true=["preference", "ohoh", "YY"],
    k=2,
)
>>> print(accuracy)
0.2499999
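
A rough, set-based sketch of the formula above (the precision and recall variants below only change the denominator to the number of true or predicted labels, respectively); it matches the example above, 1/4 = 0.25:

def naive_single_intent_topk_accuracy_score(intent_prediction, y_true, k=1):
    # intent_prediction is already sorted by score, so the first k items
    # are the top-k predicted intents.
    topk_predicted = {item["intent"] for item in intent_prediction[:k]}
    true_labels = set(y_true)
    # |pred ∩ true| / |pred ∪ true|
    return len(topk_predicted & true_labels) / len(topk_predicted | true_labels)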

ynlu.sdk.evaluation.intent_topk_precision_score module

ynlu.sdk.evaluation.intent_topk_precision_score.intent_topk_precision_score(intent_predictions: List[List[Dict[str, str]]], y_trues: List[List[str]], k: int = 1) → float[source]

Compute the precision of all utterances with multi-intents

Please take a look at the function single__intent_topk_precision_score first. This function is simply a batch version of it: it sends all data to that function, then collects and averages the outputs.

\[\text{Precision of all utterances}=\frac{1}{n}\sum_{i=1}^{n}\frac{|\text{pred}_i \cap \text{true}_i|}{|\text{true}_i|}\]
ynlu.sdk.evaluation.intent_topk_precision_score.single__intent_topk_precision_score(intent_prediction: List[Dict[str, str]], y_true: List[str], k: int = 1) → float[source]

Compute the Precision of a single utterance with multi-intents

Precision of a single utterance is defined as the proportion of correctly predicted labels to the total number of true labels. It can be formulated as

\[\text{Precision of single utterance}=\frac{|\text{pred}_i \cap \text{true}_i|}{|\text{true}_i|}\]
Args:
intent_prediction (a list of dictionaries):
A sorted intent prediction (by score) of a single utterance.
y_true (a list of strings):
The corresponding true intents of that utterance. Note that there can be more than one intent.
k (an integer):
The number of top-scoring predicted intents to take when computing precision.
Returns:
precision score (a float):
Precision of a single utterance given the top k predictions.
Examples:
>>> intent_prediction, _ = model.predict("I like apple.")
>>> print(intent_prediction)
[
    {"intent": "blabla", "score": 0.7},
    {"intent": "ohoh", "score": 0.2},
    {"intent": "preference", "score": 0.1},
]
>>> precision = single__intent_topk_precision_score(
    intent_prediction=intent_prediction,
    y_true=["preference", "ohoh", "YY"],
    k=2,
)
>>> print(precision)
0.333333

ynlu.sdk.evaluation.intent_topk_recall_score module

ynlu.sdk.evaluation.intent_topk_recall_score.intent_topk_recall_score(intent_predictions: List[List[Dict[str, str]]], y_trues: List[List[str]], k: int = 1) → float[source]

Compute the Recall of all utterances with multi-intents

Please take a look at the function single__intent_topk_recall_score first. This function is simply a batch version of it: it sends all data to that function, then collects and averages the outputs.

\[\text{Recall of all utterances}=\frac{1}{n}\sum_{i=1}^{n}\frac{|\text{pred}_i \cap \text{true}_i|}{|\text{pred}_i|}\]
ynlu.sdk.evaluation.intent_topk_recall_score.single__intent_topk_recall_score(intent_prediction: List[Dict[str, str]], y_true: List[str], k: int = 1) → float[source]

Compute the Recall of a single utterance with multi-intents

Recall of a single utterance is defined as the proportion of correctly predicted labels to the total number of predicted labels. It can be formulated as

\[\text{Recall of single utterance}=\frac{|\text{pred}_i \cap \text{true}_i|}{|\text{pred}_i|}\]
Args:
intent_prediction (a list of dictionaries):
A sorted intent prediction (by score) of a single utterance.
y_true (a list of strings):
The corresponding true intents of that utterance. Note that there can be more than one intent.
k (an integer):
The number of top-scoring predicted intents to take when computing recall.
Returns:
recall score (a float):
Recall of a single utterance given the top k predictions.
Examples:
>>> intent_prediction, _ = model.predict("I like apple.")
>>> print(intent_prediction)
[
    {"intent": "blabla", "score": 0.7},
    {"intent": "ohoh", "score": 0.2},
    {"intent": "preference", "score": 0.1},
]
>>> recall = single__intent_topk_recall_score(
    intent_prediction=intent_prediction,
    y_true=["preference", "ohoh"],
    k=2,
)
>>> print(recall)
0.5

ynlu.sdk.evaluation.utils module

ynlu.sdk.evaluation.utils.preprocess_annotated_utterance(annotated_utterance: str, not_entity: str = 'DONT_CARE') → List[str][source]

Character Level Entity Label Producer

The named entity of each character is extracted from the XML-like annotation, and the labels are collected in a list that follows the order of the characters in the sentence.

Args:
annotated_utterance (a string):
An utterance with annotations that look like <a>blabla</a>, a special format for labeling named entities in an utterance.
not_entity (a string, default = "DONT_CARE"):
The label assigned to characters that we do not care about.
Returns:
entities (a list of string):
A list of named-entity labels in character level.
Examples:
>>> from ynlu.sdk.evaluation.utils import preprocess_annotated_utterance
>>> preprocess_annotated_utterance(
    annotated_utterance="<drink>Coffee</drink>, please.",
    not_entity="n",
)
>>> ["drink", "drink", "drink", "drink", "drink", "drink", "n",
    "n", "n", "n", "n", "n", "n", "n", "n"]
ynlu.sdk.evaluation.utils.preprocess_entity_prediction(utterance: str, entity_prediction: List[dict], not_entity: str = 'DONT_CARE') → List[str][source]

Character Level Entity Label Producer

The named entity of each character is extracted from the output of the model prediction, and the labels are collected in a list that follows the order of the characters in the sentence.

Args:
utterance (a string):
The input of model.predict().
entity_prediction (a list of dictionaries):
The entity part of the output returned by calling model.predict() with the utterance above as input. Each element in the list contains a segment of the utterance, the predicted entity type, and the confidence of the prediction.
not_entity (a string, default = "DONT_CARE"):
The label assigned to characters that we do not care about.
Returns:
entities (a list of string):
A list of named-entity labels in character level.
Examples:
>>> from ynlu.sdk.evaluation.utils import preprocess_entity_prediction
>>> preprocess_entity_prediction(
        utterance="Coffee, please.",
        entity_prediction=[
            {"value": "Coffee", "entity": "drink", "score": 0.8},
            {"value": ", please.", "entity": "n"}, "score": 0.7},
        ],
        not_entity="n",
    )
>>> ["drink", "drink", "drink", "drink", "drink", "drink", "n",
    "n", "n", "n", "n", "n", "n", "n", "n"]
ynlu.sdk.evaluation.utils.preprocess_intent_prediction_by_threshold(intent_predictions: List[List[Dict[str, str]]], threshold: float = 0.5, unknown_token: str = 'UNK') → List[str][source]

Predicted Intent Overrider

Override the predicted intent with the unknown token when its score is lower than the threshold.

Args:
intent_predictions (a list of intent predictions):
A list of the intent part of the outputs from calling model.predict().
threshold (float, default is 0.5):
The score threshold that determines whether to override a prediction.
unknown_token (a string, default is UNK):
The token used as a replacement for low-confidence intent predictions.
Returns:
output (a list of string):
A list of preprocessed intent labels.
Examples:
>>> from ynlu.sdk.evaluation.utils import preprocess_intent_prediction_by_threshold
>>> preprocess_intent_prediction_by_threshold(
        intent_predictions=[
            [{"intent": "a", "score": 0.3}, {"intent": "b", "score": 0.1}],
            [{"intent": "b", "score": 0.7}, {"intent": "c", "score": 0.3}],
        ],
        threshold=0.5,
        unknown_token="oo",
    )
>>> [["oo", "oo"], ["b", "oo"]]
ynlu.sdk.evaluation.utils.remove_annotation(annotated_utterance: str)[source]
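
No docstring is published for this helper. Judging from its name and from the annotation format used throughout this module (<entity>value</entity>), it presumably strips the XML-like tags and returns the plain utterance. A hypothetical sketch of that behaviour, not the actual implementation:

import re


def naive_remove_annotation(annotated_utterance: str) -> str:
    # Strip XML-like entity tags, e.g. "<drink>Coffee</drink>, please."
    # becomes "Coffee, please."  (inferred behaviour, for illustration only)
    return re.sub(r"</?[^>]+>", "", annotated_utterance)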

Module contents