Failing to detect certain entities when data is sent in bulk

I’ve trained rasa_nlu on the following intent, which has 6 entities:

3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp

I gave sufficient training examples. When I hit the rasa_nlu server with a similar line, I get proper entity and intent predictions with high confidence. But when I put several such lines in an array and hit the server repeatedly in a loop, like:

3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp

143 root 20 0 45228 5436 4892 S 0.0 0.2 0:00.04 wpa_supplicant

27 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 tpm_dev_wq

a few entities are missed for certain examples; rasa_nlu doesn’t pick them up at all. It picks up 4 or 5 entities and misses one or two. This happens seemingly at random: you can’t tell in advance for which examples all the entities will be detected and for which a few will be skipped. Given a data set of around 200 such examples, all 6 entities are detected for no more than 100 of them; for the rest, rasa_nlu misses an entity or two. The confidence is pretty high (> 96%) when all the entities are successfully predicted. I’m using the CRF entity extractor as I have a lot of custom entities to deal with. Please suggest a solution asap. Thank you.
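To see whether the misses really are random, it may help to count which entities get skipped across the whole batch. This is only a rough sketch: it assumes each parsed response is a dict with an `entities` list whose items carry an `entity` key, and the helper names `missing_entities` and `tally_missing` are made up for illustration.

```python
from collections import Counter

# The six entity names used in the training data of this thread.
EXPECTED_ENTITIES = {
    "processId", "userName", "processCpu",
    "processMemory", "processTime", "processCommand",
}

def missing_entities(response):
    """Return the expected entity names absent from one parsed response."""
    found = {ent["entity"] for ent in response.get("entities", [])}
    return EXPECTED_ENTITIES - found

def tally_missing(responses):
    """Count, across all responses, how often each entity was skipped."""
    counts = Counter()
    for response in responses:
        counts.update(missing_entities(response))
    return counts
```

If one or two entity names dominate the tally, the misses are probably tied to particular columns or value patterns instead of being random.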

Hey @AbhishekChandra7 can you post the markdown format for what your training data looks like?

      "rasa_nlu_data": { 

          "common_examples": [

                   {

    "text": "718 systemd+  20   0  141928   3176   2648 S  5.8  0.1   0:00.04 systemd-timesyn",
    "intent": "getProcess",
    "entities": [
      {
        "start": 0,
        "end": 3,
        "value": "718",
        "entity": "processId"
      },
      {
        "start": 4,
        "end": 12,
        "value": "systemd+",
        "entity": "userName"
      },
      {
        "start": 46,
        "end": 49,
        "value": "5.8",
        "entity": "processCpu"
      },
      {
        "start": 51,
        "end": 54,
        "value": "0.1",
        "entity": "processMemory"
      },
      {
        "start": 57,
        "end": 64,
        "value": "0:00.04",
        "entity": "processTime"
      },
      {
        "start": 65,
        "end": 80,
        "value": "systemd-timesyn",
        "entity": "processCommand"
      }
    ]
  },{
    "text": "1130 message+  20   0   50284   4760   3924 S  1.9  0.1   0:00.09 dbus-daemon",
    "intent": "getProcess",
    "entities": [
      {
        "start": 0,
        "end": 4,
        "value": "1130",
        "entity": "processId"
      },
      {
        "start": 5,
        "end": 13,
        "value": "message+",
        "entity": "userName"
      },
      {
        "start": 47,
        "end": 50,
        "value": "1.9",
        "entity": "processCpu"
      },
      {
        "start": 52,
        "end": 55,
        "value": "0.1",
        "entity": "processMemory"
      },
      {
        "start": 58,
        "end": 65,
        "value": "0:00.09",
        "entity": "processTime"
      },
      {
        "start": 66,
        "end": 77,
        "value": "dbus-daemon",
        "entity": "processCommand"
      }
    ]
  }]}

These are only two examples; the actual model was trained on about 30.
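Since the entity spans are given as character offsets into lines with variable spacing, one thing worth ruling out is a misaligned span. A minimal sketch of such a check, assuming examples shaped like the ones above; `check_offsets` is a hypothetical helper, not part of rasa_nlu:

```python
def check_offsets(example):
    """Return the entity names whose start/end do not slice out their value."""
    text = example["text"]
    bad = []
    for ent in example.get("entities", []):
        if text[ent["start"]:ent["end"]] != ent["value"]:
            bad.append(ent["entity"])
    return bad

# First training example from this thread, shortened to two entities.
example = {
    "text": "718 systemd+  20   0  141928   3176   2648 S  5.8  0.1   0:00.04 systemd-timesyn",
    "intent": "getProcess",
    "entities": [
        {"start": 0, "end": 3, "value": "718", "entity": "processId"},
        {"start": 46, "end": 49, "value": "5.8", "entity": "processCpu"},
    ],
}

print(check_offsets(example))  # prints [] when every span matches its value
```

Running this over all 30 training examples would show whether any annotation is off by a character or two.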

OK, so the issue here is that there’s no way for a model to distinguish between all these different numbers and associate them with different entities (in fact, I personally wouldn’t be able to do that either).

What is the use case of this assistant?

It’s not really random: if the numbers all match the pattern, it’ll predict with high confidence. I probably wouldn’t suggest using entities in that way. You can use duckling to extract all these numbers, but you would then need a custom action to decide which slot each one should fill.
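Because every line here has the same column layout, the "decide which slot" step could be as simple as splitting on whitespace and picking fields by position. The sketch below is plain Python, not a Rasa or duckling API. It assumes twelve columns in the order the example lines appear to follow (the two fields before the last two being CPU and memory), and `parse_process_line` is a hypothetical helper name.

```python
# Column index -> entity name, assuming the 12-column layout of the examples.
FIELD_TO_ENTITY = {
    0: "processId",
    1: "userName",
    8: "processCpu",
    9: "processMemory",
    10: "processTime",
    11: "processCommand",
}

def parse_process_line(line):
    """Map one process line to entity values by column position.

    Returns None when the line does not have the expected 12 columns.
    """
    # maxsplit=11 keeps a command containing spaces together in the last field.
    fields = line.strip().split(None, 11)
    if len(fields) != 12:
        return None
    return {entity: fields[index] for index, entity in FIELD_TO_ENTITY.items()}

slots = parse_process_line(
    "143 root 20 0 45228 5436 4892 S 0.0 0.2 0:00.04 wpa_supplicant"
)
print(slots["processCommand"])  # wpa_supplicant
```

A custom action could call something like this on the raw message text and set the slots from the returned dict, leaving the NLU model to handle only the intent.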

Alright, though this takes additional effort, it seems to be the best way ahead for now. But Microsoft’s LUIS is able to do this. Do you reckon it’s using the same process you mentioned above in the back end, or is it a completely different entity extractor altogether?

I’m sure it’s using regex matching somehow as well - what do you mean by LUIS can do this though? In what way?

I’ve trained LUIS on the same statement and am hitting its API endpoint instead of Rasa’s. That gives the desired output (all the entities) for the same data set. I don’t know how it’s doing it, though.