Failing to detect certain entities when data is sent in bulk

I’ve trained rasa_nlu on the following intent, where I have 6 entities:

3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp

I gave sufficient training examples. When I hit the rasa_nlu server with a similar single line, I get proper entity and intent predictions with high confidence, but when I put several such lines in an array and hit the server repeatedly in a loop, like:

3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp

143 root 20 0 45228 5436 4892 S 0.0 0.2 0:00.04 wpa_supplicant

27 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 tpm_dev_wq

a few entities are missed for certain examples; rasa_nlu doesn’t pick them up at all. It picks 4 or 5 entities and misses one or two, and it happens in a seemingly random manner: you can’t say for which examples all the entities will be detected and for which a few will be skipped. Given a data set of around 200 such lines, all 6 entities are detected for no more than 100 of them; for the rest, rasa_nlu misses an entity or two. Yet the confidence is pretty high (> 96%) when all the entities are successfully predicted. I’m using the CRF entity extractor because I have a lot of custom entities to deal with. Please suggest a solution. Thank you.
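To be concrete, the loop is nothing more than one parse request per line, roughly like this (a sketch, not my exact code; the endpoint, port, and payload fields assume the legacy rasa_nlu HTTP server):

import requests

# Parse endpoint of the legacy rasa_nlu server on its default port (assumed).
RASA_NLU_URL = "http://localhost:5000/parse"

lines = [
    "3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp",
    "143 root 20 0 45228 5436 4892 S 0.0 0.2 0:00.04 wpa_supplicant",
    "27 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 tpm_dev_wq",
]

for line in lines:
    # One request per line, exactly like the single-line case that works fine.
    result = requests.post(RASA_NLU_URL, json={"q": line}).json()
    print(result["intent"], [e["entity"] for e in result.get("entities", [])])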

Hey @AbhishekChandra7, can you post the markdown format for what your training data looks like?

      "rasa_nlu_data": { 

          "common_examples": [

                   {

    "text": "718 systemd+  20   0  141928   3176   2648 S  5.8  0.1   0:00.04 systemd-timesyn",
    "intent": "getProcess",
    "entities": [
      {
        "start": 0,
        "end": 3,
        "value": "718",
        "entity": "processId"
      },
      {
        "start": 4,
        "end": 12,
        "value": "systemd+",
        "entity": "userName"
      },
      {
        "start": 46,
        "end": 49,
        "value": "5.8",
        "entity": "processCpu"
      },
      {
        "start": 51,
        "end": 54,
        "value": "0.1",
        "entity": "processMemory"
      },
      {
        "start": 57,
        "end": 64,
        "value": "0:00.04",
        "entity": "processTime"
      },
      {
        "start": 65,
        "end": 80,
        "value": "systemd-timesyn",
        "entity": "processCommand"
      }
    ]
  },{
    "text": "1130 message+  20   0   50284   4760   3924 S  1.9  0.1   0:00.09 dbus-daemon",
    "intent": "getProcess",
    "entities": [
      {
        "start": 0,
        "end": 4,
        "value": "1130",
        "entity": "processId"
      },
      {
        "start": 5,
        "end": 13,
        "value": "message+",
        "entity": "userName"
      },
      {
        "start": 47,
        "end": 50,
        "value": "1.9",
        "entity": "processCpu"
      },
      {
        "start": 52,
        "end": 55,
        "value": "0.1",
        "entity": "processMemory"
      },
      {
        "start": 58,
        "end": 65,
        "value": "0:00.09",
        "entity": "processTime"
      },
      {
        "start": 66,
        "end": 77,
        "value": "dbus-daemon",
        "entity": "processCommand"
      }
    ]
  }]}}

These are only two examples; I have around 30 examples in the actual trained model.

Ok, so the issue here is that there’s no way for a model to distinguish between all these different numbers and associate them with different entities (in fact, I personally wouldn’t be able to do that either).

What is the use case of this assistant?

I thought the same, and that is why I started training it heavily. What I don’t get is the random nature of it: when it predicts successfully, the confidence is pretty high. How can it miss something for which it showed around 96% confidence just in the previous attempt, or the attempt before that?

I’m using this assistant to perform command and file analysis. I find that using NLP is much simpler and more effective than writing regexes for patterns. Linux is a file-based OS, which makes it difficult to pass data around, unlike Windows PowerShell, which uses objects and classes to store the data. I’m trying to build a similar system for Linux, where I use NLP to process tonnes of files together and extract useful information out of them, stored in variables (entities) that can be easily passed around and accessed from anywhere. I’m planning on extending it into a chatbot where you just have buttons to execute file commands. Though it sounds wacky now, I plan on integrating it with Alexa so you can just give voice commands to work your way around the OS.

Adding a new file type would mean just training it on new files rather than writing regexes. I started it based on the fact that it takes just a few examples to train and there’s a UI for training.

And to enhance the predictions further, I’ve put different file types in different models, and I pass the model name every time I make an API call to Rasa. Right now I have around 50 different models, and most of those that have to make fewer predictions (i.e. the array is small) are working absolutely fine. The problem arises when the array is quite large, like 300 or 400 API calls.
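To give an idea of what that routing looks like (a sketch; the file-type keys and model names here are made up, and it again assumes the legacy /parse endpoint):

import requests

# Hypothetical mapping from file type to the rasa_nlu model trained for it.
MODEL_FOR_FILE_TYPE = {
    "top_output": "model_top",
    "fstab": "model_fstab",
    "passwd": "model_passwd",
}


def parse_line(line, file_type):
    """Send one line to the model that was trained for this file type."""
    payload = {"q": line, "model": MODEL_FOR_FILE_TYPE[file_type]}
    return requests.post("http://localhost:5000/parse", json=payload).json()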

It’s not really random: if the numbers all match the pattern, then it’ll predict with high confidence. I probably wouldn’t suggest using entities in that way. You can use duckling to extract all these numbers, but you would then need a custom action to decide which slot each one should fill.
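Something along these lines, as a very rough sketch (it assumes a DucklingHTTPExtractor is in the pipeline so every number in the line comes back as a `number` entity; the action name, slot names, and positional mapping below are all made up for illustration):

# A rough sketch only: assumes duckling returns each number in the line as a
# "number" entity, and that the fixed column order of `top` output tells you
# which number is which. Action name, slot names, and indices are hypothetical.
from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.events import SlotSet
from rasa_sdk.executor import CollectingDispatcher


class ActionFillProcessSlots(Action):
    def name(self) -> Text:
        return "action_fill_process_slots"

    def run(
        self,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> List[Dict[Text, Any]]:
        # Numbers in the order they were extracted from the latest message.
        numbers = [
            e["value"]
            for e in tracker.latest_message.get("entities", [])
            if e.get("entity") == "number"
        ]
        events: List[Dict[Text, Any]] = []
        # For a full `top` row the first number is the PID and the 7th and 8th
        # are %CPU and %MEM (purely positional - adjust for your format).
        if numbers:
            events.append(SlotSet("processId", numbers[0]))
        if len(numbers) >= 8:
            events.append(SlotSet("processCpu", numbers[6]))
            events.append(SlotSet("processMemory", numbers[7]))
        return events

If you stay with plain rasa_nlu over HTTP rather than Core, the same positional mapping could just live in whatever script makes the API calls.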

Alright; though additional effort goes into doing this, it seems to be the best way forward for now. But Microsoft’s LUIS is able to do this. Do you reckon it’s using the same process you mentioned above in the back end, or is it a completely different entity extractor altogether?

I’m sure it’s using regex matching somehow as well. What do you mean by “LUIS can do this”, though? In what way?
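For a fixed-format line like a `top` row, the regex needed to pull out every field is not that bad either; a rough sketch based on the sample lines above, purely for comparison (the whitespace handling and column set would need adjusting for real output):

import re

# Rough pattern for one `top` row, based on the sample lines in this thread.
TOP_ROW = re.compile(
    r"^\s*(?P<processId>\d+)\s+(?P<userName>\S+)\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+"
    r"\s+\S+\s+(?P<processCpu>[\d.]+)\s+(?P<processMemory>[\d.]+)"
    r"\s+(?P<processTime>[\d:.]+)\s+(?P<processCommand>\S+)\s*$"
)

match = TOP_ROW.match("143 root 20 0 45228 5436 4892 S 0.0 0.2 0:00.04 wpa_supplicant")
if match:
    print(match.groupdict())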

I’ve trained LUIS on the same statements and I’m hitting its API endpoint instead of Rasa’s. That’s giving the desired output (all the entities) for the same data set. I don’t know how it’s doing it, though.