Hello Rasa Community! I am trying to integrate a custom NER model into my Rasa NLU pipeline. I trained the model with spaCy by creating an empty pipeline and then adding the tok2vec and ner components (I am using the default configuration from the spaCy usage documentation, "Training Pipelines & Models"). This is the config.cfg file I used:
[paths]
train = null
dev = null
vectors = "en_core_web_lg"
init_tok2vec = null
[system]
gpu_allocator = null
seed = 0
[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}
[components]
[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.WandbLogger.v3"
project_name = "custom_ner_model"
remove_config_values = []
log_dataset_dir = "./corpus"
model_log_interval = 1000
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]
I managed to integrate my custom NER model by following this tutorial https://rasa.com/blog/custom-spacy-3-0-models-in-rasa/ written by @koaning. However, the tutorial replaces the NER component inside a pretrained pipeline, whereas I am creating a new spaCy model, so my pipeline has only two components, “tok2vec” and “ner”, instead of all the components of a pretrained pipeline: “tok2vec”, “tagger”, “parser”, “senter”, “attribute_ruler”, “lemmatizer”, “ner”. This is how I have configured the pipeline in my Rasa config.yml file:
language: en
pipeline:
  - name: SpacyNLP
    model: en_CustomNer
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
    pooling: mean
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 2
    max_ngram: 4
  - name: DIETClassifier
    epochs: 150
    constrain_similarities: true
  - name: SpacyEntityExtractor
  - name: FallbackClassifier
    threshold: 0.1
    ambiguity_threshold: 0.1
My question is: are the SpacyTokenizer and SpacyFeaturizer components tied to my model en_CustomNer? More specifically, how does SpacyFeaturizer compute its dense embeddings? Do they come from my custom model, or are they computed from some spaCy defaults? I am asking because I can’t tell whether using a custom model that has only the “tok2vec” and “ner” components affects the performance of my pipeline. Do I need to change my approach and replace the NER component in a pretrained pipeline (as in the tutorial above), or is this custom NER model fine?
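To make the embedding part of the question concrete: as far as I understand, the `pooling` option (`mean` or `max`) controls how per-token vectors are combined into a single sentence-level vector. Below is a toy sketch of that idea in plain Python. It is not Rasa's implementation; the function names and the 3-dimensional vectors are made up for illustration. What I can't tell is where the real per-token vectors come from when the spaCy model is my custom one.

```python
def mean_pool(token_vectors):
    """Average equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    # zip(*rows) groups the values dimension by dimension.
    return [sum(dim) / len(dim) for dim in zip(*token_vectors)]


def max_pool(token_vectors):
    """Take the per-dimension maximum over all token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    return [max(dim) for dim in zip(*token_vectors)]


# Hypothetical vectors for a two-token message (real ones would be much wider).
toy_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

sentence_mean = mean_pool(toy_vectors)  # what I assume `pooling: mean` does
sentence_max = max_pool(toy_vectors)    # what I assume `pooling: max` does

print(sentence_mean)
print(sentence_max)
```

If this picture is right, the quality of the sentence features depends entirely on the per-token vectors that the loaded spaCy model provides, which is why I want to know whether they are taken from en_CustomNer.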