Confusion on SpacyNLP pipeline

AnisiaOltean · May 1, 2024, 6:12pm

Hello Rasa Community! I am trying to integrate a custom NER model into my Rasa NLU pipeline. I trained the model with spaCy by creating an initially empty pipeline and then adding the tok2vec and ner components (I am using the default configuration described in the spaCy usage documentation, "Training Pipelines & Models"). This is the config.cfg file I used:

[paths]
train = null
dev = null
vectors = "en_core_web_lg"
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.WandbLogger.v3"
project_name = "custom_ner_model"
remove_config_values = []
log_dataset_dir = "./corpus"
model_log_interval = 1000

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

I have managed to integrate my custom NER model by following this tutorial (https://rasa.com/blog/custom-spacy-3-0-models-in-rasa/) written by @koaning. However, the tutorial replaces the NER component in a pretrained pipeline, whereas I am creating a new spaCy model, so my pipeline has only two components, "tok2vec" and "ner", instead of all the components of a pretrained pipeline ("tok2vec", "tagger", "parser", "senter", "attribute_ruler", "lemmatizer", "ner"). This is how I have configured my Rasa pipeline in config.yml:

language: en

pipeline:
- name: SpacyNLP
  model: en_CustomNer
- name: SpacyTokenizer
- name: SpacyFeaturizer
  pooling: mean
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 2
  max_ngram: 4
- name: DIETClassifier
  epochs: 150
  constrain_similarities: true
- name: SpacyEntityExtractor
- name: FallbackClassifier
  threshold: 0.1
  ambiguity_threshold: 0.1
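
To make the `pooling: mean` setting above concrete, here is a minimal sketch of what I assume it does: average the per-token vectors into one sentence-level vector. This is only my reading of the option, not Rasa's actual code, and `mean_pool` is a hypothetical helper name written in plain Python so it runs without spaCy or Rasa.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of equally sized token vectors into one vector.

    Assumption: this mirrors what `pooling: mean` is meant to do for the
    sentence-level dense feature; it is an illustration, not Rasa code.
    """
    if not token_vectors:
        # No tokens, so there is nothing to average.
        return []
    width = len(token_vectors[0])
    if any(len(vec) != width for vec in token_vectors):
        raise ValueError("all token vectors must have the same width")
    count = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(width)]
```

If this reading is right, the quality of the pooled vector depends entirely on where the per-token vectors come from, which is what my question below is about.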

My question is: are the SpacyTokenizer and SpacyFeaturizer components tied to my model en_CustomNer? More specifically, for SpacyFeaturizer, how are the dense embeddings computed? Do they come from my custom model, or are they computed from some spaCy defaults? I am asking because I can't tell whether using a custom model that has only the "tok2vec" and "ner" components affects the performance of my pipeline. Do I need to change my approach and replace the NER component in a pretrained pipeline (as in the tutorial above), or is this custom NER model fine?
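
In case it helps anyone answer, this is a small sketch of how I would inspect the model from the spaCy side. It assumes that whatever the Rasa components consume comes from the `Doc` produced by the loaded pipeline, which is exactly the part I am unsure about. `describe_spacy_model` and the sample sentence are hypothetical names of my own; the attributes it reads (`nlp.pipe_names`, `nlp.vocab.vectors.shape`, `token.has_vector`, `doc.ents`) are standard spaCy ones.

```python
from typing import Any, Dict


def describe_spacy_model(
    nlp: Any, sample_text: str = "book a flight to Berlin"
) -> Dict[str, Any]:
    """Summarise the parts of a loaded spaCy pipeline relevant to dense features.

    `nlp` is expected to be a spaCy Language object, e.g. the result of
    spacy.load("en_CustomNer"); nothing here is specific to Rasa.
    """
    doc = nlp(sample_text)
    # Shape of the static vector table: (number of rows, vector width).
    n_rows, width = nlp.vocab.vectors.shape
    return {
        "pipe_names": list(nlp.pipe_names),
        "static_vectors_shape": (n_rows, width),
        # Which tokens of the sample sentence have a static vector.
        "tokens_with_vector": [(tok.text, bool(tok.has_vector)) for tok in doc],
        # Entities predicted by the custom ner component.
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    }
```

Calling `describe_spacy_model(spacy.load("en_CustomNer"))` after `import spacy` should show whether the packaged model actually contains the en_core_web_lg vectors referenced in `[paths]` and `[initialize]` of my config.cfg. If the vector table were empty, I would expect the dense features to carry no information.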
