How to add a spaCy JA model to a Rasa Docker image

Hello,

I am trying to extend rasa/rasa:2.1.2-spacy-en to include spaCy’s ja_core_news_md model. With the help of rasa-demo and the answer here, I came up with this Dockerfile:

FROM rasa/rasa:2.1.2-spacy-en

# Use subdirectory as working directory
WORKDIR /app

# Change back to root user to install dependencies
USER root

# Refresh the package index first so that gcc can be found
RUN apt-get update && \
    apt-get install -y gcc && \
    apt-get autoremove -y

RUN pip install spacy==2.3.2 spacy-lookups-data --no-cache-dir

RUN python -m spacy download ja_core_news_md && \
    python -m spacy link ja_core_news_md ja

# Switch back to non-root to run code
USER 1001

The resulting image successfully ran a model that I had previously trained on Windows (outside Docker). Notes:

  • I had to update spaCy to 2.3.2 because the ja_core_news_md model first became available with the spaCy 2.3 releases.
  • I had to install gcc because, without it, the build failed with an error from sudachipy.
  • I moved to Docker from non-Docker (Windows, using pyenv then conda) because I want to utilize the WSL2 GPU during training. I am also working across different operating systems.

My issue is the size of the Docker image.

REPOSITORY                                               TAG                           IMAGE ID            CREATED             SIZE
rasa/rasa                                                2.1.2-spacy-en-ja             90f6e904afb4        13 minutes ago      2.39GB
rasa/rasa                                                2.1.2-full                    1ae20eafdcbd        47 hours ago        1.91GB
rasa/rasa                                                2.1.2-spacy-en                31ca289ae941        47 hours ago        1.82GB

The spacy-en-ja image is noticeably bigger than 2.1.2-full (2.39GB vs. 1.91GB). Next, I will try building from the cloned rasa repository with my changes. (I have not done so yet only because my Internet connection has slowed down.)

I would appreciate it if there are any tips to improve this Dockerfile.

Update: I built the spacy-en-ja image by cloning the Rasa GitHub repo.

I then updated the spaCy version in pyproject.toml and added the Japanese model in docker/Dockerfile_pretrained_embeddings_spacy_en. The resulting Docker image is 1.95GB.

REPOSITORY    TAG                 IMAGE ID         CREATED             SIZE
rasa/rasa     2.1.2-spacy-en-ja   4f1a59add4b1     7 minutes ago       1.95GB

Is there a better way to “extend” rasa/rasa:2.1.2-spacy-en?

We’re currently also working on updating our spaCy dependency.

I could also imagine starting from the smallest Rasa image and then just adding your en-ja model. This would avoid having the unused en language model around.
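
A minimal sketch of that idea, under a few assumptions: that the plain rasa/rasa:2.1.2 tag is the smallest image meant here and ships without spaCy, that gcc is only needed while sudachipy is being built, and that installing spaCy 2.3.2 alongside Rasa 2.1.2 works as it did in the Dockerfile above. Doing the install and the cleanup in a single RUN keeps gcc and the apt package lists out of the final layer, which should help with the image size.

```dockerfile
# Assumption: the plain tag is the smallest Rasa image and has no spaCy installed
FROM rasa/rasa:2.1.2

WORKDIR /app

# Change to root user to install dependencies
USER root

# Install, build, and clean up in one layer so gcc does not stay in the image.
# Assumption: gcc is only needed at build time for sudachipy.
RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc && \
    pip install --no-cache-dir spacy==2.3.2 spacy-lookups-data && \
    python -m spacy download ja_core_news_md && \
    python -m spacy link ja_core_news_md ja && \
    apt-get purge -y gcc && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*

# Switch back to non-root to run code
USER 1001
```

If the English model is still needed, it could be added to the same RUN step with another spacy download line, at the cost of a larger image.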