I just ordered an Odroid-XU4 for the following improvements over the Pi:
USB3 - not such an issue with the original NCS, but the NCS2 is enough faster that running it on USB2 means a significant reduction in fps (my i3 has both USB2 and USB3 ports). Also, a decent USB3 memory stick is almost as fast as a SATA drive for external storage.
Gigabit Ethernet - makes a significant difference if you use Samba file sharing. The Pi3B+ is borderline for processing multiple rtsp streams, so all of the Odroid improvements should help. I'll find out after it arrives.
eMMC - significantly faster read/write speeds
2GB RAM - twice what is on the Pi3B+
I'm getting close to having a system that runs on the Pi3, Windows (7 & 10), and Ubuntu. It depends on Python for the AI and image inputs, and uses Node-RED with an MQTT broker local to the AI host for notifications and control. It can use the NCS or CPU-only AI (minimally useful on the Pi3 at ~0.5 fps), and multiple NCS sticks. Often one NCS plus one CPU AI thread hit a "sweet spot" in throughput. Unfortunately the NCS is not supported on Windows; OpenVINO will solve this on Windows 10 (but apparently not on Win7).
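To give a flavor of the notification path, here's a minimal sketch of how a Python AI thread could publish a detection to the local broker for Node-RED to pick up. The topic name and payload layout are just illustration (not my actual code), and it assumes the paho-mqtt 1.x client:

```python
import json
import paho.mqtt.client as mqtt

# Broker runs on the AI host, per the setup described above.
client = mqtt.Client()
client.connect("localhost", 1883, 60)
client.loop_start()

def publish_detection(camera, label, confidence):
    # Topic and payload layout are illustrative; Node-RED subscribes
    # to the topic and handles the notification/control logic.
    payload = json.dumps({"camera": camera,
                          "label": label,
                          "confidence": round(confidence, 3)})
    client.publish("ai/detection", payload, qos=0)

publish_detection("driveway", "person", 0.87)
```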
It can analyze images from "Onvif snapshots" (http://CameraURL/xxx.jpg, meaning if typing your camera's URL into Firefox returns a jpg image, and returns a new image each time the browser window is refreshed, it should work), from rtsp streams, and from jpeg images sent via MQTT message (Node-RED can act as an ftp server and feed the images to the AI via MQTT). For playing around, it can also analyze MP4 input files.
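As an illustration of the snapshot input, here's a sketch of grabbing and decoding one frame (the URL is the placeholder from above; assumes OpenCV and the requests library):

```python
import cv2
import numpy as np
import requests

# Placeholder URL from above -- substitute your camera's snapshot path.
SNAPSHOT_URL = "http://CameraURL/xxx.jpg"

def grab_snapshot(url, timeout=2.0):
    # Each GET should return a fresh jpg if the camera behaves as described.
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()
    # Decode the jpg bytes into an OpenCV BGR image.
    return cv2.imdecode(np.frombuffer(r.content, dtype=np.uint8),
                        cv2.IMREAD_COLOR)

frame = grab_snapshot(SNAPSHOT_URL)
```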
The code resizes images as needed; I've tested it with D1, 720p, 1080p, and 4K camera images. 4K is generally too much of a good thing unless you break the image up into sub-tiles and run multiple AI inferences on the tiles. For my purposes 720p is "optimum", though 1080p is good. The MobileNet-SSD AI is done on a 300x300 pixel input.
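The resize-and-infer step looks roughly like this. This is a sketch using OpenCV's dnn module with the usual scale/mean values for the Caffe MobileNet-SSD model; the model file names are placeholders:

```python
import cv2

# MobileNet-SSD runs on a 300x300 input regardless of camera resolution,
# so every frame is resized down to that before inference.
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")

def detect(frame):
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 scalefactor=0.007843,  # 1/127.5
                                 size=(300, 300),
                                 mean=127.5)
    net.setInput(blob)
    return net.forward()   # detections as a [1, 1, N, 7] array
```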
Once I finish a few minor issues I will share this code on GitHub and post a message here. After that my efforts will go into moving it to OpenVINO, which, while more difficult to set up initially, appears to make it much easier to try different AI subsystems (CPU, GPU, NCS2, NCS) and has many more AI models I can try. Some of the sample code I've found on GitHub and modified a bit is getting ~30 fps with the NCS2, ~11 fps on the NCS, and ~22 fps using two NCS sticks, on 640x480 images from a webcam. So I'm very encouraged.
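For reference, swapping AI subsystems in OpenVINO is essentially a one-line change. A sketch assuming the 2019-era IECore Python API, with placeholder IR file names:

```python
from openvino.inference_engine import IENetwork, IECore

# The same IR model files run on any supported device; only
# device_name changes: "CPU", "GPU", or "MYRIAD" (NCS/NCS2).
ie = IECore()
net = IENetwork(model="mobilenet-ssd.xml", weights="mobilenet-ssd.bin")
exec_net = ie.load_network(network=net, device_name="MYRIAD")

input_blob = next(iter(net.inputs))
# result = exec_net.infer({input_blob: preprocessed_image})
```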
Unfortunately nothing more will happen for at least a week after I post this message, as I have a prior commitment and need to put this project aside.
Your node-red rtsp code could easily ship frames to the AI via MQTT buffers; the main issue is that MQTT and Node-RED appear to try to buffer everything, which will eventually overflow without some way to drop frames. My threads read the stream "continuously" and drop frames if the AI input queue is full. Reducing my Lorex DVR from 15 fps (the default) to 5 fps helped a lot (the threads no longer spin so much only to mostly drop frames) and is still plenty good enough for my purposes on the 24/7 recordings.
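The frame-dropping reader is simple in Python. A stripped-down sketch of the idea (the rtsp URL is a placeholder):

```python
import queue
import threading
import cv2

frame_q = queue.Queue(maxsize=1)   # AI input queue; one slot is enough

def rtsp_reader(url):
    # Read the stream "continuously" so OpenCV's internal buffer never
    # backs up, and drop frames whenever the AI hasn't taken the last one.
    cap = cv2.VideoCapture(url)
    while True:
        ok, frame = cap.read()
        if not ok:
            continue
        try:
            frame_q.put_nowait(frame)
        except queue.Full:
            pass   # AI still busy -- drop this frame

threading.Thread(target=rtsp_reader,
                 args=("rtsp://user:pass@dvr/stream",),
                 daemon=True).start()

# AI loop blocks until a fresh frame is available:
# frame = frame_q.get()
```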
I have discovered that node-red rbe and binary buffers eventually lead to "out of memory" crashes, so I'm not sure how to drop frames on the node-red side. This is one of the remaining items on the TODO list: keep the MQTT input buffer empty and drop frames when necessary if the AI isn't ready for a new one. I think anything that simplifies the image-source side of things is worthwhile.
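On the Python side, the same effect can be had by never queueing MQTT frames at all and just keeping the newest one. A sketch assuming paho-mqtt 1.x, with an illustrative topic name (the Node-RED side remains the open question):

```python
import threading
import paho.mqtt.client as mqtt

# Keep only the newest jpg; each arrival overwrites the last, so stale
# frames are dropped implicitly and nothing accumulates.
latest = {"jpg": None}
lock = threading.Lock()

def on_message(client, userdata, msg):
    with lock:
        latest["jpg"] = msg.payload   # older frame, if any, is dropped

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883, 60)
client.subscribe("camera/frame", qos=0)
client.loop_start()

# AI loop takes whatever is newest, if anything has arrived:
# with lock:
#     jpg, latest["jpg"] = latest["jpg"], None
```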