I have two processes written in C that set up PUSH/PULL
ZeroMQ sockets, and two threads in a Python process that mirror the PUSH/PULL
sockets. There are roughly 80-300 lightweight (<30 byte) messages per second being sent from the C process to the Python process, and 10-30 similar messages going from the Python process to the C process.
I was running these services on 64-bit ARMv8 (Ubuntu based) and AMD64 (Ubuntu 18.04) systems with no noticeable latency. I then ran the exact same services on a 32-bit Linux based system and was shocked to see messages coming through over 30 seconds behind, even after killing the C services. When I checked CPU usage it was a fairly flat 30-40% and didn't appear to be the bottleneck.
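One way to pin down where those 30 seconds go is to stamp each message with the sender's monotonic clock and compute the one-way delay on receipt; both processes run on the same machine, so the monotonic clock is shared. This is a diagnostic sketch, not part of the actual services, and the `stamp`/`latency_of` helper names are made up:

```python
import struct
import time

def stamp(payload):
    """Prefix a small payload with the current monotonic time (8 bytes)."""
    return struct.pack("<d", time.monotonic()) + payload

def latency_of(msg):
    """Return (seconds since the message was stamped, original payload)."""
    sent_at, = struct.unpack_from("<d", msg)
    return time.monotonic() - sent_at, msg[8:]

# Receiver side: how old was this message when recv() handed it to us?
age, body = latency_of(stamp(b"sensor:42"))
```

On the C side the equivalent stamp would come from `clock_gettime(CLOCK_MONOTONIC, ...)`, which is what Python's `time.monotonic()` uses on Linux.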
My ZeroMQ socket settings didn't change between systems: I set LINGER
to 0, I tried RCVTIMEO
values between 0 and 100 ms, and I varied BACKLOG
between 0 and 50, with no difference either way. I also tried using multiple I/O threads and setting socket thread affinity, to no avail. The PUSH
sockets connect on tcp://localhost:#####
and the PULL
sockets bind to tcp://*:#####
. I also tried ipc:///tmp/...
; messages were sent and received, but the latency on the 32-bit system remained.
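For reference, the I/O-thread and affinity experiments described above look roughly like this in pyzmq (the affinity bitmask value and the ipc path are illustrative; the real endpoints are elided above):

```python
import zmq

# Four I/O threads in the context, matching zmq.Context(4) in the service.
ctx = zmq.Context(io_threads=4)

sink = ctx.socket(zmq.PULL)
# ZMQ_AFFINITY is a bitmask over the context's I/O threads; 0b0001 ties
# this socket's work to I/O thread 0. (Illustrative value.)
sink.setsockopt(zmq.AFFINITY, 1)
sink.setsockopt(zmq.LINGER, 0)
sink.setsockopt(zmq.RCVTIMEO, 100)   # milliseconds
sink.setsockopt(zmq.BACKLOG, 0)
# The ipc transport variant we also tried; this path is made up for the sketch.
sink.bind("ipc:///tmp/sink_example")
```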
I investigated the other Python steps between receiving messages, and none of them appears to take more than a millisecond. But when I time socket.recv(0)
itself, it takes as long as 0.02 seconds even when RCVTIMEO
is set to 0 for that socket.
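One way to tell whether that 0.02 s lives in the socket call itself or in delivery is to time recv() in isolation over inproc, so TCP and the other process are out of the picture. A minimal self-contained sketch (endpoint name is made up):

```python
import time
import zmq

ctx = zmq.Context()
pull = ctx.socket(zmq.PULL)
pull.bind("inproc://bench")        # for inproc, bind must happen before connect
push = ctx.socket(zmq.PUSH)
push.connect("inproc://bench")

push.send(b"x" * 30)               # a lightweight message like ours
t0 = time.monotonic()
msg = pull.recv()                  # message is already queued locally
elapsed = time.monotonic() - t0
```

If recv() of an already-queued message is fast here but slow in the real service, the time is going to delivery or scheduling, not the call.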
Any suggestions as to why I would see this behaviour on the new 32-bit platform and not on the other platforms? Am I possibly looking in all the wrong places?
Here's a bit of code to help explain:
The connection and the _recv()
class-method are roughly depicted below:
```python
def _connect(self):
    self.context = zmq.Context(4)
    self.sink = self.context.socket(zmq.PULL)
    self.sink.setsockopt(zmq.LINGER, 0)
    self.sink.setsockopt(zmq.RCVTIMEO, 100)
    self.sink.setsockopt(zmq.BACKLOG, 0)
    self.sink.bind("tcp://*:55755")

def _recv(self):
    while True:
        msg = None
        try:
            msg = self.sink.recv(0)  # blocking or zmq.NOBLOCK, still appears to be slow
        except zmq.ZMQError:
            ...  # meaningful exception handling here
        # This last step, when timed, usually takes less than a millisecond
        if msg:
            msg_dict = utils.bytestream_to_dict(msg)  # unpacking step (negligible)
            if msg_dict:
                self.parser.parse(msg_dict)  # parser is a dict of callbacks, also negligible
```
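Given that messages kept arriving for 30 seconds after the C services were killed, the delay looks like queue buildup rather than per-message transit time. A quick way to check is to drain everything already queued with a non-blocking loop (a diagnostic sketch, not part of the service; `drain` is a name I made up):

```python
import zmq

def drain(sock):
    """Receive every message already queued on `sock` without blocking.

    The length of the returned list is the backlog that had accumulated;
    a large number here points at the consumer falling behind, not at
    the transport being slow.
    """
    backlog = []
    while True:
        try:
            backlog.append(sock.recv(zmq.NOBLOCK))
        except zmq.Again:
            return backlog
```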
On the C process side:

```c
zmq_init(4);
void *context = zmq_ctx_new();

/* Connect the sender */
void *vent = zmq_socket(context, ZMQ_PUSH);

int timeo = 0;
int timeo_ret = zmq_setsockopt(vent, ZMQ_SNDTIMEO, &timeo, sizeof(timeo));
if (timeo_ret != 0)
    error("Failed to set ZMQ send timeout because %s", zmq_strerror(errno));

int linger = 100;
int linger_ret = zmq_setsockopt(vent, ZMQ_LINGER, &linger, sizeof(linger));
if (linger_ret != 0)
    error("Failed to set ZMQ linger because %s", zmq_strerror(errno));

if (zmq_connect(vent, vent_port) == 0)
    info("Successfully initialized ZeroMQ ventilator on %s", vent_port);
else {
    error("Failed to initialize %s ZeroMQ ventilator with error %s",
          sink_port, zmq_strerror(errno));
    ret = 1;
}

...

/* When a message needs to be sent it instantly hits this, where msg is a char* */
ret = zmq_send(vent, msg, msg_len, ZMQ_NOBLOCK);
```
On Docker running on the target 32-bit system, `lstopo - -v --no-io` reports:

```
Machine (P#0 local=1019216KB total=1019216KB HardwareName="Freescale i.MX6 Quad/DualLite (Device Tree)" HardwareRevision=0000 HardwareSerial=0000000000000000 Backend=Linux LinuxCgroup=/docker/d2b0a3b3a5eedb7e10fc89fdee6e8493716a359597ac61350801cc302d79b8c0 OSName=Linux OSRelease=3.10.54-dey+g441c8d4 OSVersion="#1 SMP PREEMPT RT Tue Jan 28 12:11:37 CST 2020" HostName=db1docker Architecture=armv7l hwlocVersion=1.11.12 ProcessName=lstopo)
  Package L#0 (P#0 CPUModel="ARMv7 Processor rev 10 (v7l)" CPUImplementer=0x41 CPUArchitecture=7 CPUVariant=0x2 CPUPart=0xc09 CPURevision=10)
    Core L#0 (P#0)
      PU L#0 (P#0)
    Core L#1 (P#1)
      PU L#1 (P#1)
    Core L#2 (P#2)
      PU L#2 (P#2)
    Core L#3 (P#3)
      PU L#3 (P#3)
depth 0: 1 Machine (type #1)
 depth 1: 1 Package (type #3)
  depth 2: 4 Core (type #5)
   depth 3: 4 PU (type #6)
```
EDIT:
We were able to make the latency disappear on our target machine by disabling nearly all other worker threads.
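Since disabling worker threads is what made the latency disappear, scheduler contention on the four-core i.MX6 is the likely culprit. On Linux a similar effect can be had without killing the workers by pinning the receiver to its own core; a sketch, assuming Linux (`pin_to_core` is a name I made up):

```python
import os

def pin_to_core(core):
    """Restrict the calling process to a single CPU core.

    Leaving the remaining cores to the other worker threads approximates
    the effect we got by disabling them outright.
    """
    os.sched_setaffinity(0, {core})   # 0 = the calling process
    return os.sched_getaffinity(0)
```

The C side could do the same with `pthread_setaffinity_np`, keeping its worker threads off the core serving the ZeroMQ receiver.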