In my last post, I mentioned that I am working on a peer-to-peer network protocol for the RS485 network that I am using to communicate between my “smart walkway lights”. In today’s post, I will continue with a general outline of the approach I plan to take. This is all very nerdy, so if you are not interested in building a peer-to-peer network protocol, you are welcome to skip this post.
What I’m most interested in here is how to prevent a collision between two devices that both want to transmit data at the same time. The simplest way is to allow the collision, but wait for an acknowledgement (ACK) from the recipient. If you don’t get one, you try again.
But the problem is that each time a collision occurs, precious bandwidth is wasted, since both senders will need to resend their data if they really want the recipient to receive it. What is worse, the more often collisions occur, the more likely further collisions become, since other devices will also eventually want to send data. (Think of people leaving a building through a single doorway. The moment some of them get stuck trying to go through at the same time, the people behind them start clogging up, increasing the chance that people behind them start pushing, making it harder for the first set of people to get unstuck… you get the idea. It is better if no one gets stuck and people continue moving smoothly through.)

I’ve been playing with ways to detect the likelihood of a collision before trying to send data, in order to minimize the chance of causing one. The most promising approach I’ve found is based on the idea that the RX (receive) pin on the ATmega328p can also be set up to trigger what is called a “pin change interrupt”. The idea here is that the RX pin is normally high when no device is sending data across the bus. The moment data is sent, the first bit transmitted always pulls the bus low, before the normal 8 bits of the byte and then the stop bit. (This first 0/low bit is referred to as a “start bit”.) There are actually several flavors of serial communication, but I use the standard 8N1: a start bit, 8 data bits, no parity, and 1 stop bit. The start bit is always low and the stop bit is always high.
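As a concrete illustration, here is roughly what arming that pin change interrupt looks like in AVR C. This is my own sketch from my reading of the ATmega328p datasheet, not code from my actual firmware: on the ’328p the RX pin (PD0) doubles as PCINT16, which lives in pin-change group 2. Pin change interrupts fire on both edges, so the ISR checks that the pin is actually low before treating it as a start bit.

```c
#include <avr/io.h>
#include <avr/interrupt.h>

/* Hardware config sketch (my naming): arm a pin change interrupt on RX. */
void arm_rx_pin_change(void) {
    PCICR  |= (1 << PCIE2);    /* enable pin-change group 2 (PCINT23..16) */
    PCMSK2 |= (1 << PCINT16);  /* unmask PD0/RX within that group         */
    sei();                     /* global interrupt enable                 */
}

ISR(PCINT2_vect) {
    if (!(PIND & (1 << PD0))) {
        /* RX just went low: someone's start bit. Mark the bus busy here. */
    }
}
```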

Trivia break: Thus the serial bus actually needs to send 10 bits for every byte of data. So when you calculate the theoretical maximum amount of data you can send down a serial bus like this one, you need to take the baud rate and divide by 10 to get the number of bytes per second. (Not 8!) 
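To make the trivia concrete, a quick back-of-the-envelope in C (the numbers are the ones from this post; the function names are just mine for illustration):

```c
/* 8N1 framing: 1 start bit + 8 data bits + 1 stop bit = 10 bits per byte,
 * so divide the baud rate by 10 (not 8!) to get bytes per second. */
#define BITS_PER_BYTE 10

long bytes_per_second(long baud) {
    return baud / BITS_PER_BYTE;
}

double microseconds_per_byte(long baud) {
    return 1e6 * BITS_PER_BYTE / baud;
}
```

At 38400 baud this works out to 3840 bytes per second, or roughly 260us per byte.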

Since the RX pin can generate an interrupt whenever it changes, and it always changes from high to low at the very beginning of a transmission by another device on the bus, this is a good way to detect that someone else has begun transmitting. Additionally, I already need to process serial data as it is received, so I treat each byte as it arrives as a “bus is still busy” signal. To detect the end of transmission, I simply wait 1ms after the last byte is received. At 38400 baud (my current test-case serial bus speed), this is the time it takes to transmit about 4 bytes of data. (10 bits @38400 bits per second = 10/38400 ~= 0.26ms per byte of data.)
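The busy/idle bookkeeping described above can be sketched in a few lines of C. This is a minimal desktop-testable version, not my actual firmware: the names (`bus_is_busy`, `last_rx_us`, etc.) are mine, and on real hardware the first two functions would be called from the pin change and UART receive interrupts, with the timestamps coming from a microsecond clock.

```c
#include <stdbool.h>
#include <stdint.h>

#define BUS_IDLE_TIMEOUT_US 1000  /* 1ms of silence ~= 4 byte times at 38400 baud */

static volatile bool bus_busy = false;
static volatile uint32_t last_rx_us = 0;

/* Called from the RX pin-change interrupt: the high-to-low edge of a start bit. */
void on_start_bit(uint32_t now_us) {
    bus_busy = true;
    last_rx_us = now_us;
}

/* Called whenever a full byte arrives: the bus is still busy. */
void on_byte_received(uint32_t now_us) {
    last_rx_us = now_us;
}

/* Polled from the main loop: the bus goes idle only after 1ms with no traffic. */
bool bus_is_busy(uint32_t now_us) {
    if (bus_busy && (now_us - last_rx_us) >= BUS_IDLE_TIMEOUT_US)
        bus_busy = false;
    return bus_busy;
}
```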

So using this approach to detect collisions, every microcontroller is always listening for the beginning of a transmission from other devices. They mark the bus as “busy” whenever this first start bit is seen. The bus then stays “busy” until a period of time passes where no data is sent. Only then does the bus go back to being considered “idle”, or free to transmit data on.

So the basic algorithm is:

  • If you have data to send, first check whether the bus is busy.
  • If it is not busy, go ahead and start sending. Hopefully any other device that wants to transmit will detect your transmission before it starts its own, and no collision occurs.
  • If the bus is busy, wait for it to become free again.
  • If you send data and want to be really sure it is received, the recipient should send a response immediately after receiving the packet. The response should be sent before the 1ms timeout occurs, so that no other device starts an unrelated transmission that ties up the bus and prevents the recipient from responding.
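The steps above can be sketched as a carrier-sense send loop with retries. This is an illustrative desktop version under my own naming; the `fake_` functions stand in for the real UART driver, clock, and ACK handling, which on the actual device would be interrupt-driven:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fake environment for a desktop walk-through of the algorithm. */
static int busy_polls_left = 3;   /* bus frees up after a few polls */
static int bytes_sent = 0;
static bool fake_bus_is_busy(void) { return busy_polls_left-- > 0; }
static void fake_uart_send(const uint8_t *d, size_t n) { (void)d; bytes_sent += (int)n; }
static bool fake_ack_received(void) { return true; }  /* recipient answers at once */

/* Carrier-sense send: wait for an idle bus, transmit, then expect the ACK
 * inside the 1ms silence window so no third device cuts in. */
bool send_packet(const uint8_t *data, size_t len, int max_retries) {
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        while (fake_bus_is_busy())
            ;                          /* bus busy: wait for it to free up */
        fake_uart_send(data, len);     /* bus looked idle: start sending   */
        /* On hardware we would poll a microsecond clock for up to 1ms here;
         * the fake ACK arrives instantly. */
        if (fake_ack_received())
            return true;               /* recipient confirmed receipt      */
        /* No ACK: assume a collision or loss, and retry. */
    }
    return false;
}
```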
Are you following so far?

(Yes? Good!) 

Then I have one last thing to add here to confuse things a bit more. While my goal is to minimize the chance of “data transmit” collisions, collisions will nonetheless occur if two devices try to transmit at exactly the same moment. (Plus or minus a microsecond or two, I think. This is a discussion unto itself.) In this case, both devices will check the bus, find it “not busy”, and start their transmissions at the same time. More precisely, if both try to transmit within the window between when one device starts to transmit and when the other detects it, they will both begin transmission and a collision will occur.
Under normal circumstances, it is unlikely that two devices will both decide to transmit at exactly the same microsecond. To illustrate this point, think about an example use case. Imagine that the devices can sense their own battery voltage and send a transmission whenever the voltage drops below a “low voltage” threshold. Something like calling out “Warning! My battery voltage is getting low!” What is the probability that the battery voltage on two different devices will fall below the threshold within 1 microsecond of each other? Not very likely. 
But the problem here is that external events can bring devices into a synchronized state. And one very common external event that would synchronize them is any time the bus is busy because someone else is transmitting. If two different devices happen to come up with anything to say while a third device is transmitting a packet, then they both will wait exactly the same amount of time after the last byte is transmitted before they will try to transmit. Thus, this “end of transmission” event acts as a synchronization between both devices and greatly increases the chance that they will try to transmit their own data within 1us of each other. 
For reference, the largest packet size in my protocol is 265 bytes. At 38400 baud, this takes about 69,000us to send. So it is certainly possible that two or more devices would come up with something to send during the 69,000us that the bus is busy sending a long packet.
My solution to this is pretty simple though. Each device waits an additional random amount of time after the 1ms timeout occurs before they begin transmitting. This additional amount of time is chosen randomly between 0us and maybe 1000us (1ms). So any time an external event is likely to synchronize whatever devices might want to send data, we should add a small random wait in order to (probably) get them out of sync enough that one of them will go first and the other one will detect it and wait for its own turn to send afterward.
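A minimal sketch of that wait calculation in C (my own naming, and the 0–1000us range is the guess from this post, not a tuned value):

```c
#include <stdint.h>
#include <stdlib.h>

#define IDLE_TIMEOUT_US 1000  /* the fixed 1ms end-of-transmission silence */
#define MAX_BACKOFF_US  1000  /* extra random slice: 0..1000us             */

/* Total wait after the bus looks free: the fixed timeout plus a random
 * backoff, so devices synchronized by the same end-of-transmission event
 * will (very probably) pick different moments to start sending. */
uint32_t wait_before_send_us(void) {
    return IDLE_TIMEOUT_US + (uint32_t)(rand() % (MAX_BACKOFF_US + 1));
}
```

With these constants, every device waits somewhere between 1000us and 2000us after the last byte before transmitting.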

I’m sure I’m not the first person to do this, so I’m most likely reinventing the wheel here. But I didn’t have any good leads on simple approaches, so the above seemed to make sense. If I get stuck somewhere along the way implementing this, I’ll have to start digging around in the literature to understand why. But until then, this approach seems to make sense. Easy enough to implement, but likely to be pretty effective and efficient.

Note to self: the random wait time must be recalculated after every packet receipt. Otherwise, if the bus is very busy, any device that happens to roll the dice and get a very high wait time will be cut in front of, over and over again, by other devices that roll lower wait times. Possibly indefinitely. But if each device rolls the dice for a new wait time after each packet, then eventually each device will roll a low wait time, and each will eventually get a turn, even when the bus is very congested. More on this later.
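That re-roll rule is small enough to sketch directly (again my own naming, assuming the same 0–1000us range as above):

```c
#include <stdint.h>
#include <stdlib.h>

uint32_t my_backoff_us;  /* this device's current random wait, 0..1000us */

/* Draw a fresh backoff. Call once at startup and again after EVERY packet
 * seen on the bus: keeping a stale high draw would let luckier devices cut
 * in front of this one indefinitely on a congested bus. */
void reroll_backoff(void) {
    my_backoff_us = (uint32_t)(rand() % 1001);
}
```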