A few months back, I saw this post from my friend Professor Montri Karnjanadecha regarding the timing of an Arduino Nano pin. In it, he uses an oscilloscope to test what kind of speeds you can get from alternately pulling a pin high and low.

But as he points out, he doesn’t get as much speed as he could due to the Arduino main loop overhead.

I’ve never been a big fan of Arduino personally. I can understand why people like it, and good for them. It is a great platform to begin exploring microcontrollers. But once you have a feel for the hardware (and particularly if you have a strong background in C/C++ programming), it feels to me like it takes a bit too much control away for not so much gain.

And Montri’s post is an example of this. The fastest waveform speed he can get with digital I/O writes is ~1.1MHz using a 16MHz ATmega328.

For my test, I am running the same ATmega328p board that I am using for the “smart lights” running at 8MHz. (Bear in mind that since I am using the internal 8MHz RC oscillator and he is running with an external 16MHz crystal oscillator, I am running at half the clock speed, which means half the speed on everything.) Also, I am using the standard avr-g++ compiler installed on Ubuntu from the ‘gcc-avr’ Debian package. (“sudo apt-get install gcc-avr”)

Here is the equivalent code to what Montri wrote on Arduino using the more “bare bones” C approach:

#include <avr/io.h>

int main( void )
{
    // set-up PB6 to be an output pin
    DDRB |= _BV(6);
    // do main loop: pull PB6 high, then low, forever
    while(true)
    {
        PORTB |= _BV(6);
        PORTB &= ~_BV(6);
    }
}

You can see that this code is not significantly more complex than the Arduino code.
To compile it, you can use the following commands (copied from my Makefile, which I would recommend using if you know how to use GNU make, but they work fine from the command line also):

avr-g++ -O2 -mmcu=atmega328p -Wall -DF_CPU=8000000 \
      -o pin_switch_test_firmware.out \
      pin_switch_test_firmware.c++ \
      -Wl,-Map=pin_switch_test_firmware.map -static
avr-objcopy -R .eeprom -O ihex \
      pin_switch_test_firmware.out \
      pin_switch_test_firmware.hex

This can be flashed to the microcontroller using avrdude and a USB avrispv2 compatible programmer (“sudo apt-get install avrdude”):

avrdude -p m328p -P usb -c avrispv2 \
      -U flash:w:pin_switch_test_firmware.hex

Here is a picture of the scope on this test running the above program:

This is about a 1.4MHz waveform. Despite running at half the clock speed, it is already faster than the Arduino Nano running similar code. While it is less asymmetrical compared with Montri’s test case, you can still see how it is low longer than it is high. This is due to the “while( true )” adding time.

It can be useful to check out the assembly code created by avr-g++ to understand timing. The way to do that is to pass “-S” to the compiler, which writes the assembly to a “.s” file instead of an object file:

avr-g++ -S -O2 -mmcu=atmega328p -Wall -DF_CPU=8000000 \
      -c pin_switch_test_firmware.c++

Looking at the output assembly code file, the main function compiles to:

.L__stack_usage = 0
sbi 0x4,6
.L2:
sbi 0x5,6
cbi 0x5,6
rjmp .L2

You can see how the line that pulls PB6 high in C (“PORTB |= _BV(6);”) becomes one line in assembly (“sbi 0x5,6”). This is useful to know, since in C it actually looks like it could be as many as 4 operations:
  1. Take the number 6, and use the _BV() macro to turn it into the binary number 0b01000000 (a one in bit position 6, counting from zero)
  2. Load the value of PORTB
  3. OR the values from #1 and #2 above
  4. Store the result from #3 back into PORTB
But you can see that the compiler is smart enough to know that all of this reduces to the single assembly instruction “sbi”. The assembly line means:
  1. set bit number 6 in register 0x5 (PORTB)
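Written out by hand, the four C-level steps listed above would look something like the sketch below. It is plain C++ that also compiles on a PC: “port” is an ordinary variable standing in for PORTB, and the function names are made up for illustration.

```cpp
#include <stdint.h>

// Step 1: build a mask with only the requested bit set (this is what _BV() expands to).
constexpr uint8_t bit_mask(uint8_t bit)
{
    return static_cast<uint8_t>(1u << bit);
}

// Steps 2-4 written out one at a time: load, OR, store.
inline void set_bit_long_form(volatile uint8_t &port, uint8_t bit)
{
    const uint8_t mask = bit_mask(bit);          // 1. compute the mask
    uint8_t value = port;                        // 2. load the current value
    value = static_cast<uint8_t>(value | mask);  // 3. OR in the mask
    port = value;                                // 4. store the result back
}

// The compact form, equivalent to "port |= mask"; same effect as the long form.
inline void set_bit_short_form(volatile uint8_t &port, uint8_t bit)
{
    port = static_cast<uint8_t>(port | bit_mask(bit));
}
```

On the AVR, the single sbi is only possible because PORTB sits in the low I/O address range that sbi and cbi can reach, and because the bit number is known at compile time. For other registers or a run-time bit number, expect something closer to the long form in the “-S” output.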
But while the set bit and clear bit instructions take an equal amount of time, the rjmp back to .L2 to start all over again adds extra time before the bit gets set again, so the pin stays low longer than it stays high.
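To put rough numbers on that asymmetry, here is a back-of-the-envelope sketch that runs on a PC, not on the microcontroller. It assumes the cycle counts listed in the AVR instruction set documentation (2 clocks each for sbi, cbi and rjmp on this core) rather than anything measured, and the constant and function names are made up for illustration.

```cpp
#include <stdio.h>

// Assumed cycle counts, taken from the AVR instruction set documentation
// rather than measured: sbi, cbi and rjmp each take 2 clocks on this core.
constexpr unsigned SBI_CLOCKS  = 2;
constexpr unsigned CBI_CLOCKS  = 2;
constexpr unsigned RJMP_CLOCKS = 2;

// The pin is high from the end of sbi until the end of the following cbi.
constexpr unsigned high_clocks() { return CBI_CLOCKS; }

// It then stays low while the rjmp and the next sbi execute.
constexpr unsigned low_clocks() { return RJMP_CLOCKS + SBI_CLOCKS; }

// Frequency of the resulting waveform for a given CPU clock.
constexpr double loop_frequency_hz(double f_cpu_hz)
{
    return f_cpu_hz / (high_clocks() + low_clocks());
}

// Print the breakdown for a given CPU clock, e.g. print_loop_timing(8000000.0).
inline void print_loop_timing(double f_cpu_hz)
{
    printf("high %u clocks, low %u clocks, %.0f Hz\n",
           high_clocks(), low_clocks(), loop_frequency_hz(f_cpu_hz));
}
```

For an 8MHz clock this works out to 2 clocks high, 4 clocks low and about 1.33MHz, which is in the same ballpark as the scope reading (the internal RC oscillator is only nominally 8MHz).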
This can be improved a little bit by “loop unrolling”. Instead of a single set/clear inside of the while loop, you can have a whole bunch of them (let’s say 16 of them in a row). That way, the rjmp is executed only once every 16 waveform cycles. This speeds things up on average, but once every 16 waveform cycles, it will take a little longer on the low side of the waveform, which can still cause problems. Nonetheless, it can look pretty good on average:
You can see that it is now up to about 2MHz. Since it is an 8MHz clock, this means that the set-bit takes two clocks and the clear-bit takes two clocks, for a total of 4 clocks per waveform cycle. So the answer to the question in the title of this blog post, “how fast is an AVR GPIO pin change?”, is: it takes two clock cycles.
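For reference, the unrolled loop can be sketched with a few helper macros. This is a minimal sketch and not necessarily the exact code behind the scope picture: the PULSE macros and pulse_burst_16() are made-up names, and the #else branch only provides host-side stand-ins for DDRB, PORTB and _BV so the snippet can also be compiled on a PC.

```cpp
#ifdef __AVR__
#include <avr/io.h>
#else
// Host-side stand-ins (illustration only) so the sketch compiles without avr-libc.
#include <stdint.h>
uint8_t DDRB = 0, PORTB = 0;
#define _BV(bit) (1 << (bit))
#endif

// Hypothetical helper macros: one high/low pulse on PB6, repeated 4 and 16 times.
#define PULSE()    do { PORTB |= _BV(6); PORTB &= ~_BV(6); } while (0)
#define PULSE_4()  do { PULSE(); PULSE(); PULSE(); PULSE(); } while (0)
#define PULSE_16() do { PULSE_4(); PULSE_4(); PULSE_4(); PULSE_4(); } while (0)

// Emit 16 pulses back to back, with no jump in between.
static inline void pulse_burst_16()
{
    PULSE_16();
}

#ifdef __AVR__
int main( void )
{
    // set-up PB6 to be an output pin
    DDRB |= _BV(6);
    // the jump back is now paid once per 16 pulses instead of once per pulse
    while(true)
    { pulse_burst_16(); }
}
#endif
```

With -O2 this should come out as 16 sbi/cbi pairs followed by a single rjmp, but it is worth confirming with “-S”. Taking two clocks each for sbi and cbi, and assuming two clocks for rjmp (per the AVR instruction set documentation), one pass costs 16 × 4 + 2 = 66 clocks. At 8MHz that is an average of about 1.94MHz, consistent with the roughly 2MHz on the scope.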
Going back to the question of Arduino vs straight avr-gcc…
The Arduino framework might make things a little easier, but it really isn’t all that hard to do in straight avr-gcc either. And with the flexibility of direct access to the compiler, you can look at the assembly, understand what is going on, and get complete control over your device. In this case, we end up with almost double the frequency (about 2MHz versus ~1.1MHz) from a microcontroller running at half the clock speed, which works out to nearly a 4x speedup per clock cycle.
Additionally, the Arduino framework restricts the kinds of devices and device configurations that you can run. For example, I believe it can only run with the external 16MHz crystal. But at 8MHz or even 1MHz, you need less hardware, and it runs with less power. And in the case of the “smart lights” it frees up the 2 external crystal pins to use for other purposes. (This was the reason we didn’t use the 16MHz crystal on this project. This circuit supports a lot of external hardware so we used pretty much every pin available. More on that in another post.)
But things really start looking better once you take advantage of the hardware directly. These ATmega microcontrollers all have PWM hardware built in. If you find a library that wraps the hardware for you in just the way you need, go ahead and use it. It should work equally well on Arduino and straight avr-gcc. I wrote one for myself, but since it isn’t publicly distributed, I will ignore it for this blog post.
Without such a library, you need to start reading the Atmel datasheets. Atmel writes really good datasheets, and the ATmega328p one can be found here:
Chapter 15 covers the Timer1 interface, which pretty much tells you everything you need to know about how to set up the PWM. Here is my code:

#include <avr/io.h> 

//    timer1 constants
#define MODE_FAST_PWM_8BIT        5
#define MODE_FAST_PWM_ICR1_TOP    14
#define COM_CLEAR_ON_MATCH         2
#define CLKSLCT_CLKIO_DIV_1        1
 

int main( void )
{
    //————————————
    // enable PWM1
   
    // set OC1A(=PB1) to be an OUTPUT pin
    //    and pull pin low
    DDRB |= _BV(1);
    PORTB &= ~_BV(1);
   
    // set OCR1A compare output mode
    TCCR1A &= ~( _BV(COM1A1) | _BV(COM1A0) );
    TCCR1A |= ( COM_CLEAR_ON_MATCH << COM1A0 );
   
    // set waveform generation mode
    TCCR1A &= ~( _BV(WGM11) | _BV(WGM10) );
    TCCR1B &= ~( _BV(WGM13) | _BV(WGM12) );
    TCCR1A |= ( (MODE_FAST_PWM_ICR1_TOP&0x3) << WGM10 );
    TCCR1B |= ( ((MODE_FAST_PWM_ICR1_TOP>>2)&0x3)
                     << WGM12 );
    // set timer counter to 0
    TCNT1 = 0;
       
    // set TOP
    ICR1 = 1;
   
    // set match value
    OCR1A = 0;
 

    // start the PWM clock
    TCCR1B &= ~( _BV(CS12) | _BV(CS11) | _BV(CS10) );
    TCCR1B |= ( CLKSLCT_CLKIO_DIV_1 << CS10 );

    while( 1 ); 

}


It is clearly more lines of code. But it is worth it to get the waveform loop out of the software and into the hardware. For the above program, once the setup is done, the processor can move on to other computations and calculations and the waveform will continue at full speed with no interruptions. The idea is that the first section sets the PWM output pin to be an output pin and pulls it low (the PWM hardware will then pull it up). The CLEAR_ON_MATCH part says that we should reset the pin high at the beginning of every cycle and automatically pull it low whenever the timer (TCNT1) matches the match value (OCR1A). The waveform generation mode is set to “fast PWM that resets whenever TCNT1 matches ICR1”. Then we initialize the counter (TCNT1) to 0, followed by setting the TOP value (ICR1) to 1. We do our match at OCR1A=0. 
Finally, we start the timer at full cpu speed (clkIO/1).
At this point, the OC1A pin will rise every time TCNT1 resets back to zero and fall at the end of the cycle where TCNT1 matches OCR1A (0). The counter then resets after the cycle where it matches ICR1 (1). Thus the pin spends one cycle high followed by one cycle low, and repeats.
Note that because all of this happens in hardware, we don’t have any issue of the “rjmp” command adding extra time.
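The waveform generation mode is the fiddly part of the setup: it is a 4-bit number (14 here) whose low two bits live in TCCR1A and whose high two bits live in TCCR1B. As a sanity check, the sketch below works out what the two registers should end up holding, assuming both start from their reset value of zero. It runs on a PC. The bit positions are the ones given in the ATmega328p datasheet, renamed with a BIT_ prefix, and the function names are made up for illustration.

```cpp
#include <stdint.h>

// Bit positions within TCCR1A / TCCR1B, as given in the ATmega328p datasheet.
// Renamed with a BIT_ prefix so they cannot clash with avr-libc's own definitions.
constexpr uint8_t BIT_WGM10  = 0;  // TCCR1A
constexpr uint8_t BIT_COM1A0 = 6;  // TCCR1A
constexpr uint8_t BIT_CS10   = 0;  // TCCR1B
constexpr uint8_t BIT_WGM12  = 3;  // TCCR1B

// TCCR1A holds the compare output mode and the low two bits of the waveform mode.
constexpr uint8_t tccr1a_value(uint8_t wgm_mode, uint8_t com_mode)
{
    return static_cast<uint8_t>((com_mode << BIT_COM1A0)
                              | ((wgm_mode & 0x3) << BIT_WGM10));
}

// TCCR1B holds the high two bits of the waveform mode and the clock select.
constexpr uint8_t tccr1b_value(uint8_t wgm_mode, uint8_t clock_select)
{
    return static_cast<uint8_t>((((wgm_mode >> 2) & 0x3) << BIT_WGM12)
                              | (clock_select << BIT_CS10));
}

// Mode 14 (fast PWM, TOP = ICR1), clear on match (2), clkIO/1 (1).
static_assert(tccr1a_value(14, 2) == 0x82, "TCCR1A for the setup above");
static_assert(tccr1b_value(14, 1) == 0x19, "TCCR1B for the setup above");
```

In other words, under that assumption the register setup boils down to TCCR1A = 0x82 and TCCR1B = 0x19. The longer clear-then-set form in the program has the advantage of not depending on what the registers held beforehand.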
Here is a picture:
As you can see, at 8MHz CPU clock and spending 1 cycle low and 1 cycle high, we get a 4MHz waveform. With no funny asymmetries at all. That’s as fast and clean as you will ever get from PWM on this device.
Now if you want to do more normal pulse width modulation, you can increase ICR1 up to 0xff or even 0xffff (or anything else between 0 and 0xffff), and adjust OCR1A between 0 and ICR1 to change pulse width. The larger ICR1 is, the more precision you get in pulse width, but the lower the speed.
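The trade-off between precision and speed can be estimated with the fast PWM frequency formula from the datasheet, f = f_clk / (N × (1 + TOP)), where N is the prescaler. The sketch below runs on a PC and uses made-up function names. The duty-cycle helper assumes the pin is high for OCR1A + 1 timer clocks out of TOP + 1, which matches the one-high, one-low behaviour described above.

```cpp
#include <stdint.h>

// Fast PWM output frequency: f_clk / (N * (1 + TOP)), where N is the prescaler.
constexpr double fast_pwm_frequency_hz(double f_clk_hz, unsigned prescaler, uint16_t top)
{
    return f_clk_hz / (prescaler * (1.0 + top));
}

// Fraction of each period the pin is high in the clear-on-match setup,
// assuming it is high for (OCR1A + 1) timer clocks out of (TOP + 1).
constexpr double fast_pwm_duty(uint16_t ocr, uint16_t top)
{
    return (ocr + 1.0) / (top + 1.0);
}
```

At 8MHz with no prescaling, ICR1 = 1 and OCR1A = 0 give 4MHz at 50% duty. ICR1 = 0xff gives 31.25kHz with 256 possible pulse widths, and ICR1 = 0xffff gives roughly 122Hz with 65536.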
Since this exercise was all about seeing how fast we can get, we go for the fastest 50% duty cycle waveform possible, which will occur at ICR1 = 1.
Hopefully I haven’t scared you off with a bunch of register writes and bitwise OR, NOT, and AND operations. It actually isn’t very difficult once you get the hang of it. Try it out!