TL;DR

  • Fixed fractional factor upscaling method
  • Targeted at real-time frame upscaling
  • Combines nearest-neighbor with local interpolation
  • Well-suited for pixel art and retro game visuals
  • Optimized for resource-constrained embedded systems
  • See accompanying images and example code

Introduction

I came across the problem of upscaling retro game image/frame data in real-time. While integer upscaling is trivial, fractional upscaling becomes challenging when working with limited resources (e.g., slow floating-point arithmetic). Additionally, the algorithms commonly used for upscaling photos are not well-suited for low-resolution game content or pixel art.

The following image demonstrates what happens when you upscale a 160×144 image by a factor of 1.5 using standard methods. The nearest neighbour approach preserves the overall appearance but introduces visual artifacts. In contrast, bilinear interpolation smooths the image, resulting in washed-out colors and a blurry look.

Upscaling algorithm comparison

There is also a problem when using nearest-neighbor scaling in applications that feature thin lines with high contrast against the background. When these lines move between frames, a flickering effect can occur. I recorded a video to demonstrate this on a real TFT display. The footage was captured at 120 fps and slowed down to 0.25× the original speed. Have a look, for example, at the moving platforms and the background trees.

What I came up with is somewhat of a hybrid approach. First, the scaled original image is overlaid on the target pixel grid. If a pixel from the original image fully covers a target pixel, we simply use that color. If two or four source pixels intersect within a target pixel, we compute the average of their colors.

Example for 1.5x upscaling

I’m not sure if this is a known method - so far, I haven’t found any information or a name for it. A fitting description might be something like “nearest neighbour with local bilinear interpolation,” “nearest neighbour with edge averaging,” or “nearest neighbour with edge blurring.” For simplicity, I’ll refer to it as “Edge Smoothing”.

In the following, I will outline an efficient implementation of the proposed method, tailored for resource-constrained systems.

Idea and Implementation

While experimenting with pixel data, I noticed some interesting properties when using scaling factors expressed as fractions a/b, especially when b = 2^n, with a, n ∈ ℕ.

a/2^n (e.g. factors like 1.5, 1.375, 1.125, 2.75)

  • The pixel boundaries will meet every a target pixels, respectively every b source pixels, when laying the scaled source pixel grid over the new one. This results in a repeating pattern of a×a pixels.

    • This observation is somewhat obvious, but important for the algorithm
    • A repeating pattern allows some pre-calculation
    • A repeating pattern can save a lot of memory when doing precalculations
  • The area covered by a scaled source pixel on a target pixel will always be a fraction of the form 1/2^x when b has been chosen as b = 2^n

    • This allows efficient division by shifting
    • and can be used to efficiently calculate an average color of overlapping pixels

I think a visual representation will be helpful:

Upscaling 1.5x

The grey pixels represent the source pixels scaled by a factor of 1.5, or 3/2^1 (a=3, n=1).

The blue pixels represent the unscaled target pixels. The yellow square marks the repeating pattern. This works the same for other factors but gets a bit messy to show as an example:

Upscaling 2.75x

With the pattern we are able to precalculate which pixels of the source are relevant for a pixel on the target, using just an offset at runtime. Every a rows, respectively columns, we apply the same pattern with a different offset into the source pixels.

Example 1

Let’s say we want to upscale a given source graphic from 160×144 px to 240×216 px.

  • The scaling factor is 1.5, or 3/2^1 (a=3, n=1), i.e. a=3, b=2.
  • We have a 3×3 repeating pattern.
  • As most image data comes in a linear buffer, we will work line by line.
    • We are using zero-based indices.
    • The upper-left pixel is stored at position 0 in a one-dimensional array.
    • Line y starts at offset width*y
    • This concept applies to both the source and the destination.
  • The pattern is hardcoded to allow for optimization and to avoid unnecessary calculations.

Setting a pixel in the target array is done using macros (defines) which assign pixel values based on one, two, or four different pixels relative to the current source pixel offset. In this case, we only use equal distributions of the pixels: 100%, 50%/50%, and 25%/25%/25%/25%.

We apply the first row of our pattern to the first three pixels of the target data. Then we add an offset of b to the source pixel x-offset and repeat the process until the first line of the target data is complete.

After resetting the x-offset, we apply the second row of the pattern to generate our destination data, and then repeat with the third row. By adding an offset of b to the source pixel y-offset, we shift the pattern to match the next three lines.

After repeating this for every row in the target array, we end up with our upscaled frame.

The advantage of carefully choosing the denominator as 2^n becomes evident in the mix_2 and mix_4 functions: we don’t need any floating-point arithmetic.

#define SRC_PX(x,y) psSrc[(y)*uiSrcWidth+(x)]
#define SRC_OFF_X(x) (uiSrcOffsetX + (x))
#define SRC_OFF_Y(y) (uiSrcOffsetY + (y))

#define DST_SET_1_100(sx, sy) psDst[uiDstPos++] = SRC_PX(SRC_OFF_X(sx), SRC_OFF_Y(sy))
#define DST_SET_2_50_50(sx0, sy0, sx1, sy1) mix_2(&psDst[uiDstPos++], &SRC_PX(SRC_OFF_X(sx0),SRC_OFF_Y(sy0)), &SRC_PX(SRC_OFF_X(sx1),SRC_OFF_Y(sy1)))
#define DST_SET_4_25_25_25_25(sx0, sy0, sx1, sy1, sx2, sy2, sx3, sy3) mix_4(&psDst[uiDstPos++], &SRC_PX(SRC_OFF_X(sx0),SRC_OFF_Y(sy0)), &SRC_PX(SRC_OFF_X(sx1),SRC_OFF_Y(sy1)), &SRC_PX(SRC_OFF_X(sx2),SRC_OFF_Y(sy2)), &SRC_PX(SRC_OFF_X(sx3),SRC_OFF_Y(sy3)))

static inline void mix_2(uint16_t* const puwDst, const uint16_t* const puwSrc0, const uint16_t* const puwSrc1)
{
  *puwDst = (((*puwSrc0) & 0x1F) + ((*puwSrc1) & 0x1F))>>1;
  *puwDst |= (((((*puwSrc0)>>5) & 0x3F) + (((*puwSrc1)>>5) & 0x3F))>>1)<<5;
  *puwDst |= ((((*puwSrc0)>>11) + ((*puwSrc1)>>11))>>1)<<11;
}

static inline void mix_4(uint16_t* const puwDst, const uint16_t* const puwSrc0, const uint16_t* const puwSrc1, const uint16_t* const puwSrc2, const uint16_t* const puwSrc3)
{
  *puwDst = (((*puwSrc0) & 0x1F) + ((*puwSrc1) & 0x1F) + ((*puwSrc2) & 0x1F) + ((*puwSrc3) & 0x1F))>>2;
  *puwDst |= (((((*puwSrc0)>>5) & 0x3F) + (((*puwSrc1)>>5) & 0x3F) + (((*puwSrc2)>>5) & 0x3F) + (((*puwSrc3)>>5) & 0x3F))>>2)<<5;
  *puwDst |= ((((*puwSrc0)>>11) + ((*puwSrc1)>>11) + ((*puwSrc2)>>11) + ((*puwSrc3)>>11))>>2)<<11;
}

void scale_1_5_rgb565(const uint16_t* const psSrc, uint16_t* const psDst, uint uiSrcWidth, uint uiSrcHeight)
{
  // We expect the destination width and height to be a multiple of 3,
  // so we have to ensure the source width and height are multiples of 2;
  // otherwise we would need special handling for the last row/column.
  assert(uiSrcWidth%2 == 0);
  assert(uiSrcHeight%2 == 0);

  uint uiDstWidth = uiSrcWidth + (uiSrcWidth>>1);
  uint uiDstHeight = uiSrcHeight + (uiSrcHeight>>1);

  uint uiSrcOffsetX = 0;
  uint uiSrcOffsetY = 0;
  uint uiDstPos = 0;
  uint uiWidthDiv = uiDstWidth / 3; // number of 3-pixel pattern repetitions per line (uiDstWidth / a)

  for (uint uiY = 0; uiY < uiDstHeight; uiY+=3)
  {
    // Pattern row 0
    uiSrcOffsetX = 0;
    for (uint uiX = 0; uiX < uiWidthDiv; uiX++)
    {
      DST_SET_1_100( 0,0 ); 
      DST_SET_2_50_50( 0,0 , 1,0 ); 
      DST_SET_1_100( 1,0 );
      uiSrcOffsetX += 2; // 2^n 
    }

    // Pattern row 1
    uiSrcOffsetX = 0;
    for (uint uiX = 0; uiX < uiWidthDiv; uiX++)
    {
      DST_SET_2_50_50( 0,0 , 0,1 );
      DST_SET_4_25_25_25_25( 0,0 , 0,1 , 1,0, 1,1 );
      DST_SET_2_50_50( 1,0 , 1,1 );
      uiSrcOffsetX += 2; // 2^n 
    }
    
    // Pattern row 2
    uiSrcOffsetX = 0;
    for (uint uiX = 0; uiX < uiWidthDiv; uiX++)
    {
      DST_SET_1_100( 0,1 );
      DST_SET_2_50_50( 0,1 , 1,1 );
      DST_SET_1_100( 1,1 );
      uiSrcOffsetX += 2; // 2^n 
    }

    uiSrcOffsetY += 2; // 2^n
  }
}

Example 2

This can be seen more or less as a real-world example. While the first example was used to generate the demo images, this code was used for the demo videos. It includes additional optimizations and is designed to upscale image data in real time (~60 fps), captured from a Game Boy Advance (240×160) running a Game Boy Classic game (160×144). The scaling factor is again 1.5. The code takes into account the offset of the smaller image within the 240×160 display area. Since the TFT driver in this case requests the image line by line, only one line of the image is generated per call.

void tft_line_callback_1_5(uint uiLineCounter, uint16_t* const puwLineBuffer)
{
  uint16_t* auwFrameBuffer = agb_capture_get_framebuffer();
  uint16_t* psSrc;
  uint uiDstPos = 0;
  static uint uiSrcOffsetY = 0;

  if (uiLineCounter < 216)
  {

    // The 160x144 source image is centered in the 240x160 capture buffer
    // (x offset 40, y offset 8)
    psSrc = &auwFrameBuffer[240 * (8 + uiSrcOffsetY) + 40];

    switch (uiLineCounter % 3)
    {
      case 0:
        for (uint uiX = 0; uiX < 240/3; uiX++)
        {
          uint32_t ulSumR0001 = ((uint32_t)(psSrc[0 + 0*240] & 0xF800) + (uint32_t)(psSrc[0 + 1*240] & 0xF800));
          uint32_t ulSumG0001 = ((uint32_t)(psSrc[0 + 0*240] & 0x07E0) + (uint32_t)(psSrc[0 + 1*240] & 0x07E0));
          uint32_t ulSumB0001 = ((uint32_t)(psSrc[0 + 0*240] & 0x001F) + (uint32_t)(psSrc[0 + 1*240] & 0x001F));

          puwLineBuffer[uiDstPos++] = psSrc[0 + 0*240 ];
          puwLineBuffer[uiDstPos++] = ((ulSumR0001>>1)&0xF800) | ((ulSumG0001>>1)&0x07E0) | (ulSumB0001>>1);
          puwLineBuffer[uiDstPos++] = psSrc[1 + 0*240];
          psSrc += 2;
        }
        break;
      case 1:
        for (uint uiX = 0; uiX < 240/3; uiX++)
        {
          uint32_t ulSumR0001 = ((uint32_t)(psSrc[0 + 0*240] & 0xF800) + (uint32_t)(psSrc[0 + 1*240] & 0xF800));
          uint32_t ulSumG0001 = ((uint32_t)(psSrc[0 + 0*240] & 0x07E0) + (uint32_t)(psSrc[0 + 1*240] & 0x07E0));
          uint32_t ulSumB0001 = ((uint32_t)(psSrc[0 + 0*240] & 0x001F) + (uint32_t)(psSrc[0 + 1*240] & 0x001F));

          uint32_t ulSumR1011 = ((uint32_t)(psSrc[1 + 0*240] & 0xF800) + (uint32_t)(psSrc[1 + 1*240] & 0xF800));
          uint32_t ulSumG1011 = ((uint32_t)(psSrc[1 + 0*240] & 0x07E0) + (uint32_t)(psSrc[1 + 1*240] & 0x07E0));
          uint32_t ulSumB1011 = ((uint32_t)(psSrc[1 + 0*240] & 0x001F) + (uint32_t)(psSrc[1 + 1*240] & 0x001F));

          puwLineBuffer[uiDstPos++] = ((ulSumR0001>>1)&0xF800) | ((ulSumG0001>>1)&0x07E0) | (ulSumB0001>>1);
          puwLineBuffer[uiDstPos++] =
              (((ulSumR0001 + ulSumR1011)>>2)&0xF800)
            | (((ulSumG0001 + ulSumG1011)>>2)&0x07E0)
            | ((ulSumB0001 + ulSumB1011)>>2);
          puwLineBuffer[uiDstPos++] = ((ulSumR1011>>1)&0xF800) | ((ulSumG1011>>1)&0x07E0) | (ulSumB1011>>1);
          psSrc += 2;
        }
        break;
      case 2:
        for (uint uiX = 0; uiX < 240/3; uiX++)
        {
          uint32_t ulSumR0111 = ((uint32_t)(psSrc[0 + 1*240] & 0xF800) + (uint32_t)(psSrc[1 + 1*240] & 0xF800));
          uint32_t ulSumG0111 = ((uint32_t)(psSrc[0 + 1*240] & 0x07E0) + (uint32_t)(psSrc[1 + 1*240] & 0x07E0));
          uint32_t ulSumB0111 = ((uint32_t)(psSrc[0 + 1*240] & 0x001F) + (uint32_t)(psSrc[1 + 1*240] & 0x001F));

          puwLineBuffer[uiDstPos++] = psSrc[0 + 1*240];
          puwLineBuffer[uiDstPos++] = ((ulSumR0111>>1)&0xF800) | ((ulSumG0111>>1)&0x07E0) | (ulSumB0111>>1);
          puwLineBuffer[uiDstPos++] = psSrc[1 + 1*240];
          psSrc += 2;
        }
        uiSrcOffsetY += 2;
        break;
    }
  }
  else
  {
    memset(puwLineBuffer, 0x00, sizeof(uint16_t)*240); // clear one 240-pixel line
    uiSrcOffsetY = 0;
  }
}