HowVideoCardsWork

Video Cards

So you want to know how modern video cards work. Here goes...

Modern video cards usually have several common features:

Video Ram
Display control
2D engine
3D engine
Overlay
HW sprites (cursor, icon, etc.)
AGP/PCI/PCIE
Apertures (registers, framebuffer)

Video Ram

Basically a large chunk of fast ram. This memory is used for all sorts of things:

Scan-out buffers (what you see on your monitor)
Offscreen rendering buffers
Cursor images
Command buffers
Vertex data
Textures

Buffers in video ram generally have a stride (also called pitch) associated with them. The stride is the width of the buffer in bytes. For example, if you have a 1024x768 pixel buffer at 16 bits/pixel (2 bytes/pixel), your stride would be:

1024 pixels * 2 bytes/pixel = 2048 bytes

At 32 bits/pixel (4 bytes/pixel), your stride would be:

1024 pixels * 4 bytes/pixel = 4096 bytes

Stride is important as it delineates where each line of the buffer starts and ends. With a linear buffer format, each line of the buffer follows the previous linearly in video ram:

framebuffer address
0              2048            4096
|---------------|---------------|---------------| ... |---------------|
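The stride arithmetic above can be sketched in a few lines. This is only an illustration, assuming a tightly packed linear buffer with no alignment padding; the function names are made up for this example and are not taken from any driver.

```python
def stride_bytes(width_px, bits_per_pixel):
    """Stride of a tightly packed linear buffer, in bytes."""
    return width_px * (bits_per_pixel // 8)

def pixel_offset(x, y, stride, bits_per_pixel):
    """Byte offset of pixel (x, y) from the start of a linear buffer."""
    return y * stride + x * (bits_per_pixel // 8)

stride16 = stride_bytes(1024, 16)
stride32 = stride_bytes(1024, 32)
print(stride16, stride32)

# The second line of the 16 bits/pixel buffer starts one stride in.
print(pixel_offset(0, 1, stride16, 16))
```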

Tiled framebuffers

The above layout is called "linear", because the layout of pixels in memory is like that on the screen: the pixel to the right of the current one on the screen is the one at the next highest address in memory. Tiling is a common variation where pixel layout in memory is not linear, but instead laid out in small squares. For example, a 4x4 tile would look like:

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

In other words, the 4th (zero-based) pixel in memory would be at screen coordinate (0, 1), whereas in linear memory it would be at screen coordinate (4, 0). The pattern then continues: the 16th (zero-based) pixel is screen coordinate (4, 0) instead of (16, 0). The reason for this alternate layout is that it keeps pixels that are close together on the screen, vertically as well as horizontally, close together in memory, which improves cache locality.
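As a rough sketch of the 4x4 tiling described above, the mapping from a pixel's position in memory to its screen coordinate might look like this. It assumes a single level of tiling, a buffer width that is a multiple of the tile size, and tiles stored row by row; real hardware layouts vary, and the function names are hypothetical.

```python
TILE = 4  # 4x4 tiles, as in the example above

def tiled_index_to_xy(index, width_px):
    """Screen (x, y) of the index-th pixel in memory for a tiled buffer.

    Assumes width_px is a multiple of TILE and tiles are stored row by row.
    """
    tiles_per_row = width_px // TILE
    tile, within = divmod(index, TILE * TILE)
    tile_y, tile_x = divmod(tile, tiles_per_row)
    return tile_x * TILE + within % TILE, tile_y * TILE + within // TILE

def linear_index_to_xy(index, width_px):
    """Screen (x, y) of the index-th pixel in memory for a linear buffer."""
    return index % width_px, index // width_px

for i in (4, 16):
    print(i, tiled_index_to_xy(i, 1024), linear_index_to_xy(i, 1024))
```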

Some hardware has multiple levels of tiling. For example, Radeon hardware can have microtiles composed of pixels, and macrotiles composed of microtiles. Sometimes the GPU can hide tiling from the CPU (i.e., make tiled regions appear linear to PCI bus accesses).

Display control

Overview

The display cell on most video cards controls the size, timing, and type of signal sent to the monitor. There are 3 elements involved in this:

CRTC or Display Controller
PLLs (pixel clock)
Outputs

CRTCs

CRTC is a jargon term for "CRT controller", and CRTs are those big bulky glass things with pictures on them you see in old movies. Practically speaking, a CRTC defines a region of pixels you can see.

The crtc controls the size and timing of the signal. This includes the vertical and horizontal sizes and blanking periods. Most cards have 2 or more crtcs. Each crtc can drive one or more outputs. Generally, each crtc can have its own set of timings. If a crtc is driving more than one output, each output is driven at the same timings. Crtcs can also scan out of different parts of the framebuffer. If you have more than one crtc pointing at the same framebuffer address you have "clone" modes. Clone modes can also be achieved by driving more than one output with one crtc. If you point the crtcs to different parts of the framebuffer, you have dualhead.

On VGA-like signalling, this signal includes sync signals so the monitor can find the edges of the image. A modeline contains the timings (in pixels) where these sync signals are generated, relative to the active pixel times. (For the rest of this discussion we'll use "pixel" to mean "pixel interval" for brevity.) For example:

Modeline "1680x1050R"  119.00  1680 1728 1760 1840  1050 1053 1059 1080 +hsync -vsync

Here, 1680 of the 1840 total pixels in each horizontal interval contain actual pixel data, and the horizontal sync pulse runs from pixel 1728 to pixel 1760. 1050 of the 1080 total lines contain actual pixel data, and the vertical sync pulse runs from line 1053 to line 1059. The interval between the end of the active region and the beginning of the sync pulse is called the front porch; the interval between the end of the sync pulse and the end of a line or frame is called the back porch. Sync polarity is set by convention, so the monitor can know which timing formula is in use. Normal modes generated by the GTF or CVT timing formulas are -hsync +vsync. Modes generated by the CVT reduced-blanking formula or by GTF when using a secondary curve are +hsync -vsync. Other polarity combos are occasionally seen for various historical modes.
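To make the modeline fields concrete, here is a small sketch that splits the example modeline into its timings and derives the porches and the refresh rate. The parser is an illustration only (it is not how the X server parses modelines), and the dictionary keys are names chosen for this example. The refresh rate is simply the pixel clock divided by the total number of pixel intervals per frame.

```python
import shlex

def parse_modeline(line):
    """Split a modeline into named timings (all horizontal values in pixels,
    all vertical values in lines)."""
    fields = shlex.split(line)  # handles the quoted mode name
    clock_mhz = float(fields[2])
    hdisp, hsync_start, hsync_end, htotal = map(int, fields[3:7])
    vdisp, vsync_start, vsync_end, vtotal = map(int, fields[7:11])
    return {
        "name": fields[1],
        "clock_mhz": clock_mhz,
        "h_front_porch": hsync_start - hdisp,
        "h_sync_width": hsync_end - hsync_start,
        "h_back_porch": htotal - hsync_end,
        "v_front_porch": vsync_start - vdisp,
        "v_sync_width": vsync_end - vsync_start,
        "v_back_porch": vtotal - vsync_end,
        "refresh_hz": clock_mhz * 1e6 / (htotal * vtotal),
        "flags": fields[11:],
    }

mode = parse_modeline(
    'Modeline "1680x1050R"  119.00  1680 1728 1760 1840  '
    '1050 1053 1059 1080 +hsync -vsync'
)
for key, value in mode.items():
    print(key, value)
```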

The stride of a crtc is set to the stride of the buffer it is scanning out of. The stride of the buffer does not have to correspond to the size of the crtc mode. This allows you to implement things like virtual desktops (1024x768 mode scanning out of a 2048x2048 pixel virtual desktop) or have multiple crtcs scan out of different parts of the same buffer (two 1024x768 crtcs scanning out of a 2048x768 pixel buffer).
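For the two-crtc case above, each crtc would be programmed with the same stride but a different start address. The following sketch works that out, assuming 32 bits/pixel and a buffer that starts at offset 0 in video ram; the helper name is hypothetical.

```python
def scanout_base(fb_offset, stride, bytes_per_pixel, x, y):
    """Byte address where a crtc starts scanning out, given the (x, y)
    position of its top-left corner inside the larger buffer."""
    return fb_offset + y * stride + x * bytes_per_pixel

BYTES_PER_PIXEL = 4                      # assumed 32 bits/pixel
buffer_stride = 2048 * BYTES_PER_PIXEL   # stride of the 2048x768 buffer

# Both crtcs use the buffer's stride, not the stride of a 1024-wide mode.
crtc1_base = scanout_base(0, buffer_stride, BYTES_PER_PIXEL, 0, 0)
crtc2_base = scanout_base(0, buffer_stride, BYTES_PER_PIXEL, 1024, 0)
print(buffer_stride, crtc1_base, crtc2_base)
```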

PLLs

The PLLs control the pixel/video clock. This is the rate at which pixels are sent to the monitor. The higher the vertical refresh rate or resolution of your screen, the higher the pixel clock.

The pixel clock is usually generated using the following formula:

pixel clock = (ref freq) * (m/n) * (1/(1 + r))

ref freq = the base clock frequency provided by the hardware
m = clock multiplier
n = clock divider
r = clock post divider
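A driver typically has to pick m, n, and r so that the result lands as close as possible to the requested pixel clock. The sketch below evaluates the formula above and does a brute-force search. The 27 MHz reference frequency and the divider ranges are assumptions chosen for illustration; real chips have their own limits and usually additional constraints (for example on the intermediate VCO frequency) that are not modeled here.

```python
def pixel_clock(ref_freq, m, n, r):
    """pixel clock = ref_freq * (m / n) * (1 / (1 + r)), as in the formula above."""
    return ref_freq * m / (n * (1 + r))

def find_dividers(ref_freq, target, m_range, n_range, r_range):
    """Return the (m, n, r) whose pixel clock is closest to target."""
    best = None
    for r in r_range:
        for n in n_range:
            for m in m_range:
                err = abs(pixel_clock(ref_freq, m, n, r) - target)
                if best is None or err < best[0]:
                    best = (err, m, n, r)
    return best[1:]

REF_MHZ = 27.0  # assumed reference clock, in MHz
m, n, r = find_dividers(REF_MHZ, 119.0, range(1, 256), range(1, 64), range(0, 4))
print(m, n, r, pixel_clock(REF_MHZ, m, n, r))
```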

Outputs

The outputs convert the data stream sent from the crtc into something the monitor understands. For example, a DAC (Digital-to-Analog Converter) converts the digital data stream into an analog signal for your monitor. Some other examples include TMDS (Transition Minimized Differential Signaling) transmitters (which convert to the digital format used by DVI and some other connectors), LVDS (Low Voltage Differential Signaling) transmitters (commonly used to connect local flat panels like LCDs on laptops), and TV encoders (which convert to an analog TV signal, often with image scaling). Outputs can be integrated into the graphics chip or provided as external components (usually connected via a standard interface like DVO (Digital Video Out) or SDVO (Serial Digital Video Out)).

Driver Examples

In most Xorg drivers there are 3 sets of functions (usually found in chipname_driver.c) associated with configuring the display controllers:

Save() - Saves the current hardware state of the output registers
Init() - Initializes the hardware register data structures for the requested output configuration
Restore()/Write() - Writes the initialized register values set up in the Init() functions to the hardware

Radeon

Save:

RADEONSaveMemMapRegisters() - saves memory map register state
RADEONSaveCommonRegisters() - saves common register state
RADEONSaveCrtcRegisters() - saves the registers for the primary crtc
RADEONSaveFPRegisters() - saves the registers for the panel outputs (RMX, TMDS, LVDS)
RADEONSaveCrtc2Registers() - saves the registers for the secondary crtc
RADEONSavePLLRegisters() - saves the registers for the primary (crtc1) pixel clock
RADEONSavePLL2Registers() - saves the registers for the secondary (crtc2) pixel clock
RADEONSavePalette() - saves the palette/CLUT registers
RADEONSaveMode() - calls the above functions

Init:

RADEONInitOutputRegisters() - Initializes registers for outputs and sets up the crtc to output mapping. Calls output init functions
RADEONInitCrtcRegisters() - Initializes registers for crtc1. Calls RADEONInitOutputRegisters() to initialize the outputs driven by crtc1 and RADEONInitPLLRegisters() to set up the pixel clock.
RADEONInitCrtc2Registers() - Initializes registers for crtc2. Calls RADEONInitOutputRegisters() to initialize the outputs driven by crtc2 and RADEONInitPLL2Registers() to set up the pixel clock.
RADEONInitPLLRegisters() - Initializes the pixel clock for crtc1
RADEONInitPLL2Registers() - Initializes the pixel clock for crtc2
RADEONInit2() - calls the above functions

Restore/Write:

RADEONRestoreMemMapRegisters() - restores memory map register state
RADEONRestoreCommonRegisters() - restores common register state
RADEONRestoreCrtcRegisters() - restores the registers for the primary crtc
RADEONRestoreFPRegisters() - restores the registers for the panel outputs (RMX, TMDS, LVDS)
RADEONRestoreCrtc2Registers() - restores the registers for the secondary crtc
RADEONRestorePLLRegisters() - restores the registers for the primary (crtc1) pixel clock
RADEONRestorePLL2Registers() - restores the registers for the secondary (crtc2) pixel clock
RADEONRestorePalette() - restores the palette/CLUT registers
RADEONEnableDisplay() - enables/disables outputs
RADEONRestoreMode() - calls the above functions

2D Engine

Overview

The 2D engine (often called a blitter) basically moves data around in video ram. There are generally 4 operations done by the 2D engine: blits (copying data from one place to another), fills (drawing a solid color), lines (drawing lines), and color expansion (converting mono data to color data; e.g., converting monochrome font glyphs to the depth of your screen, usually 16 or 24 bit color). Logical operations (rops, or raster operations) can also be performed on the data.

These operations work on source and destination buffers (often called surfaces), and each operation uses one or more surfaces. Some, like solid fills, only use a destination (where do I draw the red rectangle). Others, like blits, require a source and a destination (copy this rectangle from address A to address B). Surfaces can (and often do) overlap. Because of this, blitting also has the concept of direction: if you are copying data between overlapping source and destination regions, you need to copy in an order (e.g., top to bottom, right to left, etc.) that does not overwrite source data before it has been read.

Data from system memory can also be the source of these operations. This is referred to as a hostdata blit. With hostdata blits, host data is copied into a special region of video ram or into the command queue, depending on the chip, and from there it is copied to the destination in the framebuffer via the blitter.
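Color expansion and rops are easy to model in software. The sketch below expands one row of a monochrome glyph into foreground/background pixels and then combines it with the destination using a rop. The most-significant-bit-first bit order and the small rop table are assumptions for illustration; hardware bit order and the supported rop set depend on the chip.

```python
# A few raster operations, each combining a source and a destination pixel.
ROPS = {
    "copy": lambda src, dst: src,
    "xor": lambda src, dst: src ^ dst,
    "and": lambda src, dst: src & dst,
    "or": lambda src, dst: src | dst,
}

def color_expand(mono_bytes, width, fg, bg):
    """Expand 1 bit/pixel data into a list of color pixels.

    Assumes the leftmost pixel is the most significant bit of each byte.
    """
    pixels = []
    for x in range(width):
        bit = (mono_bytes[x // 8] >> (7 - x % 8)) & 1
        pixels.append(fg if bit else bg)
    return pixels

def apply_rop(rop, src_pixels, dst_pixels):
    """Combine source and destination pixels with the named rop."""
    return [ROPS[rop](s, d) for s, d in zip(src_pixels, dst_pixels)]

glyph_row = bytes([0b10100000])
expanded = color_expand(glyph_row, 4, fg=0xFFFFFF, bg=0x000000)
print([hex(p) for p in expanded])
print([hex(p) for p in apply_rop("xor", expanded, [0x00FF00] * 4)])
```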

2D engines are usually controlled either via direct MMIO access to the relevant registers or via a command queue. With direct MMIO, the appropriate values are written to the relevant registers, and the command is usually executed when the last reg in the series is written or when the command register is written (depends on HW). With a command queue, part of the framebuffer is reserved as a command queue (FIFO). Commands and associated data are written sequentially to the queue and processed by the drawing engine.

Solid example

Draw a solid red 200x400 pixel rectangle on the screen at (x,y) location (25, 75).

Set the pitch of your destination surface to the pitch of the screen and set the offset to the offset in video ram where your screen buffer is located.
Set the rop you want to use
Set the color you want
Set the destination rectangle width and height and (x,y) location relative to the surface
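The steps above map onto a handful of parameters. The following is a software model of what the fill does to video ram, not real register programming: it assumes a 1024x768 screen at 32 bits/pixel located at offset 0, a plain copy rop, and little-endian pixel storage. All names are hypothetical.

```python
def solid_fill(vram, offset, pitch, bytes_per_pixel, x, y, w, h, color):
    """Software model of a solid fill with a plain copy rop."""
    pixel = color.to_bytes(bytes_per_pixel, "little")
    for row in range(y, y + h):
        start = offset + row * pitch + x * bytes_per_pixel
        vram[start:start + w * bytes_per_pixel] = pixel * w

def read_pixel(vram, offset, pitch, bytes_per_pixel, x, y):
    """Read back one pixel from the modeled video ram."""
    start = offset + y * pitch + x * bytes_per_pixel
    return int.from_bytes(vram[start:start + bytes_per_pixel], "little")

WIDTH, HEIGHT, BYTES_PP = 1024, 768, 4   # assumed screen format
PITCH = WIDTH * BYTES_PP                 # destination pitch = screen pitch
SCREEN_OFFSET = 0                        # assumed location of the screen buffer
RED = 0x00FF0000

vram = bytearray(PITCH * HEIGHT)
# 200x400 red rectangle at (25, 75), as in the example above.
solid_fill(vram, SCREEN_OFFSET, PITCH, BYTES_PP, 25, 75, 200, 400, RED)
print(hex(read_pixel(vram, SCREEN_OFFSET, PITCH, BYTES_PP, 25, 75)))
```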

Blit Example

Copy a 200x400 pixel rectangle on the screen from (500, 400) to (25, 75).

Set the pitch of your source surface to the pitch of the screen and set the offset to the offset in video ram where your screen buffer is located.
Set the pitch of your destination surface to the pitch of the screen and set the offset to the offset in video ram where your screen buffer is located.
Set the rop you want to use
Set the source rectangle width and height and (x,y) location relative to the source surface
Set the destination rectangle width and height and (x,y) location relative to the destination surface
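A software model of the blit helps show why direction matters when source and destination overlap. In this sketch the rows are copied bottom to top when the destination lies below the source, so no source row is overwritten before it is read. Within a row, the Python slice on the right-hand side makes a temporary copy, which sidesteps the left/right direction choice that real hardware also has to make. The tiny 8x8 buffer at 1 byte/pixel is an assumption to keep the demo small.

```python
def blit(vram, offset, pitch, bytes_per_pixel, sx, sy, dx, dy, w, h):
    """Software model of a screen-to-screen copy on one surface."""
    rows = range(h)
    if dy > sy:
        # Destination is below the source: copy bottom to top so that
        # overlapping source rows are read before they are overwritten.
        rows = reversed(rows)
    nbytes = w * bytes_per_pixel
    for r in rows:
        src = offset + (sy + r) * pitch + sx * bytes_per_pixel
        dst = offset + (dy + r) * pitch + dx * bytes_per_pixel
        vram[dst:dst + nbytes] = vram[src:src + nbytes]

PITCH = 8                          # 8 pixels wide at 1 byte/pixel
vram = bytearray(range(64))        # every pixel gets a unique value
before = bytes(vram)               # snapshot for comparison

# Copy a 4x4 rectangle from (0, 0) to (1, 1); the two regions overlap.
blit(vram, 0, PITCH, 1, 0, 0, 1, 1, 4, 4)
for y in range(6):
    print(list(vram[y * PITCH:(y + 1) * PITCH]))
```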

Xorg Acceleration Examples

Blits: XAA ScreentoScreenCopy; EXA Copy
Hostdata Blits: XAA ImageWrite, CPUToScreen functions; EXA UploadToScreen
Solid Fills: XAA SolidFillRect; EXA Solid
Lines: XAA SolidBresenhamLine, SolidTwoPointLine
Color Expansion: XAA CPUToScreenColorExpandFill

Driver Examples

Radeon

EXA Solid Fill:

RADEONPrepareSolid() - Sets up the hardware state for the solid fill
RADEONSolid() - Draws a solid rectangle of size w x h at location (x,y)

EXA Blit:

RADEONPrepareCopy() - Sets up the hardware state for the copy
RADEONCopy() - Performs a copy of a rectangle of size w x h from (x1,y1) to (x2,y2)

3D Engine

Overview

The 3D engine provides HW to build and rasterize a 3 dimensional scene. Most fixed function hardware has the following layout:

Small set of 3D state registers. These control the state of the 3D scene: fog, mipmapping, texturing, blending, etc.
3D engine offset registers. Controls where in the framebuffer the 3D engine renders to
Texture control and offset registers. Control texture format and size and where the textures are located
Depth buffer control and offset registers. Controls depth buffer layout and location
Vertex registers. Used to specify the location and format of the vertices which make up the 3D scene.

Buffers

Generally 3 buffers are required for 3D:

Front buffer. This is usually the buffer that is scanned out for the user to see.
Back buffer. This is the buffer that is rendered to while the front buffer is being scanned out.
Depth buffer. Also called z-buffer. This buffer is used to determine the relative depth of different objects in the 3D scene. This is used to determine which elements are visible and which are obscured.
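The role of the depth buffer can be shown with a minimal depth test. The sketch assumes that smaller z values are closer to the viewer and that the test is "less than"; hardware usually makes the comparison function configurable. The one-dimensional buffers and function name are simplifications for this example.

```python
FAR = float("inf")  # depth buffer is cleared to "infinitely far away"

def draw_fragment(color_buf, depth_buf, x, z, color):
    """Write color at x only if the fragment is nearer than what is
    already stored there. Returns True if the pixel was written."""
    if z < depth_buf[x]:
        depth_buf[x] = z
        color_buf[x] = color
        return True
    return False

color_buf = [0x000000] * 4
depth_buf = [FAR] * 4

draw_fragment(color_buf, depth_buf, 1, 0.5, 0xFF0000)  # red, mid distance
draw_fragment(color_buf, depth_buf, 1, 0.8, 0x00FF00)  # green, behind red
draw_fragment(color_buf, depth_buf, 1, 0.2, 0x0000FF)  # blue, in front
print(hex(color_buf[1]), depth_buf[1])
```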

ToDo: give driver examples

Overlay

Overview

The overlay provides a mechanism for mixing data from multiple framebuffers automatically. It is most often used for mixing YUV (video) and RGB data. Most overlays contain special filtering and scaling hardware along with a colorspace converter. The streams are mixed or blended in several ways (depending on the hardware):

Colorkey. Overlay data is overlaid on the primary data stream where the color of the primary stream matches the colorkey RGB color. Generally used to overlay YUV or RGB data on an RGB surface.
Chromakey. Same as colorkey but the key is a YUV value rather than RGB. Generally used to overlay RGB or YUV data on a YUV surface.
Position/Offset. Overlay data appears at specified position in the scan out buffer.

When an overlay is enabled, data from the overlay framebuffer is automatically mixed into the output stream during the scanout of the visible framebuffer. For example, with colorkeying, the crtc scans out of the primary framebuffer until it hits a region with a color matching the colorkey. At this point, the hardware automatically scans the data out of the overlay buffer.
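Colorkeying for a single scanline can be modeled as a per-pixel selection between the two streams. This sketch assumes both streams are already in the same RGB format and at the same size, so it leaves out the scaling and colorspace conversion that overlay hardware normally performs; the colorkey value is arbitrary.

```python
def mix_scanline(primary, overlay, colorkey):
    """Scan out the overlay pixel wherever the primary pixel matches
    the colorkey; otherwise scan out the primary pixel."""
    return [o if p == colorkey else p for p, o in zip(primary, overlay)]

COLORKEY = 0x0000FE  # assumed key color painted into the video window
primary = [0x202020, COLORKEY, COLORKEY, 0x202020]
overlay = [0xAAAAAA, 0xBBBBBB, 0xCCCCCC, 0xDDDDDD]
print([hex(p) for p in mix_scanline(primary, overlay, COLORKEY)])
```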

Most hardware only has one overlay which is often tied to a crtc or can only be sourced to one crtc at a time.

Overlays are most commonly used for video playback and scaling. See Xv.

Driver Examples

Radeon

RADEONPutImage() - Prepares and copies overlay data to video ram, then calls RADEONDisplayVideo().
RADEONDisplayVideo() - Writes the overlay configuration to hardware to display the overlay data.

HW sprites

Overview

HW sprites are small buffers that get blended with the output stream during scan out. The most common examples are HW cursors and HW icons. Sprites are usually limited to small sizes (64x64 or 128x128 pixels) and on older hardware they are limited to 2 colors (newer hardware supports 32 bit ARGB sprites). The cursor image is written to a location in video ram and that image is mixed into the output stream at a particular location during scan out.
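As an illustration of how a 32 bit ARGB cursor could be mixed into the output stream, the sketch below blends one row of a sprite over one scanline at a given position. It assumes non-premultiplied alpha and 8 bits per channel, and it clips the sprite at the edges of the scanline; actual hardware blending details differ between chips.

```python
def blend_argb(sprite_px, under_px):
    """Blend one ARGB sprite pixel over one RGB scanout pixel."""
    alpha = (sprite_px >> 24) & 0xFF
    out = 0
    for shift in (16, 8, 0):  # red, green, blue
        s = (sprite_px >> shift) & 0xFF
        u = (under_px >> shift) & 0xFF
        out |= ((s * alpha + u * (255 - alpha) + 127) // 255) << shift
    return out

def draw_cursor(scanline, sprite_row, cursor_x):
    """Return a copy of scanline with sprite_row mixed in at cursor_x."""
    out = list(scanline)
    for i, px in enumerate(sprite_row):
        x = cursor_x + i
        if 0 <= x < len(out):  # clip at the screen edges
            out[x] = blend_argb(px, out[x])
    return out

scanline = [0x112233] * 4
sprite_row = [0xFFFFFFFF, 0x00000000]  # opaque white, fully transparent
print([hex(p) for p in draw_cursor(scanline, sprite_row, 1)])
```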

ToDo: give driver examples

PCI

PCI is by now the standard bus for connecting video cards to computers. AGP and PCIE merely look like enhanced versions of PCI, as far as the host software is concerned.

PCI devices can present various resources to the host, along with a standardized way of discovering and accessing them. The important ones as far as video is concerned are BARs, or base address registers, each of which describes one bus address range. Each device can present up to 6 BARs, which can map video memory or register banks. BARs can be either memory or I/O ranges, but are usually memory. There is also an optional "7th BAR", the option ROM, which most video devices support. The option ROM is used to support multiple video cards, since the ROM contains the initialization code for the chip, and most system BIOSes will not attempt to initialize more than one card at boot time.
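The low bits of a 32 bit BAR value encode what kind of range it describes. The sketch below decodes them: bit 0 distinguishes I/O from memory, and for memory BARs bits 2:1 give the address width and bit 3 the prefetchable flag. The example values are made up. For a 64 bit BAR the upper half of the address lives in the next BAR slot, which this sketch does not read.

```python
def decode_bar(value):
    """Decode the flag bits and base address of one 32 bit BAR value."""
    if value & 0x1:
        # I/O space BAR: bits 1:0 are flags, the rest is the port base.
        return {"kind": "io", "base": value & ~0x3 & 0xFFFFFFFF}
    mem_type = (value >> 1) & 0x3
    return {
        "kind": "memory",
        "is_64bit": mem_type == 0x2,   # upper 32 bits are in the next BAR
        "prefetchable": bool(value & 0x8),
        "base": value & ~0xF & 0xFFFFFFFF,
    }

# Hypothetical BAR values, e.g. a prefetchable framebuffer aperture,
# a non-prefetchable register aperture, and an I/O port range.
for raw in (0xE0000008, 0xFDEF0000, 0x0000C001):
    print(hex(raw), decode_bar(raw))
```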

PCI also provides a mechanism for supporting the legacy VGA address space and I/O ports, by allowing the host software to route this space to individual PCI cards. Again, this is mostly used for multi-card setups.

AGP

ToDo: fill me in.

PCIE

ToDo: fill me in.

Apertures

ToDo: fill me in.

ToDo: indexed vs. direct access registers

Further Reference

For more details, see these additional documents:

-- AlexDeucher