So you want to know how modern video cards work. Here goes...
Modern video cards usually have several common features:
Basically a large chunk of fast ram. This memory is used for all sorts of things:
Buffers in video ram generally have a stride (also called pitch) associated with them. The stride is the width of the buffer in bytes. For example, if you have a 1024x768 pixel buffer at 16 bits/pixel (2 bytes/pixel), your stride would be:
1024 pixels * 2 bytes/pixel = 2048 bytes
At 32 bits/pixel (4 bytes/pixel), your stride would be:
1024 pixels * 4 bytes/pixel = 4096 bytes
Stride is important as it delineates where each line of the buffer starts and ends. With a linear buffer format, each line of the buffer follows the previous linearly in video ram:
framebuffer address
0 2048 4096
|---------------|---------------|---------------| ... |---------------|
The above layout is called "linear", because the layout of pixels in memory is like that on the screen: the pixel to the right of the current one on the screen is the one at the next highest address in memory. Tiling is a common variation where pixel layout in memory is not linear, but instead laid out in small squares. For example, a 4x4 tile would look like:
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
In other words, the 4th (zero-based) pixel in memory would be at screen coordinate (0, 1), whereas in linear memory it would be at screen coordinate (4, 0). The pattern then continues: the 16th (zero-based) pixel is screen coordinate (4, 0) instead of (16, 0). The reason for this alternate layout is it makes pixels that are adjacent on the screen also adjacent in memory, which improves cache locality.
Some hardware has multiple levels of tiling. For example, Radeon hardware can have microtiles composed of pixels, and macrotiles composed of microtiles. Sometimes the GPU can hide tiling from the CPU (ie, make tiled regions appear linear to PCI bus accesses).
The display cell on most video cards controls the size, timing, and type of signal sent to the monitor. There are 3 elements involved in this:
CRTC is a jargon term for "CRT controller", and CRTs are those big bulky glass things with pictures on them you see in old movies. Practically speaking, they define a region of pixels you can see.
The crtc controls the size and timing of the signal. This includes the vertical and horizontal sizes and blanking periods. Most cards have 2 or more crtcs. Each crtc can drive one or more outputs. Generally, each crtc can have it's own set of timings. If that crtc is driving more than one output, each output is driven at the same timings. Crtcs can also scan out of different parts of the framebuffer. If you have more than one crtc pointing at the same framebuffer address you have "clone" modes. Clone modes can also be achieved by driving more than one output with one crtc. If you point the crtcs to different parts of the framebuffer, you have dualhead.
On VGA-like signalling, this signal includes sync signals so the monitor can find the edges of the image. A modeline contains the timings (in pixels) where these sync signals are generated, relative to the active pixel times. (For the rest of this discussion we'll use "pixel" to mean "pixel interval" for brevity.) For example:
Modeline "1680x1050R" 119.00 1680 1728 1760 1840 1050 1053 1059 1080 +hsync -vsync
Here, 1680 of 1840 total pixel in each horizontal interval contain actual pixel data, and the horizontal sync pulse runs from pixel 1728 to pixel 1760. 1050 of the 1080 total lines contain actual pixel data, and the vertical sync pulse runs from line 1053 to line 1059. The interval between the end of the active region and the beginning of the sync pulse is called the front porch; the interval between the end of the sync pulse and the end of a line or frame is called the back porch. Sync polarity is set by convention, so the monitor can know which timing formula is in use. Normal modes generated by the GTF or CVT timing formulas are -hsync +vsync. Modes generated by the CVT reduced-blanking formula or by GTF when using a secondary curve are +hsync -vsync. Other polarity combos are occasionally seen for various historical modes.
The stride of a crtc is set to the stride of the buffer it is scanning out of. The stride of the buffer does not have to correspond the size of the crtc mode. This allows you to implement things like virtual desktops (1024x768 mode scanning out of a 2048x2048 pixel virtual desktop) or have multiple crtcs scan out of different parts of the same buffer (two 1024x768 crtcs scanning out of a 2048x768 pixel buffer).
The PLLs controls the pixel/video clock. This is the rate at which pixels are sent to the monitor. The higher the vertical refresh rate or resolution of your screen the higher the pixel clock.
The pixel clock is usually generated using the following formula:
pixel clock = (ref freq) * (m/n) * (1/(1 + r))
ref freq = the base clock frequency provided by the hardware
m = clock multiplier
n = clock divider
r = clock post divider
The outputs convert the data stream sent from the crtc into something the monitor understands. For example a DAC (Digital Analog Converter) converts the digital data stream into an analog signal for your monitor. Some other examples include TMDS (Transition Minimized Differential Signaling) transmitters (converts to the digital format used by DVI and some other connectors), LVDS (Low Voltage Differential Signaling) transmitters (commonly used to connect local flat panels like LCDs on laptops), and TV encoders (converts to an analog TV signal often with image scaling). Outputs can be integrated into the graphics chip or provided as external components (usually connected via a standard interface like DVO (Digital Video Out) or SDVO (Serial Digital Video Out)).
In most Xorg drivers there are 3 sets functions (usually found in chipname_driver.c) associated with configuring the display controllers:
Save:
Init:
Restore/Write:
The 2D engine (often called a blitter) basically moves data around in video ram. There are generally 4 operations done by the 2D engine: blits (copying data from one place to another), fills (draw a solid color), lines (draws lines), and color expansion (convert mono data to color data; e.g. convert monochrome font glyphs to the depth of your screen: usually 16 or 24 bit color). Logical operations (rops -- raster operations) can also be performed on the data. You have a source and destination buffers (often called surfaces) and these operations will use one or more surfaces. Some, like solid fills, only use a destination (where do I draw the red rectangle). Others like blits require a source and destination (copy this rectangle from address A to address B). Surfaces can (and often do) overlap. Because of this, blitting also has the concept of direction: if you are copying data from overlapping source and destination regions you need to make sure you copy the right data (e.g., top to bottom, right to left, etc.). Data from system memory can also be the source of these operations. This is referred to as a hostdata blit. With hostdata blits, host data is copied into a special region of video ram or into the command queue depending on the chip and from there it is copied to the destination in the framebuffer via the blitter.
2D engines are usually either controlled via direct MMIO access to the relevant registers or via a command queue. With direct MMIO, the appropriate values are written the relevant registers and then the command is usually executed when the last reg in the series is written or when the command register is written (depends on HW). With a command queue, part of the framebuffer is reserved as a command queue (FIFO). Commands and associated data are written sequentially to the queue and processed via the drawing engine.
Draw a solid red 200x400 pixel rectangle on the screen at (x,y) location (25, 75).
Copy a 200x400 pixel rectangle on the screen from (500, 400) to (25, 75).
EXA Solid Fill:
EXA Blit:
The 3D engine provides HW to build and rasterize a 3 dimensional scene. Most fixed function hardware has the following layout:
Generally 3 buffers are required for 3D:
ToDo: give driver examples
The overlay provides a mechanism for mixing data from multiple framebuffers automatically. It is most often used for mixing YUV (video) and RGB data. Most overlays contain special filtering and scaling hardware along with a colorspace converter. The streams are mixed or blended in several ways (depending on the hardware):
When an overlay is enabled, data from the overlay framebuffer is automatically mixed into the output stream during the scanout of the visible framebuffer. For example, with colorkeying, the crtc scans out of the primary framebuffer until it hits a region with a color matching the colorkey. At this point, the hardware automatically scans the data out of the overlay buffer.
Most hardware only has one overlay which is often tied to a crtc or can only be sourced to one crtc at a time.
Overlays are most commonly used for video playback and scaling. See Xv.
HW sprites are small buffers that get blended with the output stream during scan out. The most common examples are HW cursors and HW icons. Sprites are usually limited to small sizes (64x64 or 128x128 pixels) and on older hardware they are limited to 2 colors (newer hardware supports 32 bit ARGB sprites). The cursor image is written to a location in video ram and that image is mixed into the output stream at a particular location during scan out.
ToDo: give driver examples
PCI is by now the standard bus for connecting video cards to computers. AGP and PCIE merely look like enhanced versions of PCI, as far as the host software is concerned.
PCI devices can present various resources to the host, along with a standardized way of discovering and accessing them. The important ones as far as video is concerned are BARs, or bus address ranges. Each device can present up to 6 BARs, which can function as video memory or register banks. BARs can be either memory or I/O ranges, but are usually memory. There is also an optional "7th BAR", the option ROM, which most video devices support. This is used to support multiple video cards, since the ROM contains the initialization code for the chip, and most system BIOSes will not attempt to initialize more than one card at boot time.
PCI also provides a mechanism for supporting the legacy VGA address space and I/O ports, by allowing the host software to route this space to individual PCI cards. Again, this is mostly used for multi-card setups.
ToDo: fill me in.
ToDo: fill me in.
ToDo: fill me in.
ToDo: indexed vs. direct access registers
For more details, see these additional documents:
-- AlexDeucher