"OpenGL Programming Guide" (8th edition) provides a thorough and comprehensive introduction to OpenGL 4.3 and the OpenGL Shading Language. For the first time, it combines shader techniques with the classic function-centric style of presentation, showing the latest OpenGL programming techniques.
Because a graphics processor can perform billions of calculations per second, it has become a device with very impressive computing power. In the past, these processors were designed to bear the massive mathematics of real-time graphics rendering. However, their potential computing power can also be applied to tasks that are independent of graphics processing, especially tasks that do not fit well into the fixed-function graphics pipeline. To make such applications possible, OpenGL introduces a special shader stage: the compute shader. A compute shader can be thought of as a pipeline with only a single stage; it has no fixed inputs or outputs, and all of its default inputs are passed through built-in variables. When additional input is needed, the shader can access textures and buffers under its own control, and those serve as its inputs and outputs. Its visible side effects are image stores, atomic operations, and accesses to atomic counters. Combined with generic memory read and write operations, this seemingly limited set of capabilities gives the compute shader considerable flexibility, frees it from the constraints of graphics work, and opens up a broad space of applications.
A compute shader in OpenGL is very similar to the other shaders. It is created with glCreateShader(), compiled with glCompileShader(), attached to a program with glAttachShader(), and finally, following the usual practice, linked with glLinkProgram(). Compute shaders are written in GLSL, and in principle they can use any functionality available to the other graphics shaders (such as vertex, geometry, or fragment shaders). Of course, this excludes features such as the geometry shader functions EmitVertex() and EndPrimitive(), along with the other built-in variables that are unique to particular stages of the graphics pipeline. Conversely, the compute shader also has some unique built-in variables and functions that cannot be accessed anywhere else in the OpenGL pipeline.
Workgroups and Their Execution
Just as the graphics rendering pipeline applies operations at its various stages to graphics-related units of work, the compute shader effectively places a compute-related unit of work into a single-stage pipeline. In this analogy, a vertex shader is applied to each vertex, a geometry shader to each primitive, and a fragment shader to each fragment. Graphics hardware obtains its performance mainly through parallelism, which is achieved by the large numbers of vertices, primitives, and fragments flowing through the respective pipeline stages. In the compute shader, this parallelism is much more explicit: tasks are executed in groups called workgroups (work groups). Neighboring invocations form a local workgroup (local workgroup), and the local workgroups together form a larger group called the global workgroup (global workgroup), which is generally the unit in which a command is executed.
Each compute shader invocation runs within one local workgroup of the global workgroup; every unit of work handled by the workgroup is called a work item (work item), and each execution of the shader on one work item is called an invocation. Invocations can communicate with each other through variables and memory, and can perform synchronization operations to keep their results consistent. Figure 12-1 illustrates this way of working. In this simplified example, the global workgroup consists of 16 local workgroups, and each local workgroup in turn contains 16 invocations, arranged in a 4 × 4 grid. Each invocation has an index represented as a two-dimensional vector.
Although the global and local workgroups in Figure 12-1 are two-dimensional, they are in fact three-dimensional; to adapt them to logically one- or two-dimensional tasks, we simply set the sizes of the extra dimensions to 1. The invocations of a compute shader are by nature independent of one another, and they may be executed in parallel by OpenGL-capable GPU hardware. In practice, most OpenGL hardware packs invocations into smaller sets that execute in lockstep, and then assembles these small sets into local workgroups. The size of the local workgroup is set with an input layout qualifier in the compute shader source code, and the size of the global workgroup is an integer multiple of the local workgroup size. When the compute shader executes, it can use built-in variables to learn its coordinates within the local workgroup, the size of the local workgroup, and the coordinates of its local workgroup within the global workgroup; from these it can further derive its global coordinates within the global workgroup, among other things. The shader is responsible for using these variables to determine which part of the computing task it should handle, and they also tell it which other invocations belong to the same workgroup, so that data can be shared.
The size of the local workgroup is declared with input layout qualifiers in the compute shader, using local_size_x, local_size_y, and local_size_z; their default values are all 1. For example, declaring local_size_x = N and local_size_y = M while omitting local_size_z creates a two-dimensional workgroup of N × M invocations. Example 12.1 declares a shader whose local workgroup size is 16 × 16.
Example 12.1 Declaring a simple local workgroup
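The book's listing is not reproduced here; a minimal shader along the lines the text describes might look like this (a sketch, not the original Example 12.1):

```glsl
#version 430 core

// Input layout qualifier declaring a 16 x 16 (x 1) local workgroup size.
layout (local_size_x = 16, local_size_y = 16) in;

void main(void)
{
    // An empty body: this is still a complete, compilable compute shader.
}
```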
Although the shader in Example 12.1 does nothing, it is still a "complete" shader that can be compiled, linked, and executed normally on OpenGL hardware. To create a compute shader, simply call glCreateShader() with the type GL_COMPUTE_SHADER, call glShaderSource() to set the shader source code, and then compile it as usual. The shader is then attached to a program, and glLinkProgram() is called. This produces the executable that the compute shader stage requires. Example 12.2 shows the complete steps, from creation through linking, of a compute program (we use "compute program" to mean a linked program that contains a compute shader).
Example 12.2 Creating, compiling, and linking a compute shader
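As a sketch of the steps just listed (the real Example 12.2 may differ; an OpenGL 4.3 context is assumed to be current, error checking is omitted, and `source` is a placeholder for the actual shader string):

```c
/* Create and compile the compute shader. */
GLuint shader = glCreateShader(GL_COMPUTE_SHADER);
const GLchar *source = "..."; /* e.g., the text of Example 12.1 */
glShaderSource(shader, 1, &source, NULL);
glCompileShader(shader);

/* Attach it to a program object and link. */
GLuint program = glCreateProgram();
glAttachShader(program, shader);
glLinkProgram(program);
```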
Once a compute shader has been created and linked as in Example 12.2, you can make it the current program to be executed with glUseProgram(), and then call glDispatchCompute() to send workgroups into the compute pipeline. Its prototype is as follows:
void glDispatchCompute(GLuint num_groups_x, GLuint num_groups_y, GLuint num_groups_z);
Dispatches compute workgroups in three dimensions. num_groups_x, num_groups_y, and num_groups_z set the number of workgroups in the X, Y, and Z dimensions, respectively. Each parameter must be greater than 0 and less than or equal to the corresponding element of the device-dependent constant array GL_MAX_COMPUTE_WORK_GROUP_COUNT.
When glDispatchCompute() is called, OpenGL creates a three-dimensional array of local workgroups whose size is num_groups_x × num_groups_y × num_groups_z. Note that any of the three dimensions may be 1, so the dispatch can effectively be one- or two-dimensional. The total number of compute shader invocations is therefore the size of the local workgroup, as defined in the shader code, multiplied by the size of this three-dimensional array of workgroups. As you can imagine, this method can hand very large jobs to the graphics processor, and compute shaders obtain parallelism relatively easily.
glDispatchComputeIndirect() is to glDispatchCompute() as glDrawArraysIndirect() is to glDrawArrays(): it behaves exactly like glDispatchCompute(), except that the parameters used to dispatch the computing task are stored in a buffer object. The buffer object is bound to the GL_DISPATCH_INDIRECT_BUFFER binding point, and the parameters stored in the buffer consist of three tightly packed unsigned integers. These three unsigned integers play the same role as the parameters of glDispatchCompute(). The prototype of glDispatchComputeIndirect() is as follows:

void glDispatchComputeIndirect(GLintptr indirect);

Dispatches compute workgroups in three dimensions, using parameters stored in a buffer object. indirect is the offset, in basic machine units, of the position within the buffer at which the parameters are stored. At that offset lie three tightly packed unsigned integer values that indicate the number of local workgroups. These unsigned integers are equivalent to the num_groups_x, num_groups_y, and num_groups_z parameters of glDispatchCompute(). Each of them must be greater than 0 and less than or equal to the corresponding element of the device-dependent constant array GL_MAX_COMPUTE_WORK_GROUP_COUNT.
The data in the buffer bound to GL_DISPATCH_INDIRECT_BUFFER can come from anywhere; for example, it could be generated by another compute shader. In this way, the graphics processor is able to feed itself work, setting parameters in a buffer to dispatch its own compute or drawing tasks. Example 12.3 uses glDispatchComputeIndirect() to dispatch a computing task.
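A sketch of the indirect path (the buffer contents and the name `compute_program` are illustrative; an OpenGL 4.3 context is assumed and error checking is omitted):

```c
/* Three tightly packed GLuints: num_groups_x, num_groups_y, num_groups_z. */
static const GLuint dispatch_params[3] = { 8, 8, 1 };

GLuint buffer;
glGenBuffers(1, &buffer);
glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, buffer);
glBufferData(GL_DISPATCH_INDIRECT_BUFFER, sizeof(dispatch_params),
             dispatch_params, GL_STATIC_DRAW);

glUseProgram(compute_program);   /* a previously linked compute program */
glDispatchComputeIndirect(0);    /* offset 0 into the bound buffer */
```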
Example 12.3 Dispatching compute workloads
Notice that Example 12.3 simply uses glUseProgram() to make a specific compute program the current program object. Aside from having no access to the fixed-function parts of the graphics pipeline (such as the rasterizer or the framebuffer), compute shader programs are completely normal, which means you can use glGetProgramiv() to query their properties (such as the active uniforms or uniform blocks) and access their uniforms as usual. Of course, a compute shader can also access all of the resources available to other shaders, such as images, samplers, buffers, atomic counters, and uniform blocks.
Compute shader programs also have some unique properties. For example, to get the size of the local workgroup (as set with the layout qualifiers in the source code), call glGetProgramiv() with pname set to GL_COMPUTE_WORK_GROUP_SIZE and params set to the address of an array of three unsigned integers. The three elements of this array will be filled, in order, with the local workgroup size in the X, Y, and Z directions.
Knowing Your Position in the Workgroup
Once a compute shader is running, it will likely need to assign results to specific locations in an output array (such as an image or an array of atomic counters), or read data from specific locations in an input array. To do this, it needs to know its position within the current local workgroup, as well as that workgroup's position in the wider global workgroup. For this purpose, OpenGL provides a set of built-in variables for compute shaders. As shown in Example 12.4, these built-in variables are implicitly declared.
Example 12.4 Declarations of the compute shader built-in variables
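For reference, the implicit declarations described here take roughly this form in GLSL 4.30 (comments are ours, not part of the declarations):

```glsl
const uvec3 gl_WorkGroupSize;      // fixed at compile time by the layout qualifiers
in    uvec3 gl_NumWorkGroups;      // the glDispatchCompute() parameters
in    uvec3 gl_WorkGroupID;        // which local workgroup this invocation is in
in    uvec3 gl_LocalInvocationID;  // position of this invocation within its workgroup
in    uvec3 gl_GlobalInvocationID; // global position of this invocation
in    uint  gl_LocalInvocationIndex; // flattened form of gl_LocalInvocationID
```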
In a compute shader, these variables are defined as follows:
gl_WorkGroupSize is a constant that stores the size of the local workgroup. It is declared by the local_size_x, local_size_y, and local_size_z layout qualifiers in the shader. The information is duplicated here for two main purposes: first, it allows the workgroup size to be referred to many times in the shader without relying on the preprocessor; second, it makes the workgroup size available as a multi-dimensional vector that can be used directly, without being constructed explicitly.
gl_NumWorkGroups is a vector containing the parameters passed to glDispatchCompute() (num_groups_x, num_groups_y, and num_groups_z). This lets the shader know the extent of the global workgroup it belongs to. Besides being more convenient than setting explicit uniforms by hand, some OpenGL hardware also provides an efficient path for setting these constants.
gl_LocalInvocationID indicates the position of the current invocation within the local workgroup. It ranges from uvec3(0) to gl_WorkGroupSize - uvec3(1).
gl_WorkGroupID indicates the position of the current local workgroup within the larger global workgroup. The range of this variable is from uvec3(0) to gl_NumWorkGroups - uvec3(1).
gl_GlobalInvocationID is derived from gl_LocalInvocationID, gl_WorkGroupSize, and gl_WorkGroupID. Its exact value is gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID, so it serves as an effective three-dimensional index of the current invocation's position within the global workgroup.
gl_LocalInvocationIndex is a flattened form of gl_LocalInvocationID. Its value is equal to gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y + gl_LocalInvocationID.y * gl_WorkGroupSize.x + gl_LocalInvocationID.x. It can be used as a one-dimensional index into two- or three-dimensional data.
Given that an invocation knows its position in the local and global workgroups, it can use that information to operate on data. As shown in Example 12.5, adding an image variable allows us to write data into the image at a location determined by the current invocation, so the image can be updated from within the compute shader.
Example 12.5 Operating on data
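The book's listing is not reproduced here; a shader matching the description (32 × 16 local workgroups, the normalized local ID written at the global invocation ID) could look like the following sketch. The image format, binding point, and the name `output_image` are our assumptions:

```glsl
#version 430 core

layout (local_size_x = 32, local_size_y = 16) in;

// Image variable (assumed format and binding) that each invocation writes to.
layout (rgba32f, binding = 0) uniform image2D output_image;

void main(void)
{
    // Normalize this invocation's local ID by the local workgroup size...
    vec4 color = vec4(vec2(gl_LocalInvocationID.xy) /
                      vec2(gl_WorkGroupSize.xy), 0.0, 1.0);
    // ...and store it at the location given by the global invocation ID.
    imageStore(output_image, ivec2(gl_GlobalInvocationID.xy), color);
}
```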
The shader in Example 12.5 normalizes the invocation's coordinates within the local workgroup by the local workgroup size, and then writes the result into the image at the location identified by the global invocation ID. The resulting image expresses the relationship between the global and local invocation IDs, and displays the rectangular local workgroups defined in the compute shader (in this case 32 × 16 invocations each, as shown in Figure 12-2).
To generate the image shown in Figure 12-2, after the compute shader has written its data, we simply render a full-screen triangle strip textured with the resulting image.
Communication and synchronization
When you call glDispatchCompute() (or glDispatchComputeIndirect()), the graphics processor performs a large amount of work internally. It will work in parallel as much as possible, and each compute shader invocation can be seen as a member of a team carrying out the task. Teamwork is reinforced by communication between team members, and even though OpenGL defines neither the order of execution nor the degree of parallelism, we can establish a certain amount of cooperation between invocations through shared variables. In addition, we can synchronize all the invocations of a local workgroup so that they arrive at the same location in the shader at the same time.
We can use the shared keyword to declare shader variables; its usage is similar to that of other qualifier keywords such as uniform, in, and out. Example 12.6 shows a declaration that uses the shared keyword.
Example 12.6 Declaring shared variables
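For instance, declarations of this shape are possible (the names and the array sizing are illustrative, not the book's listing):

```glsl
// A single shared unsigned integer, visible to every invocation
// in the same local workgroup:
shared uint shared_counter;

// A shared array with one element per invocation along the X axis;
// gl_WorkGroupSize is a compile-time constant, so it can size arrays.
shared float shared_data[gl_WorkGroupSize.x];
```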
If a variable is declared shared, it is kept in storage that is visible to all the compute shader invocations in the same local workgroup. If one invocation writes to a shared variable, the modified data will eventually become visible to all the invocations in that local workgroup. We say "eventually" because the relative order in which invocations execute is not defined, even within the same local workgroup. Therefore, the moment at which one invocation writes a shared variable may be far removed from the moment at which another invocation reads it, whether the read comes before or after the write. To ensure that the results are as expected, we need to use some kind of synchronization in the code; the next section describes this issue in detail.
Access to shared variables generally performs much better than access to images or to shader storage buffers (that is, main memory). Because shared memory is local to the shader processors and may be replicated across the device, access to shared variables can be faster than going through buffers. We therefore recommend that if your shader performs many memory accesses, and in particular if multiple invocations may access the same memory addresses, you first copy the data from memory into shared variables, operate on it there, and then, if necessary, write the results back to main memory.
Because variables declared shared must be stored in high-performance resources inside the graphics processor, and such resources are limited, it is worth querying the maximum combined size of the shared variables available to a compute shader program. To obtain this limit, call glGetIntegerv() with pname set to GL_MAX_COMPUTE_SHARED_MEMORY_SIZE.
Neither the execution order of the invocations within a local workgroup, nor the execution order of the local workgroups making up the global workgroup, is defined, so each invocation may run completely out of step with the others. If the invocations do not need to communicate with each other and can execute entirely independently, this poses no problem. However, if the invocations do need to communicate, whether through images, buffers, or shared memory, we need to synchronize their operations.
There are two types of synchronization commands. The first is the execution barrier (execution barrier), triggered with the barrier() function. It is similar to the barrier() function of the tessellation control shader and acts as a flow-control synchronization point for invocations. If a compute shader invocation reaches barrier(), it stops and waits until all the invocations in the same local workgroup have arrived. When the invocation resumes from barrier(), we can conclude that all the other invocations have reached the barrier, and that all of their operations before that point are complete. The use of barrier() in compute shaders is more flexible than in tessellation control shaders; in particular, there is no requirement that barrier() be called only from the shader's main() function. However, barrier() must be called within uniform flow control. That is, if any invocation of a local workgroup executes barrier(), then all the invocations of that workgroup must execute it. This is reasonable: one invocation's control flow has no way of knowing the situation of the others, so it can only assume that the other invocations will also reach the barrier; otherwise, a deadlock would occur.
To communicate between invocations within a local workgroup, you can write to a shared variable in one invocation and then read it in another. However, you must ensure that the destination invocation reads the shared variable only after the source invocation has completed the corresponding write. To guarantee this, the source invocation writes the variable, and then both invocations execute the barrier() function. When the destination invocation returns from barrier(), the source invocation must already have executed the same function (and therefore completed its write to the shared variable), so the value of the variable can be read safely.
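The pattern just described can be sketched as follows (a hypothetical fragment; `shared_data` and the neighbor exchange are illustrative, and `some_computation()` stands in for whatever work each invocation does):

```glsl
shared float shared_data[gl_WorkGroupSize.x];

void main(void)
{
    uint id = gl_LocalInvocationID.x;

    // Source side: each invocation writes its own slot...
    shared_data[id] = some_computation(id);   // some_computation() is illustrative

    // ...then every invocation in the local workgroup waits here.
    barrier();

    // Destination side: after the barrier, all the writes above are complete,
    // so reading a slot written by another invocation is safe.
    float neighbor = shared_data[(id + 1u) % gl_WorkGroupSize.x];
    // ... use neighbor ...
}
```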
The second type of synchronization is the memory barrier (memory barrier). The most direct version of the memory barrier is memoryBarrier(). Calling memoryBarrier() guarantees that the memory writes performed by the shader invocation are committed to memory, rather than lingering in caches or scheduling queues. All memory reads that occur after the memoryBarrier() will see the results of those writes, even in other invocations of the same compute shader. In addition, memoryBarrier() instructs the shader compiler not to reorder memory operations across the barrier. If memoryBarrier() seems like an overly strict constraint, your intuition is right: there is in fact a whole family of memory barrier functions besides memoryBarrier(), and memoryBarrier() can be thought of as simply calling each of them in turn, in some undefined order (although this description is not necessarily accurate).
The memoryBarrierAtomicCounter() function waits for updates to atomic counters to complete before continuing. The memoryBarrierBuffer() and memoryBarrierImage() functions wait for writes to buffer and image variables to complete. The memoryBarrierShared() function waits for updates to variables declared with the shared qualifier. These functions provide finer-grained control over which types of memory accesses to wait for. For example, if you use an atomic counter to arbitrate access to a buffer variable, you may want to ensure that updates to the atomic counter are visible to the other shader invocations without waiting for the buffer writes themselves to complete, since the latter may take much longer. In addition, calling memoryBarrierAtomicCounter() allows the shader compiler to reorder accesses to buffer variables without being constrained by the logic of the atomic counter operations.
Note that even after calling memoryBarrier() or one of its variants, there is still no guarantee that all invocations have reached the same point in the shader. To ensure that, you must call the execution barrier function barrier(), and only then read the memory that should have been written before the memoryBarrier().
Memory barriers are not required to order memory transactions within a single shader invocation. Reading a variable in an invocation always returns the value most recently written to it in that invocation, regardless of any reordering the compiler may have performed.
Finally, we introduce a function called groupMemoryBarrier(), which is equivalent to memoryBarrier() except that it applies only to other invocations within the same local workgroup. All the other barrier functions apply globally; that is, they ensure that memory writes performed by any invocation in the global workgroup are committed before execution continues.