8.2 Core Design Strategies: Data vs. Task Parallelism

The foundational step in creating any parallel algorithm is problem decomposition—the process of breaking a large computational problem into smaller pieces that can be solved concurrently.1 This decomposition can be approached from two distinct perspectives: one can either partition the data on which the computation operates, or one can partition the computation itself. These two approaches give rise to the two fundamental strategies in parallel algorithm design: data parallelism and task parallelism.30

Data Parallelism (Domain Decomposition)

Data parallelism, also known as domain decomposition, is a strategy that focuses on distributing the data across different processors. Each processor then performs the same operation, or sequence of operations, on its assigned subset of the data.32 This model is typically characterized by synchronous computation, where processors execute the same instructions in lock-step or in loosely synchronized phases.32
A canonical example of data parallelism is image processing. To apply a sharpening filter to a large image, the image can be divided into multiple tiles. Each processor receives one tile and independently applies the sharpening filter to the pixels within its tile.36 After all processors have completed their work, the processed tiles are reassembled to form the final sharpened image. Another fundamental example is vector addition, where for two vectors v and w, the operation z[i]←v[i]+w[i] for each element i is an independent computation that can be assigned to a different processor.6
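
To make this concrete, the following is a minimal sketch of parallel vector addition in Python using the standard concurrent.futures module; the function names, the choice of four workers, and the chunking scheme are illustrative assumptions rather than details prescribed by the text. The data is partitioned into contiguous slices, every worker executes the same add_chunk function on its own slice (the SPMD pattern), and the ordered partial results are reassembled.

```python
from concurrent.futures import ProcessPoolExecutor

def add_chunk(chunk):
    # The same operation, applied by every worker to its own slice of the data.
    v_part, w_part = chunk
    return [a + b for a, b in zip(v_part, w_part)]

def parallel_vector_add(v, w, workers=4):
    # Domain decomposition: each worker receives one contiguous slice of both vectors.
    size = (len(v) + workers - 1) // workers
    chunks = [(v[i:i + size], w[i:i + size]) for i in range(0, len(v), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the partial results can simply be concatenated.
        partial_results = list(pool.map(add_chunk, chunks))
    return [x for part in partial_results for x in part]

if __name__ == "__main__":
    v, w = list(range(100_000)), list(range(100_000))
    z = parallel_vector_add(v, w)
    assert all(z[i] == v[i] + w[i] for i in (0, 1, 99_999))
```
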
Strengths:

  • Scalability: Data-parallel applications often exhibit excellent scalability. As the dataset grows, performance can be maintained or improved by simply adding more processors to handle the additional data partitions.33
  • Simplicity: The programming model is often simpler because only one program needs to be written, which is then executed by all processors on their respective data (a style known as Single Program, Multiple Data or SPMD).37
  • Load Balancing: For uniformly distributed data, this approach naturally leads to good load balancing, as each processor is assigned an equal amount of work.32

Weaknesses:

  • Limited Applicability: This strategy is only effective for problems where the same operations can be applied to all data partitions. It is less suitable for algorithms with complex, data-dependent control flow.37
  • Communication Overhead: While the computation is parallel, there is often a need for communication to distribute the initial data and, more significantly, to combine or aggregate the partial results at the end. This final reduction or synchronization step can become a bottleneck, especially with a large number of processors (a reduction is sketched just after this list).37
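
The following sketch, under the same assumptions as the previous one (Python, concurrent.futures, illustrative helper names), makes the aggregation step explicit: partial sums are computed in parallel, and the partial results are then combined in a tree-shaped reduction so that the number of sequential combining rounds grows only logarithmically with the number of partitions rather than linearly.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

def tree_reduce(values, pool):
    # Pairwise (tree-shaped) combination: O(log P) combining rounds instead of O(P).
    while len(values) > 1:
        pairs = [values[i:i + 2] for i in range(0, len(values), 2)]
        values = list(pool.map(sum, pairs))
    return values[0]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 125_000] for i in range(0, len(data), 125_000)]
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(partial_sum, chunks))   # parallel computation phase
        total = tree_reduce(partials, pool)              # aggregation phase
    assert total == sum(data)
```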

Task Parallelism (Functional Decomposition)

Task parallelism, also known as functional decomposition, takes the complementary approach. It focuses on decomposing the computation itself into a collection of distinct tasks that can be executed concurrently.30 These tasks may perform different operations and can work on the same or different data. This model is often characterized by asynchronous execution, where tasks run independently and communicate with each other as needed to exchange data or synchronize.32
A classic example of task parallelism is a modern web server. One task might be responsible for listening for incoming network connections, another for fetching data from a database, a third for executing business logic, and a fourth for rendering the HTML response page.39 These tasks are functionally distinct and can run in parallel to serve multiple user requests simultaneously. Another example is a media player, where one task is dedicated to decoding the video stream while a separate, concurrent task decodes the audio stream.39
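
The request-handling pattern can be sketched with Python threads as follows; the functions, timings, and page contents are purely hypothetical stand-ins for the database query, service call, and rendering stages described above. Because the two fetches are independent, the overall latency approaches that of the slower task rather than the sum of both.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Functionally distinct tasks: each performs a different operation.
def fetch_user(user_id):
    time.sleep(0.1)                      # stand-in for a database query
    return {"id": user_id, "name": "Ada"}

def fetch_recommendations(user_id):
    time.sleep(0.1)                      # stand-in for a call to another service
    return ["item-1", "item-2"]

def render_page(user, recommendations):
    items = "".join(f"<li>{r}</li>" for r in recommendations)
    return f"<h1>{user['name']}</h1><ul>{items}</ul>"

def handle_request(user_id):
    with ThreadPoolExecutor() as pool:
        # The two independent tasks run concurrently; rendering waits on both.
        user_f = pool.submit(fetch_user, user_id)
        recs_f = pool.submit(fetch_recommendations, user_id)
        return render_page(user_f.result(), recs_f.result())

if __name__ == "__main__":
    print(handle_request(42))
```
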
Strengths:

  • Flexibility: Task parallelism is highly flexible and can be applied to a wide variety of complex problems, especially those with functionally distinct and independent sub-problems.37
  • Improved Resource Utilization: By assigning different types of tasks to different processors, it is possible to keep all available computational resources busy, even if the problem does not have a regular, data-parallel structure.40

Weaknesses:

  • Complexity: Managing and scheduling tasks, especially when there are complex dependencies between them, significantly increases the complexity of the program. The programmer is often responsible for identifying tasks, managing their dependencies, and ensuring correct synchronization.37
  • Load Balancing: Achieving good load balancing is much more difficult than in data parallelism. Tasks may have different and unpredictable execution times, requiring sophisticated dynamic scheduling algorithms to distribute work evenly and prevent processors from becoming idle (a minimal work-queue scheduler is sketched after this list).32
  • Communication: Inter-task communication can be complex and a major source of overhead. It can also introduce subtle programming errors such as race conditions and deadlocks if not managed carefully.37
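
One common response to unpredictable task durations is dynamic scheduling through a shared work queue: idle workers pull the next task themselves, so faster workers automatically take on more of the work. The sketch below illustrates this in Python with threads, a queue, and a lock guarding shared state; all names, counts, and timings are illustrative assumptions.

```python
import queue
import random
import threading
import time

def worker(tasks, results, lock):
    while True:
        task = tasks.get()
        if task is None:                 # sentinel: no more work for this worker
            return
        duration, payload = task
        time.sleep(duration)             # tasks of unpredictable length
        with lock:                       # guard shared state against race conditions
            results.append(payload)

if __name__ == "__main__":
    tasks, results, lock = queue.Queue(), [], threading.Lock()
    workers = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(4)]
    for t in workers:
        t.start()
    # Idle workers pull the next task themselves, so faster workers
    # automatically end up doing more of the work (dynamic load balancing).
    for i in range(20):
        tasks.put((random.uniform(0.0, 0.05), i))
    for _ in workers:
        tasks.put(None)                  # one sentinel per worker
    for t in workers:
        t.join()
    print(sorted(results))
```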

In practice, few large-scale applications are purely data-parallel or purely task-parallel. Most fall somewhere on a continuum between the two extremes.32 Moreover, the two strategies can be combined in a hybrid parallelism approach to exploit parallelism at multiple levels.
A prime example of a hybrid approach is found in global climate modeling. The simulation space (Earth’s atmosphere and oceans) is typically decomposed into a massive 3D grid. The computation of physical quantities like temperature and pressure at each grid point is performed using data parallelism. Concurrently, different, functionally distinct models—such as one for atmospheric dynamics and another for ocean currents—are executed as separate tasks that interact with each other, representing task parallelism.32
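
A toy sketch of this hybrid structure, loosely inspired by the climate example but not based on any real climate code, might look as follows: two functionally distinct model steps run as separate tasks, and each step updates its own grid with a data-parallel map. All update rules, grid sizes, and names are invented for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def atmosphere_cell(t):
    return t - 0.1                       # toy update rule for one atmosphere cell

def ocean_cell(t):
    return t + 0.05                      # a different toy rule for one ocean cell

def atmosphere_step(grid, pool):
    # Data parallelism: the same update is applied to every cell of this model's grid.
    return list(pool.map(atmosphere_cell, grid, chunksize=100))

def ocean_step(grid, pool):
    return list(pool.map(ocean_cell, grid, chunksize=100))

if __name__ == "__main__":
    atmosphere, ocean = [15.0] * 1_000, [10.0] * 1_000
    with ProcessPoolExecutor() as data_pool, ThreadPoolExecutor(max_workers=2) as task_pool:
        # Task parallelism: two functionally distinct models advance concurrently.
        atm = task_pool.submit(atmosphere_step, atmosphere, data_pool)
        ocn = task_pool.submit(ocean_step, ocean, data_pool)
        atmosphere, ocean = atm.result(), ocn.result()
    print(atmosphere[0], ocean[0])
```
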
The initial choice between a data-parallel and a task-parallel strategy is a pivotal design decision that fundamentally shapes the engineering challenges to follow. A developer opting for data parallelism is immediately confronted with problems of data distribution, partitioning, and the efficiency of the final collective communication step needed to aggregate results. The primary focus becomes managing data locality and minimizing communication volume. Conversely, a developer choosing task parallelism must grapple with identifying independent functions, managing complex task dependency graphs, designing effective dynamic schedulers, and implementing safe and efficient inter-task communication and synchronization mechanisms.41 The challenges in data parallelism are often about scale and bandwidth, while the challenges in task parallelism are about complexity and dependency management.
Despite their differences, both strategies are governed by the unifying concept of granularity, defined as the ratio of computation to communication.1 In either model, the “pieces of work” must be large enough to amortize the overhead associated with them. In data parallelism, this is achieved through agglomeration, where fine-grained data elements are grouped into larger chunks to reduce the relative cost of communication.42 In task parallelism, this means defining tasks that perform a substantial amount of computation before needing to communicate or synchronize. An algorithm that is too fine-grained, whether composed of tiny data chunks or tiny tasks, will be overwhelmed by overhead and will not perform well on real hardware. Finding the optimal granularity is thus the critical tuning knob that connects the abstract design strategy to practical performance.
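
The effect of granularity can be demonstrated with a small experiment, again assuming Python's concurrent.futures: the same computation is run once as one tiny task per element and once agglomerated into a handful of large chunks. The absolute timings depend on the machine, but the fine-grained version is typically far slower because every element pays the full scheduling and serialization overhead.

```python
import time
from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

def square_chunk(chunk):
    # Agglomeration: one task processes a whole block of elements, so the
    # per-task overhead is amortized over many computations.
    return [x * x for x in chunk]

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    data = list(range(50_000))
    chunks = [data[i:i + 5_000] for i in range(0, len(data), 5_000)]
    with ProcessPoolExecutor() as pool:
        # Fine-grained: one tiny task per element; scheduling overhead dominates.
        timed("per-element", lambda: list(pool.map(square, data)))
        # Coarse-grained: ten large tasks; overhead is amortized.
        timed("per-chunk", lambda: list(pool.map(square_chunk, chunks)))
```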

| Feature | Data Parallelism | Task Parallelism |
| --- | --- | --- |
| Core Concept | Distribute the data across processors; each performs the same operation. | Distribute the computation (tasks) across processors; each performs a different operation. |
| Alternate Name | Domain Decomposition | Functional Decomposition |
| Computation Model | Typically synchronous; processors execute the same code (SPMD). | Typically asynchronous; processors execute different code. |
| Synchronization | Often occurs in a collective, bulk-synchronous manner at the end of computational phases. | Occurs as needed between individual tasks to enforce dependencies. |
| Key Challenge | Efficient data partitioning, minimizing communication during result aggregation, load balancing with non-uniform data. | Managing complex task dependencies, dynamic scheduling for load balancing, avoiding deadlocks. |
| Scalability Driver | Amount of parallelism is proportional to the input data size. | Amount of parallelism is proportional to the number of independent tasks. |
| Ideal Use Cases | Image processing, matrix operations, scientific simulations on regular grids, large-scale data processing (e.g., search). | Web servers, GUI applications, pipeline processing, complex workflows with functionally distinct modules. |

Table 8.2.1: Comparative Analysis of Data and Task Parallelism. This table provides a side-by-side comparison of the two core parallel design strategies, highlighting their fundamental differences in approach, typical execution models, primary engineering challenges, and ideal application domains.32