- Parallel and Distributed Processing in Structural Engineering
- Low cost CPU–GPGPU parallel computing in real-world structural engineering - 中国知网
Barriers are typically implemented using a lock or a semaphore. Lock-free and wait-free algorithms avoid locks and barriers altogether; however, this approach is generally difficult to implement and requires correctly designed data structures.
Not all parallelization results in speed-up. Generally, as a task is split up into more and more threads, those threads spend an ever-increasing portion of their time communicating with each other or waiting on each other for access to resources. This problem, known as parallel slowdown, can be improved in some cases by software analysis and redesign.
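The remark that barriers are typically built from a lock or a semaphore can be made concrete with a minimal sketch. The class `SimpleBarrier` and the helper `run_demo` are hypothetical names introduced here for illustration; the barrier is single-use and is not meant as a replacement for Python's own `threading.Barrier`.

```python
import threading


class SimpleBarrier:
    """Single-use barrier built from a lock and a semaphore (illustrative only)."""

    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.lock = threading.Lock()               # protects the arrival counter
        self.turnstile = threading.Semaphore(0)    # closed until everyone arrives

    def wait(self):
        with self.lock:
            self.count += 1
            last = self.count == self.parties
        if last:
            # The last arrival opens the turnstile once per participating thread.
            for _ in range(self.parties):
                self.turnstile.release()
        self.turnstile.acquire()


def run_demo(n_threads=4):
    """Each thread logs 'before', waits at the barrier, then logs 'after'."""
    barrier = SimpleBarrier(n_threads)
    log = []
    log_lock = threading.Lock()

    def worker(i):
        with log_lock:
            log.append(("before", i))
        barrier.wait()
        with log_lock:
            log.append(("after", i))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the returned log, every "before" entry precedes every "after" entry, because no thread passes the turnstile until the last one has arrived.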
Applications are often classified according to how often their subtasks need to synchronize or communicate with each other. An application exhibits fine-grained parallelism if its subtasks must communicate many times per second; it exhibits coarse-grained parallelism if they do not communicate many times per second, and it exhibits embarrassing parallelism if they rarely or never have to communicate.
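As a rough illustration of embarrassing parallelism, the sketch below runs subtasks that never communicate with each other. The function `simulate_case` is a hypothetical stand-in for any independent computation, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_case(load):
    """Hypothetical independent subtask: the result depends only on its own input."""
    return sum(load * k for k in range(1000))


def run_all(loads, workers=4):
    # The subtasks share no state and never communicate; map() keeps input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate_case, loads))
```

In CPython, threads show the structure of such a program rather than a CPU speed-up; swapping in `ProcessPoolExecutor` from the same module is the usual substitute for compute-bound work.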
Embarrassingly parallel applications are considered the easiest to parallelize.
Parallel programming languages and parallel computers must have a consistency model (also known as a memory model). The consistency model defines rules for how operations on computer memory occur and how results are produced.
One of the first consistency models was Leslie Lamport's sequential consistency model.
Sequential consistency is the property of a parallel program that its parallel execution produces the same results as a sequential program. Specifically, a program is sequentially consistent if "the results of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program".
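One way to read this definition is to enumerate every sequential order that preserves each processor's program order and collect the possible results. The sketch below does so for a small two-processor example chosen here for illustration (it is not taken from the text): each processor writes one variable and then reads the other.

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's internal order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


# Each operation is (kind, variable, value-or-register).
P0 = [("write", "x", 1), ("read", "y", "r0")]
P1 = [("write", "y", 1), ("read", "x", "r1")]


def run(schedule):
    """Execute one sequential order against a single shared memory."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "write":
            memory[var] = arg
        else:
            regs[arg] = memory[var]
    return regs["r0"], regs["r1"]


def sc_outcomes():
    """All (r0, r1) results allowed under sequential consistency."""
    return {run(s) for s in interleavings(P0, P1)}
```

Under this model the outcome `(0, 0)` never appears: in any single sequential order at least one write precedes the other processor's read.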
Software transactional memory is a common type of consistency model. Software transactional memory borrows from database theory the concept of atomic transactions and applies them to memory accesses.
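The idea of applying atomic transactions to memory accesses can be sketched with a toy optimistic scheme. Everything here (`TVar`, `atomically`, `transfer`, and the single global commit lock) is a hypothetical simplification for illustration and does not reflect any particular STM library.

```python
import threading


class TVar:
    """A transactional variable: a value plus a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0


_commit_lock = threading.Lock()  # one global lock serialises commits (simplification)


def atomically(transaction):
    """Run transaction(read, write) repeatedly until it commits without conflict."""
    while True:
        read_versions = {}   # TVar -> version seen when first read
        write_buffer = {}    # TVar -> value to install on commit

        def read(tvar):
            if tvar in write_buffer:
                return write_buffer[tvar]
            read_versions.setdefault(tvar, tvar.version)
            return tvar.value

        def write(tvar, value):
            write_buffer[tvar] = value

        result = transaction(read, write)
        with _commit_lock:
            # Commit only if nothing we read has changed in the meantime.
            if all(t.version == v for t, v in read_versions.items()):
                for t, value in write_buffer.items():
                    t.value = value
                    t.version += 1
                return result
        # Conflict detected: discard buffered writes and retry.


def transfer(src, dst, amount):
    """Move amount from src to dst as one atomic transaction."""
    def tx(read, write):
        write(src, read(src) - amount)
        write(dst, read(dst) + amount)
    atomically(tx)
```

Writes stay in a private buffer until commit, so other threads never observe a half-finished transfer; a transaction that read stale data is simply re-run.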
Mathematically, these models can be represented in several ways. Petri nets were an early attempt to codify the rules of consistency models. Dataflow theory later built upon these, and dataflow architectures were created to physically implement the ideas of dataflow theory. Process calculi such as the Calculus of Communicating Systems and Communicating Sequential Processes were later developed to permit algebraic reasoning about systems composed of interacting components.
Michael J. Flynn created one of the earliest classification systems for parallel and sequential computers and programs, now known as Flynn's taxonomy.
Flynn classified programs and computers by whether they were operating using a single set or multiple sets of instructions, and whether or not those instructions were using a single set or multiple sets of data. The single-instruction-single-data (SISD) classification is equivalent to an entirely sequential program. The single-instruction-multiple-data (SIMD) classification is analogous to doing the same operation repeatedly over a large data set. This is commonly done in signal processing applications.
Multiple-instruction-single-data (MISD) is a rarely used classification. While computer architectures to deal with this were devised (such as systolic arrays), few applications that fit this class materialized. Multiple-instruction-multiple-data (MIMD) programs are by far the most common type of parallel programs. According to David A. Patterson and John L. Hennessy, "Some machines are hybrids of these categories, of course, but this classic model has survived because it is simple, easy to understand, and gives a good first approximation.
It is also, perhaps because of its understandability, the most widely used scheme."
In the years following the advent of very-large-scale integration (VLSI) computer-chip fabrication technology, speed-up in computer architecture was driven by doubling computer word size, the amount of information the processor can manipulate per cycle. Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit processors, which have been a standard in general-purpose computing for two decades.
Only later, with the advent of x86-64 architectures, did 64-bit processors become commonplace.
A computer program is, in essence, a stream of instructions executed by a processor. A processor without instruction-level parallelism issues less than one instruction per clock cycle; such processors are known as subscalar processors. The instructions can, however, be re-ordered and combined into groups which are then executed in parallel without changing the result of the program.
This is known as instruction-level parallelism. Advances in instruction-level parallelism dominated computer architecture for roughly a decade.
All modern processors have multi-stage instruction pipelines, which allow up to one instruction to be issued per clock cycle; such processors are known as scalar processors. The Pentium 4 processor, for example, had an especially deep pipeline. Most modern processors also have multiple execution units and, combined with pipelining, can issue more than one instruction per clock cycle; these processors are known as superscalar processors. Instructions can be grouped together only if there is no data dependency between them.
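The grouping rule in the last sentence can be sketched as a simple in-order pass over an instruction list. The function `issue_groups` and the sample `program` are hypothetical, and the sketch is neither scoreboarding nor the Tomasulo algorithm: it only closes a group whenever the next instruction has a data dependency on something already in it.

```python
def issue_groups(program):
    """Split an in-order instruction list into groups with no internal data dependency.

    Each instruction is (dest, sources). An instruction joins the current group
    unless it reads or writes a register written in the group, or writes a
    register that the group reads.
    """
    groups, current = [], []
    written, read = set(), set()
    for dest, sources in program:
        hazard = (
            any(s in written for s in sources)   # read-after-write
            or dest in written                   # write-after-write
            or dest in read                      # write-after-read
        )
        if hazard and current:
            groups.append(current)
            current, written, read = [], set(), set()
        current.append((dest, sources))
        written.add(dest)
        read.update(sources)
    if current:
        groups.append(current)
    return groups


program = [
    ("r1", ("a", "b")),    # r1 = a + b
    ("r2", ("c", "d")),    # r2 = c + d    (independent of r1)
    ("r3", ("r1", "r2")),  # r3 = r1 * r2  (needs both earlier results)
    ("r4", ("e", "f")),    # r4 = e - f    (independent of r3)
]
```

For this sample, the first two instructions form one group and the last two form another, since `r3` must wait for `r1` and `r2`.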
Scoreboarding and the Tomasulo algorithm (which is similar to scoreboarding but makes use of register renaming) are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism.
Task parallelism is the characteristic of a parallel program that "entirely different calculations can be performed on either the same or different sets of data".
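A minimal sketch of this definition, assuming three arbitrary calculations chosen for illustration, runs entirely different functions on the same data set at the same time. The name `run_tasks` is hypothetical.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor


def run_tasks(data):
    # Hypothetical sub-tasks: three different calculations on the same data.
    tasks = {"min": min, "max": max, "mean": statistics.mean}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}
```

Note that the number of sub-tasks here is fixed by the program rather than by the amount of data, so adding more data does not create more parallel work.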
Task parallelism involves the decomposition of a task into sub-tasks and then allocating each sub-task to a processor for execution. The processors would then execute these sub-tasks concurrently and often cooperatively. Task parallelism does not usually scale with the size of a problem.
Main memory in a parallel computer is either shared memory (shared between all processing elements in a single address space), or distributed memory (in which each processing element has its own local address space).
Distributed shared memory and memory virtualization combine the two approaches, where each processing element has its own local memory and access to the memory on non-local processors. Accesses to local memory are typically faster than accesses to non-local memory. On supercomputers, a distributed shared memory space can be implemented using a programming model such as partitioned global address space (PGAS).
This model allows processes on one compute node to transparently access the remote memory of another compute node. Computer architectures in which each element of main memory can be accessed with equal latency and bandwidth are known as uniform memory access (UMA) systems.
Typically, that can be achieved only by a shared memory system, in which the memory is not physically distributed. A system that does not have this property is known as a non-uniform memory access (NUMA) architecture. Distributed memory systems have non-uniform memory access.
Computer systems make use of caches: small and fast memories located close to the processor which store temporary copies of memory values (nearby in both the physical and logical sense). Parallel computer systems have difficulties with caches that may store the same value in more than one location, with the possibility of incorrect program execution.
These computers require a cache coherency system, which keeps track of cached values and strategically purges them, thus ensuring correct program execution. Bus snooping is one of the most common methods for keeping track of which values are being accessed and thus should be purged. Designing large, high-performance cache coherence systems is a very difficult problem in computer architecture.
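The purge-on-write behaviour described above can be illustrated with a toy write-invalidate model. The classes `Bus` and `Cache` are hypothetical, and the write-through policy is a simplifying assumption; real coherence protocols track more states than this.

```python
class Bus:
    """Shared bus: holds main memory and lets caches snoop each other's writes."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def broadcast_invalidate(self, sender, addr):
        for cache in self.caches:
            if cache is not sender:
                cache.lines.pop(addr, None)   # purge the stale copy, if present


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}                        # addr -> cached value
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:             # miss: fetch from main memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(self, addr)   # other copies become invalid
        self.lines[addr] = value
        self.bus.memory[addr] = value               # write-through (simplification)
```

After one cache writes an address, any other cache that held it must miss and re-fetch, so it cannot return the old value.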
Because large cache coherence systems are so difficult to design, shared memory computer architectures do not scale as well as distributed memory systems do.
Processor–processor and processor–memory communication can be implemented in hardware in several ways, including via shared (either multiported or multiplexed) memory, a crossbar switch, a shared bus or an interconnect network of a myriad of topologies including star, ring, tree, hypercube, fat hypercube (a hypercube with more than one processor at a node), or n-dimensional mesh.
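As an illustration of one of the listed topologies, the sketch below routes a message through a hypercube, where most node pairs are not directly connected. It assumes nodes are labelled with integers and uses dimension-order routing, one common choice among several; `hypercube_route` is a hypothetical name.

```python
def hypercube_route(src, dst, dims):
    """Return the node sequence from src to dst in a dims-dimensional hypercube.

    Nodes are integers in [0, 2**dims); two nodes are directly connected when
    their labels differ in exactly one bit. The route corrects the differing
    bits from lowest to highest (dimension-order routing).
    """
    if not (0 <= src < 2 ** dims and 0 <= dst < 2 ** dims):
        raise ValueError("node label out of range")
    path = [src]
    node = src
    for bit in range(dims):
        if (node ^ dst) & (1 << bit):
            node ^= 1 << bit       # hop across one dimension
            path.append(node)
    return path
```

The number of hops equals the number of bits in which the two labels differ, so no route in a `dims`-dimensional hypercube needs more than `dims` hops.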
Parallel computers based on interconnected networks need to have some kind of routing to enable the passing of messages between nodes that are not directly connected. The medium used for communication between the processors is likely to be hierarchical in large multiprocessor machines. Parallel computers can be roughly classified according to the level at which the hardware supports parallelism.
This classification is broadly analogous to the distance between basic computing nodes. These classes are not mutually exclusive; for example, clusters of symmetric multiprocessors are relatively common.
A multi-core processor is a processor that includes multiple processing units (called "cores") on the same chip. This processor differs from a superscalar processor, which includes multiple execution units and can issue multiple instructions per clock cycle from one instruction stream (thread); in contrast, a multi-core processor can issue multiple instructions per clock cycle from multiple instruction streams.
Each core in a multi-core processor can potentially be superscalar as well—that is, on every clock cycle, each core can issue multiple instructions from one thread.
Simultaneous multithreading (of which Intel's Hyper-Threading is the best known) was an early form of pseudo-multi-coreism. A processor capable of concurrent multithreading includes multiple execution units in the same processing unit (that is, it has a superscalar architecture) and can issue multiple instructions per clock cycle from multiple threads. Temporal multithreading, on the other hand, includes a single execution unit in the same processing unit and can issue one instruction at a time from multiple threads.
A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a bus.
A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable. The terms "concurrent computing", "parallel computing", and "distributed computing" have a lot of overlap, and no clear distinction exists between them.
A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer. While machines in a cluster do not have to be symmetric, load balancing is more difficult if they are not. Because grid computing systems (described below) can easily handle embarrassingly parallel problems, modern clusters are typically designed to handle more difficult problems: those that require nodes to share intermediate results with each other more often.
This requires a high-bandwidth and, more importantly, a low-latency interconnection network. Many historic and current supercomputers use customized high-performance network hardware specifically designed for cluster computing, such as the Cray Gemini network.
A massively parallel processor (MPP) is a single computer with many networked processors. MPPs have many of the same characteristics as clusters, but MPPs have specialized interconnect networks whereas clusters use commodity hardware for networking. Each subsystem communicates with the others via a high-speed interconnect.
Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the Internet to work on a given problem. Because of the low bandwidth and extremely high latency available on the Internet, distributed computing typically deals only with embarrassingly parallel problems. Many distributed computing applications have been created, of which SETI@home and Folding@home are the best-known examples.
Most grid computing applications use middleware (software that sits between the operating system and the application to manage network resources and standardize the software interface). Often, distributed computing software makes use of "spare cycles", performing computations at times when a computer is idling.
Within parallel computing, there are specialized parallel devices that remain niche areas of interest. While not domain-specific, they tend to be applicable to only a few classes of parallel problems.