Assignment for Parallel Computing

assignment 01
Experiment 1: Parallel Eratosthenes Sieve (MPI)
Parallel Floyd Algorithm
1. MPI 点对点通信

assignment 01

1 Work and Span

一个二重循环,
递推关系: . 因此 .
. 解得 .
, , 则 .
, , 则 ,

2 Y++ Language

给定一个仅含有的括号序列判断是否匹配.

设计一个时间复杂度为的串行算法, 要求只能使用的额外的空间.

SOL 首先想到的做法是维护一个栈, 从括号序列的结尾开始遍历, 如果是右括号则入栈; 如果是左括号则弹出栈顶元素, 得到一对匹配的括号, 当串空时栈也空, 该括号序列匹配, 反之不匹配. 这个要求额外开的栈空间, 不满足题意.

但是借此思路可以不存括号而是对括号计数, 即定义右括号数目的变量, 从括号序列的结尾开始遍历, 遇到右括号自增, 遇到左括号自减, 串空时计数变量为 0 则匹配, 反之不匹配.

设计用来计数一个括号序列中左括号 ( 和右括号 ) 的个数的并行算法, 可以使用额外的的空间, 保证的 work 和的 span.

SOL 类似于 reduce, 将整个序列分成两部分分别并行地计数其左右括号数, 再 join 起来即可.

def count(P,n):
    if n==0:
        return [0,0]
    if n==1:
        if P[0]='(':
            return [1,0]
        else:
            return [0,1]
    par_do:
        cnt1=count(P,n/2)
        cnt2=count(P+n/2,n-n/2)
   	return [cnt1[0]+cnt2[0],cnt1[1]+cnt2[1]]

满足 , .

将 (1) 中的算法并行化, 可以使用额外的的空间, 保证的 work 和的 span.

SOL 分支, 将序列分成两份, 分别用一个计数变量进行 (1) 中的计数, 最后将两者加起来返回即可.

int match(int *P,int n,int threhold){
    if(n==0)return 0;
    int cnt=0;
    if (n<=threhold){
        for(int i=n-1;i>=0;i--){
            if(P[i]==')')cnt++;
            if(P[i]=='(')cnt--;
        }
        return cnt;
    }
    par_do:
    	int cnt1=match(P,n/2);
    	int cnt2=match(P+n/2,n-n/2);
    return cnt1+cnt2;
}

仍然满足 , .

得到 .

3 Compute the Integral

采用分区间的方式计算积分 (1) 如果直接将

for (size_t i = 0; i < n; i++){
    double x = low + (high - low) * i / n; // 用的是每个小区间的左端点
    double dx = (high - low) / n;
    ans += f(x) * dx;
}

改成并行的 parallel_for:

auto cal = [&](size_t i) {
    double x = low + (high - low) * i / n;
    double dx = (high - low) / n;
    ans += f(x) * dx;
};
parallel_for(0, n, cal);

当时, 两者的结果分别为

Total sum: 0.3740279091741
Warmup round running time: 1.875804
Round 1 running time: 2.020929
Round 2 running time: 2.056737
Round 3 running time: 1.742642
Average running time: 1.940102666667
Total sum: 0.1339990181118
Warmup round running time: 1.995736
Round 1 running time: 2.017125
Round 2 running time: 2.14402
Round 3 running time: 2.092346
Average running time: 2.084497

时间上并没有明显改变, 并且结果是错的.

原因很明显, 出现了 race, 如果两个进程同时执行 ans+=f(x)*dx, 这两个进程同时读到相同的 ans 并将自己的增量加到 ans 上, 由于没有互斥, 最终一个进程的结果会覆盖掉另一个进程的结果, 导致最终积分的结果仅仅是部分项的和. 还有就是这里每一项的计算 overhead 仍然不大, 因此 fork 的 overhead 还是处于支配地位.

解决上面问题最直接的方法是加锁.

#include<mutex>
std::mutex ans_mutex;
auto cal = [&](size_t i) {
    double x = low + (high - low) * i / n;
    double dx = (high - low) / n;
    std::lock_guard<std::mutex> lock(ans_mutex);
    ans += f(x) * dx;
};

这会破坏 parallelism, 下面是新的结果

Total sum: 0.3740279091741
Warmup round running time: 1.725927
Round 1 running time: 1.718653
Round 2 running time: 1.720002
Round 3 running time: 1.717201
Average running time: 1.718618666667
Total sum: 0.3740279091742
Warmup round running time: 10.593743
Round 1 running time: 10.475321
Round 2 running time: 10.407566
Round 3 running time: 10.49787
Average running time: 10.46025233333

时间变成了原来(串行)的五倍.

设计新的并行算法.

这就是计算项和, 所以可以直接使用 reduce 算法, 同时增加粒度的控制:

template <class Func>
double integral_reduce(const Func &f, size_t n, double low, double high, size_t threhold){
    auto rect = [&](size_t i) -> double{
        double x = low + (high - low) * i / n;
        double dx = (high - low) / n;
        return f(x) * dx;
    };
    if (n < threhold){
        double ans = 0;
        for (size_t i = 0; i < n; i++){
            ans += rect(i);
        }
        return ans;
    }
    else{
        double v1, v2;
        auto f1 = [&](){
            v1 = integral_reduce(f, n / 2, low, low + (high - low) * (n / 2) / n, threhold);
        };
        auto f2 = [&](){
            v2 = integral_reduce(f, n - n / 2, low + (high - low) * (n / 2) / n, high, threhold);
        };
        par_do(f1, f2);
        return v1 + v2;
    }
}

, 粒度为 , 8个线程时偶尔可以达到3秒内:

Total sum: 0.3740279120104
Warmup round running time: 2.975084
Round 1 running time: 2.995158
Round 2 running time: 2.970043
Round 3 running time: 2.956155
Average running time: 2.973785333333

粒度调细一些可能会更好.

再提高速度和精度!

上面的结果精度不够, 提升精度的一个很显然的方法就是用梯形代替矩形:

auto trape = [&](size_t i) -> double
{
    double l = low + (high - low) * i / n; 
    double r=low + (high - low) * (i+1) / n;
    double dx = (high - low) / n;
    return (f(l)+f(r)) * dx/2;
};

或者更简单的写法:

auto trape = [&](size_t i) -> double
{
    double l = low + (high - low) * i / n; 
    double dx = (high - low) / n;
    return (f(l)+f(l+dx)) * dx/2;
};

这样很容易就达到了要求的精度, 但是计算梯形的面积增加了 overhead, 时间变成原来的两倍了:

Total sum: 0.3740279123255
Warmup round running time: 7.468748
Round 1 running time: 7.875462
Round 2 running time: 8.723461
Round 3 running time: 7.854643
Average running time: 8.151188666667

很容易发现开销主要在除之外的点都被算了两遍, 而且 dx 只需要计算一次. 同时由于开不了大小为的数组存下所有的点(这样写还需要考虑区间的起始下标), 因此可以算完第个梯形的面积后把下底传出作为第个梯形的上底, 保证区间上的每个点只被计算一次, 如下:

double dx = (high - low) / n;
auto trape = [&](double &ls, double r) -> double {
    double rs = f(r);
    double res = (ls + rs) * dx / 2;
    ls = rs;
    return res;
};
if (n < threhold) {
    double ans = 0, ls = f(low), r = low;
    for (size_t i = 0; i < n; i++) {
        r += dx;
        ans += trape(ls, r);
    }
    return ans;
}

最终 , 8 个线程, threhold=200 的结果如下:

Total sum: 0.3740279335276
Warmup round running time: 10.416636
Round 1 running time: 10.530778
Round 2 running time: 10.372824
Round 3 running time: 10.660854
Average running time: 10.52148533333
Total sum: 0.3740279123255
Warmup round running time: 2.727995
Round 1 running time: 2.793685
Round 2 running time: 2.812348
Round 3 running time: 2.778975
Average running time: 2.795002666667

第一个结果是串行计算, 第二个结果是并行计算, 两者都是按照梯形计算的.

为什么串行计算的结果偏大 ?

怎么搞到 2s 内 ?

REF :

[1] CS214 of UCR

[2] dropbox

Experiment 1: Parallel Eratosthenes Sieve (MPI)

串行的埃筛如下 (筛区间)

(第一个未被标记的数)
当时标记中所有的倍数转 3; 反之结束, 所有未被标记的数为素数.
$下一个未被标记的数$ , 这一定是当前的下一个素数.

复杂度为 .

为什么只需要标记中所有的倍数? 这意味着中的合数已经被小于的素数标记过了, 这是显然的, 因为对每一个且为合数, 一定有小于的素因子, 否则 .

复杂度的证明:

参考当数论遇上分析（3）——数论函数的加权平均、切比雪夫定理以及埃氏筛 - 知乎 (zhihu.com)

Mertens 定理 - 香蕉空间 (bananaspace.org)

黎曼－斯蒂尔杰斯积分有什么存在的意义吗？ - 知乎 (zhihu.com)

数论函数(1) - 知乎 (zhihu.com)

筛法 - OI Wiki (oi-wiki.org)

需要分别证明 Mertens 第一定理和 Mertens 第二定理 .

这是典型的 RS 积分的应用, 同时 RS 积分和 Abel 变换又是一脉相承的.

ok 先放一下, 先看代码.

MPI Something

int MPI_Bcast(
    void *buffer,          /* Addr of 1st element */
    int count,             /* # elements to broadcast */
    MPI_Datatype datatype, /* Type of elements */
    int root,              /* ID of root process */
    MPI_Comm comm)         /* Communicator */
MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);

并行思路

很容易能发现标记的过程可以并行, 可以分块, 每个处理器分别负责不同区间的标记, 设有个处理器:

希望每个区间的大小为或保证负载均衡
当时, 直接均分即可
当时, 记 , 则 , 有个长度为的区间, 有个长度为的区间. 然后需要求出区间的起始元素的下标, 这有很多组织方式, 需要找出较为表达式简单的那一种.
区间的第一个元素为 , 最后一个元素为 . 显然总的元素的个数为同时 , 其中 , 注意到那么两不等式相加有显然即上的整数仅有两个 (也可能是一个), 那么就一定有 , 这就意味着每个区间的元素个数要么为 , 要么为 , 这就是我们所希望的.
对于元素而言, 它所属的区间为 .

设个块为 , 块的长度为且元素为 . 现在探究一下具体并行标记的过程.

从块首元素开始枚举, 找到第一个的倍数标记, 接下来下标每次加标记即可.

样例代码, 先不管.

#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include "MyMPI.h"
#define MIN(a, b) ((a) < (b) ? (a) : (b))
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);
    elapsed_time = -MPI_Wtime();
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    if (argc != 2)
    {
        if (!id)
            printf("Command line: %s <m>\n", argv[0]);
        MPI_Finalize();
        exit(1);
    }
    n = atoi(argv[1]);
    low_value = 2 + BLOCK_LOW(id, p, n - 1);
    high_value = 2 + BLOCK_HIGH(id, p, n - 1);
    size = BLOCK_SIZE(id, p, n - 1);
    proc0_size = (n - 1) / p;
    if ((2 + proc0_size) < (int)sqrt((double)n))
    {
        if (!id)
            printf("Too many processes\n");
        MPI_Finalize();
        exit(1);
    }
    marked = (char *)malloc(size);
    if (marked == NULL)
    {
        printf("Cannot allocate enough memory\n");
        MPI_Finalize();
        exit(1);
    }
    for (i = 0; i < size; i++)
        marked[i] = 0;
    if (!id)
        index = 0;
    prime = 2;
    do
    {
        if (prime * prime > low_value)
            first = prime * prime - low_value;
        else
        {
            if (!(low_value % prime))
                first = 0;
            else
                first = prime - (low_value % prime);
        }
        for (i = first; i < size; i += prime)
            marked[i] = 1;
        if (!id)
        {
            while (marked[++index])
                ; /*find the smallest unmarked*/
            prime = index + 2;
        }
        MPI_Bcast(&prime, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } while (prime * prime <= n);
    count = 0;
    for (i = 0; i < size; i++)
        if (!marked[i])
            count++;
    MPI_Reduce(&count, &global_count, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);
    elapsed_time += MPI_Wtime();
    if (!id)
    {
        printf("%d primes are less than or equal to %d\n",
               global_count, n);
        printf("Total elapsed time: %10.6f\n", elapsed_time);
    }
    MPI_Finalize();
    return 0;
}

优化

去掉偶数
优化通信
考虑 cache miss 重新组织循环

Parallel Floyd Algorithm

Floyd 算法计算的每一对顶点的最短路径的长度, 算法需要三重循环. 设表示之间的最短路径长度, 则枚举所有的 , 更新 , 初始化 , 其中为边的边权.

MPI 点对点通信

2024-01-25 该篇文章被 Baoduo Xu 归为分类: Parallel Computing

以上