<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Hey, Myrice! &#187; Parallel Computing</title>
	<atom:link href="http://myrice.me/category/parallel-computing/feed/" rel="self" type="application/rss+xml" />
	<link>http://myrice.me</link>
	<description>Something Really Belong to My Own</description>
	<lastBuildDate>Sat, 17 Dec 2011 16:58:40 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>CUDA Made Simple (CUDA深入浅出): Study Notes (1)</title>
		<link>http://myrice.me/2010/07/cuda%e6%b7%b1%e5%85%a5%e6%b5%85%e5%87%ba%e5%ad%a6%e4%b9%a0%e5%b0%8f%e7%bb%931/</link>
		<comments>http://myrice.me/2010/07/cuda%e6%b7%b1%e5%85%a5%e6%b5%85%e5%87%ba%e5%ad%a6%e4%b9%a0%e5%b0%8f%e7%bb%931/#comments</comments>
		<pubDate>Fri, 30 Jul 2010 16:29:06 +0000</pubDate>
		<dc:creator>myrice</dc:creator>
				<category><![CDATA[CUDA]]></category>
		<category><![CDATA[Parallel Computing]]></category>
		<category><![CDATA[study]]></category>

		<guid isPermaLink="false">http://myrice.me/?p=119</guid>
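		<description><![CDATA[Parallel computation of a sum of squares 1. Baseline comparison: computing the sum of squares directly on the CPU, versus squaring on the GPU with one thread in one block and then summing on the CPU. With 10^7 elements the CPU handled it effortlessly (0 cycles measured), while the 9600M GS was painfully slow (I did not record the cycle count) 2. With 256 threads, GPU performance improved greatly 3. With 32 blocks, GPU performance improved again 4. At this point 32*256 = 8192 threads are running. The time the CPU spent on the final sum had been left out, which made the measurement inaccurate; moving the summation onto the GPU caused a small performance drop 5. The summation itself can be parallelized, as the tree diagram below shows: Roughly, the first pass adds element 1 into element 0 and element 3 into element 2. Because every thread is launched anyway, deciding which threads should perform an addition matters. A simple idea is to test whether the thread ID modulo twice the offset is 0. The code is as follows: __syncthreads(); // make sure all threads are complete extern __shared__ int share[]; int offset = 1; [...]]]></description>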
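			<content:encoded><![CDATA[<p>Parallel computation of a sum of squares<br />
1. Baseline comparison: computing the sum of squares directly on the CPU, versus squaring on the GPU with one thread in one block and then summing on the CPU. With 10^7 elements the CPU handled it effortlessly (0 cycles measured), while the 9600M GS was painfully slow (I did not record the cycle count)<br />
2. With 256 threads, GPU performance improved greatly<br />
3. With 32 blocks, GPU performance improved again<br />
4. At this point 32*256 = 8192 threads are running. The time the CPU spent on the final sum had been left out, which made the measurement inaccurate; moving the summation onto the GPU caused a small performance drop<br />
5. The summation itself can be parallelized, as the tree diagram below shows:</p>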
<p style="text-align: center;"><a href="http://myrice.me/wp-content/uploads/2010/07/treeCuda.png"><img class="size-medium wp-image-121 aligncenter" title="treeCuda" src="http://myrice.me/wp-content/uploads/2010/07/treeCuda-300x248.png" alt="" width="300" height="248" /></a></p>
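<p>Roughly, the first pass adds element 1 into element 0 and element 3 into element 2. Because every thread is launched anyway, deciding which threads should perform an addition matters. A simple idea is to test whether the thread ID modulo twice the offset is 0. The code is as follows:</p>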

<div class="wp_codebox"><table><tr id="p1194"><td class="code" id="p119code4"><pre class="c" style="font-family:monospace;">__syncthreads();  // make sure all threads have written their partial results
extern __shared__ int share[];  // one int per thread, sized at kernel launch
int offset = 1;
while (offset &lt; THREADNUM)  // THREADNUM is the number of threads per block (a power of two)
{
    if (threadIdx.x % (offset + offset) == 0)
        share[threadIdx.x] += share[threadIdx.x + offset];
    offset += offset;
    __syncthreads();  // finish this tree level before the next level reads it
}</pre></td></tr></table></div>
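
<p>To make the pairing in the tree diagram concrete, here is a minimal host-side C sketch that replays the same loop sequentially. It is only a CPU stand-in for the kernel fragment above: the function name tree_reduce_mod and the thread_num parameter (standing in for THREADNUM) are hypothetical, and the block size is assumed to be a power of two.</p>

```c
#include <assert.h>

/* CPU replay of the modulo-based tree reduction.
   thread_num stands in for THREADNUM and is assumed to be a power of two.
   The inner loop plays the role of the threads in one block; each pass of
   the outer loop is one tree level, which on the GPU would end with
   __syncthreads(). */
int tree_reduce_mod(int *share, int thread_num)
{
    assert(thread_num > 0 && (thread_num & (thread_num - 1)) == 0);
    for (int offset = 1; offset < thread_num; offset += offset) {
        for (int tid = 0; tid < thread_num; ++tid) {
            /* only every (2 * offset)-th thread adds its neighbour */
            if (tid % (offset + offset) == 0)
                share[tid] += share[tid + offset];
        }
    }
    return share[0];  /* the total ends up in element 0 */
}
```

<p>Calling tree_reduce_mod on the eight squares 1, 4, 9, ..., 64 with thread_num = 8 takes three levels (offset 1, 2, 4) and leaves 204 in element 0.</p>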

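<p>However, the running time did not go down; the modulo operation turned out to be too expensive. The CUDA深入浅出 book writes it like this instead:</p>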

<div class="wp_codebox"><table><tr id="p1195"><td class="code" id="p119code5"><pre class="c" style="font-family:monospace;">    __syncthreads();  // all partial sums must be in shared memory first
    int offset = 1, mask = 1;
    while (offset &lt; THREAD_NUM) {
        // mask == 2 * offset - 1, so this tests tid % (2 * offset) == 0 without a modulo
        if ((tid &amp; mask) == 0) {  // tid = threadIdx.x
            shared[tid] += shared[tid + offset];
        }
        offset += offset;
        mask = offset + mask;
        __syncthreads();  // finish this level before starting the next
    }</pre></td></tr></table></div>
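
<p>One way to read the mask update in the listing above: mask takes the values 1, 3, 7, ... (binary 1, 11, 111), so (tid &amp; mask) == 0 holds exactly when the low bits of tid are all zero, which is the same as tid being a multiple of 2 * offset. The following host-side C sketch checks that claim by brute force. The function name mask_matches_modulo is hypothetical, and thread_num (standing in for THREAD_NUM) is assumed to be a power of two.</p>

```c
#include <assert.h>

/* Returns 1 if, at every level, (tid & mask) == 0 selects exactly the
   threads that tid % (2 * offset) == 0 selects; returns 0 otherwise.
   thread_num stands in for THREAD_NUM and is assumed to be a power of two. */
int mask_matches_modulo(int thread_num)
{
    assert(thread_num > 0 && (thread_num & (thread_num - 1)) == 0);
    int offset = 1, mask = 1;
    while (offset < thread_num) {
        /* invariant: mask == 2 * offset - 1, i.e. the low bits 1, 11, 111 */
        if (mask != 2 * offset - 1)
            return 0;
        for (int tid = 0; tid < thread_num; ++tid) {
            int by_mask = (tid & mask) == 0;
            int by_mod = tid % (2 * offset) == 0;
            if (by_mask != by_mod)
                return 0;
        }
        offset += offset;      /* same update order as the kernel fragment */
        mask = offset + mask;
    }
    return 1;
}
```

<p>For a block of 256 threads the check passes at all eight levels, so the bitwise test is a drop-in replacement for the modulo test whenever the block size is a power of two.</p>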

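<p>The odd way mask is used here happens to produce exactly the right result, but I still do not fully understand it.<br />
6. Apparently this method has a problem. The author says he will come back to it later and switches to a simpler method instead: add the second half of the array onto the first half, then repeat on the half that remains.</p>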

<div class="wp_codebox"><table><tr id="p1196"><td class="code" id="p119code6"><pre class="c" style="font-family:monospace;">    offset = THREAD_NUM &gt;&gt; 1;  // start with half the block
    while (offset &gt; 0) {
        if (tid &lt; offset) {  // tid = threadIdx.x
            shared[tid] += shared[tid + offset];
        }
        offset &gt;&gt;= 1;  // halve the active range
        __syncthreads();
    }</pre></td></tr></table></div>
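
<p>The halving scheme can be replayed on the host in the same way. This is a sketch under the same assumptions as before (a power-of-two block size, with thread_num standing in for THREAD_NUM); the function name tree_reduce_halving is hypothetical.</p>

```c
#include <assert.h>

/* CPU replay of the halving reduction: at each level the first `offset`
   threads add the element `offset` positions to their right.
   thread_num stands in for THREAD_NUM and is assumed to be a power of two. */
int tree_reduce_halving(int *shared, int thread_num)
{
    assert(thread_num > 0 && (thread_num & (thread_num - 1)) == 0);
    for (int offset = thread_num >> 1; offset > 0; offset >>= 1) {
        /* the active threads are the contiguous range 0 .. offset - 1 */
        for (int tid = 0; tid < offset; ++tid)
            shared[tid] += shared[tid + offset];
    }
    return shared[0];  /* the total ends up in element 0 */
}
```

<p>On the eight squares 1, 4, 9, ..., 64 this also leaves 204 in element 0. The difference from the earlier version is which threads are active: here they are always the contiguous range tid &lt; offset, so no modulo or mask test is needed.</p>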

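<p>I am now reading the matrix multiplication part, and I still do not understand the difference between cudaMemcpy and cudaMemcpy2D. Isn't a 2D array in C laid out the same way as a 1D array, so that int a[N*M] is equivalent to int a[N][M]? In that case a single cudaMemcpy of N*M elements should be enough. I do not know why it is done differently.</p>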
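<p>On that question, the observation is right for a contiguous array: int a[N][M] has the same layout as int a[N*M], and one flat copy is enough. The 2D variant matters when rows are padded, for example memory obtained from cudaMallocPitch, where each row starts pitch bytes after the previous one and pitch may be larger than the row width. The sketch below is a hypothetical host-only helper, copy_2d, that mirrors the argument order of cudaMemcpy2D (dst, dpitch, src, spitch, width, height) without the copy-kind argument. It is an illustration of the idea, not the CUDA implementation.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only helper: copies `height` rows of `width` bytes each.
   Source rows are `spitch` bytes apart and destination rows are `dpitch`
   bytes apart, so either side may carry padding at the end of each row. */
void copy_2d(void *dst, size_t dpitch, const void *src, size_t spitch,
             size_t width, size_t height)
{
    assert(width <= dpitch && width <= spitch);
    for (size_t row = 0; row < height; ++row) {
        memcpy((char *)dst + row * dpitch,
               (const char *)src + row * spitch,
               width);
    }
}
```

<p>When dpitch, spitch and width are all equal, the rows are back to back and the loop is equivalent to a single flat copy of width * height bytes, which is the case described above. When the destination rows are padded, a flat copy would spill each source row into the padding of the previous one, and the row-by-row form is needed.</p>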
]]></content:encoded>
			<wfw:commentRss>http://myrice.me/2010/07/cuda%e6%b7%b1%e5%85%a5%e6%b5%85%e5%87%ba%e5%ad%a6%e4%b9%a0%e5%b0%8f%e7%bb%931/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

