Archive for the ‘Architecture’ Category

A Primitive Thought on Heterogeneous and Many Core Architecture

November 23rd, 2010

Several years ago, when Intel tried to push CPU frequency beyond 4 GHz, it ran into obstacles. The main problem was that power consumption and temperature rise rapidly with frequency, to the point where ordinary air/fan cooling no longer works. It is not feasible to force desktop users to rely on air conditioning or water cooling to keep the CPU cool. To get around this, Intel integrated more cores on one die. When two or more cores share the work, the workload of each core declines, so each core can run at a lower frequency.
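As a rough illustration of why spreading work over more cores helps, the sketch below uses the textbook CMOS dynamic-power approximation P = C * V^2 * f. The capacitance and voltage figures are made-up placeholders, not Intel data, and the idea that a core running at half the frequency can also run at a lower voltage is an assumption of this sketch rather than something stated above.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    """Textbook CMOS dynamic power estimate: P = C * V^2 * f (in watts)."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical switched capacitance per core; not a measured value.
CAPACITANCE_F = 1e-9

# One core at 4 GHz, assumed to need a higher supply voltage (1.4 V).
single_core_watts = dynamic_power(CAPACITANCE_F, 1.4, 4e9)

# Two cores at 2 GHz each, assumed to run at a lower voltage (1.0 V).
dual_core_watts = 2 * dynamic_power(CAPACITANCE_F, 1.0, 2e9)

print(f"one core  at 4 GHz: {single_core_watts:.2f} W")
print(f"two cores at 2 GHz: {dual_core_watts:.2f} W")
```

Under these assumed numbers the two slower cores deliver the same total clock cycles per second for noticeably less power, which is the trade-off the paragraph describes.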

Figure 1: Performance and Application range

This method works well for now, but it is not the final solution. Today's commodity computers have to handle more diverse situations. Even if we integrate many cores of the same type, applications far from the center of the normal distribution still perform poorly (Figure 1, red curve). This is the shortcoming of a homogeneous architecture. Different processors suit different situations: for example, a CPU for general-purpose computation and a GPU for graphics processing. What if we combine different types of processors? The answer is the yellow curve in Figure 1. Each processor has high throughput in a particular area (blue curves in Figure 1). When they are combined, the application range is wider and performance is higher. For multi-core design, a heterogeneous architecture is undoubtedly more competitive than a homogeneous one.

Last Monday at Tsinghua University, a professor from the Technion gave us some preliminary thoughts on heterogeneous architecture. The main points were: cache/shared memory design, the choice between multi-core and multi-thread, and scheduling in the operating system.

Figure 2: City of Nahalal

Figure 3: Circular cache/shared memory

Previously we put the cores and the shared memory in separate places. However, grouping the cores together is not a good layout for a multi-core design: if a core wants data that is far away (that is, the data has to pass through more nodes between cores), the access takes a long time. A brilliant idea comes from a city named Nahalal (Figure 2). In Nahalal, the factories and the products they supply are in the middle of the city. Residents live around the center, and each household has its own farm next to its house. Shared memory in a computer is like the products in Nahalal: every core needs it. And every core has its own private data, which is like the farms around the houses. The new model proposed is shown in Figure 3.

With this layout, every core can reach data in shared memory easily, and the private data areas give the cores more flexibility.
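To make the distance argument concrete, here is a toy hop-count comparison. Both layouts are simplifications invented for this sketch and are not taken from the talk: in the "linear" layout the cores sit in a row with the shared memory at one end, and in the "Nahalal-style" layout every core sits directly next to a central shared memory.

```python
def linear_hops(num_cores):
    """Hops to shared memory when cores sit in a row and memory is at one end.

    Core 0 is adjacent to the memory (1 hop), core 1 needs 2 hops, and so on.
    """
    return [core + 1 for core in range(num_cores)]


def nahalal_hops(num_cores):
    """Hops to shared memory when every core borders a central shared area."""
    return [1] * num_cores


def average(values):
    return sum(values) / len(values)


NUM_CORES = 8  # arbitrary core count chosen for the example
print("linear layout, average hops: ", average(linear_hops(NUM_CORES)))
print("Nahalal layout, average hops:", average(nahalal_hops(NUM_CORES)))
```

In this toy model the average distance grows with the number of cores in the linear layout but stays constant when the shared memory is in the middle.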

Figure 4: Performance and core number

The second point is the choice between multi-core and multi-thread. The difference, in short, is that the former has a shared cache and the latter does not. A multi-thread design uses only private data and its own status registers to maintain data and state. Multi-thread is a form of many-core. As shown in Figure 4, as the number of cores increases, performance first rises, then falls, and finally rises again. There are three distinct stages in this curve. In the first stage, performance increases with the number of cores. This is the normal situation. But at some point, as more cores are added, cache locality can no longer be guaranteed. The cache becomes useless, so performance declines. This is the second stage. In the last stage, the number of cores is large enough to hide memory request latency, so performance increases again. A better way to avoid the performance loss of the second stage is to switch from cache to shared memory at its start.

The last point was about process scheduling at the operating system level. I didn't catch the point, so I have nothing to say. :)

A Lesson for @milandroid: Why the 8086 Uses Offsets (Segmented Addressing)

October 16th, 2010

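Well, originally I only meant to use the subtitle as the title. Then milandroid suggested this one, so let it be, though "lesson" is far too grand a word.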

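milandroid asked why the 8086 does that "shift by 4 bits" thing. I explained it to her this afternoon, but I think I went too fast and she may not have followed. Afterwards I sorted it out for myself and meant to write a few short, easy-to-understand lines to send over. I ended up writing a lot, so I am posting it here; I had not looked at this material for a long time anyway. That said, a CISC instruction set really does have a lot of addressing modes.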

The original text follows:

An address has 20 bits, each of them 0 or 1, laid out as follows:
19 18 17 16 ---> 1 0
0000 0000 0000 0000 0000

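Because the 8086 is a 16-bit processor, a register can hold only 16 binary digits, i.e. bits 0-15 above, so a scheme is needed to express 20 bits. It is like a device that can store only two decimal digits trying to represent 234: one slot stores 20, another stores 34, and the value is expressed as 20*10+34. (I kept feeling something was off with this example and thought it should be one slot storing 2 and the other 34, giving 2*100+34, which is a left shift by two digits. Both give 234, but the 8086 actually follows the first form: the shift is by a single digit and the two parts overlap.)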

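In the 8086, the slot that stores the 20 (or the 2) is DS. Writing [34] then refers to the thing at DS*10+34. If SI=34, [SI] refers to the same thing. (The 8086 accepts only BX, BP, SI, or DI inside the brackets, so AX cannot be used there.)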

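So, moving to binary, to represent 1011 0111 1111 0001 0101
we write 1011 0111 1111 0001 * 10000 + 0101

This is too long, so by convention it is written in hexadecimal. In hexadecimal it becomes B7F1 * 10 + 5. Note that the 10 here is hexadecimal 10, which is 16 in decimal; written entirely in decimal, the expression is 47089 * 16 + 5.
DS = B7F1 SI = 5
MOV BX, [SI] copies what is stored at address B7F15 into BX. (SI is used because the 8086 does not allow AX as an address register.)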
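The calculation above can be sketched in a few lines of Python. The function name `physical_address` is just a label chosen for this example; the shift by 4 bits and the 20-bit result come from the text, and the final mask reflects the 8086 having only 20 address lines.

```python
def physical_address(segment, offset):
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address.

    Shifting the segment left by 4 bits is the same as multiplying it by
    hexadecimal 10 (decimal 16).
    """
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # Keep only the low 20 bits, since there are only 20 address lines.
    return ((segment << 4) + offset) & 0xFFFFF


address = physical_address(0xB7F1, 0x0005)
print(f"{address:05X}")  # hexadecimal form of the result
print(address)           # the same address in decimal
```

With DS = B7F1 and an offset of 5, this gives the address B7F15 worked out by hand above.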

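Note: to convert between hexadecimal and binary you need to know the following.
a. How hexadecimal digits map to binary
Hex   Binary
1     0001
2     0010
3     0011
4     0100
...   ...
9     1001
A     1010
B     1011
C     1100
D     1101
E     1110
F     1111
10    10000
A-F stand for decimal 10-15. Note that hexadecimal 10 stands for decimal 16.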

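b. Converting between hexadecimal and binary
Starting from the low-order end, every 4 binary digits become one hexadecimal digit, for example
11 1011 = 0011 1011 = 3B
So 1011 0111 1111 0001 0101
is B 7 F 1 5, that is, B7F15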
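The grouping rule in (b) can be written as a short Python sketch. `binary_to_hex` is a hypothetical helper name used only for this example.

```python
def binary_to_hex(bits):
    """Convert a binary string to hexadecimal, 4 bits per hex digit.

    Spaces are ignored, and the string is padded with zeros on the left so
    that grouping effectively starts from the low-order end.
    """
    digits = bits.replace(" ", "")
    padding = (-len(digits)) % 4
    digits = "0" * padding + digits
    groups = [digits[i:i + 4] for i in range(0, len(digits), 4)]
    return "".join(format(int(group, 2), "X") for group in groups)


print(binary_to_hex("11 1011"))
print(binary_to_hex("1011 0111 1111 0001 0101"))
```

The two calls reproduce the examples in the text: 3B and B7F15.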

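P.S. Since a register holds only 16 bits, why go to all this trouble to express 20 bits? Because 16 bits can address only 2^16 bytes = 64 KB of memory, which is clearly too small. So the chip was designed with 20 address lines, which gives 2^20 bytes = 1 MB of memory. That's why.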
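The two sizes are easy to check. This small sketch assumes byte-addressable memory and the usual 1 KB = 1024 bytes convention; `addressable_bytes` is a name made up for the example.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses reachable with the given bit width."""
    return 2 ** address_bits


KB = 1024
MB = 1024 * KB

print(addressable_bytes(16) // KB, "KB with a 16-bit address")
print(addressable_bytes(20) // MB, "MB with a 20-bit address")
```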

- The "Truth and Faith" piece mentioned in the previous post is not finished yet, so this one is slotted in between...

Categories: Architecture