Variables under multithreading - atomic operations with __sync_fetch_and_add and friends

2012-06-29 17:44:46 | Category: Programming


http://blog.csdn.net/i_am_jojo/article/details/7591743

http://www.alexonlinux.com/multithreaded-simple-data-type-access-and-atomic-variables

I recently needed to implement counters in a multithreaded program, to count how often certain events occur. What follows are study notes rather than original work; where I borrow from others, credit is given.
    As we know, an operation like count++ is not atomic. An increment is really three steps:
     1. load the value from memory (cache) into a register;
     2. add 1 in the register;
     3. store the result back to memory.
    Because of timing, several threads operating on the same global variable can interleave between these steps and lose updates: for example, if two threads both load the value 5, each increments its own register copy to 6, and both store 6 back, one increment is lost. This is one of the hard parts of concurrent programming, and on today's multi-core machines the problem shows up more and more.
    The simplest remedy is to protect the counter with a lock, which was also my first solution. See the code below:
      pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

      pthread_mutex_lock(&count_lock);
      global_int++;
      pthread_mutex_unlock(&count_lock);
    Later, searching online, I found the __sync_fetch_and_add family of builtins. The best write-up I found on them is the article quoted below; readers comfortable with English can go straight to the original: Multithreaded simple data type access and atomic variables.

     The __sync_fetch_and_add family consists of twelve functions in total, providing atomic add, subtract, and, or, xor and nand operations. __sync_fetch_and_add, as the name says, fetches first and then adds, returning the value from before the addition. Take count = 4 as an example: calling __sync_fetch_and_add(&count, 1) returns 4, and afterwards count becomes 5.
    Since there is __sync_fetch_and_add, there is naturally also __sync_add_and_fetch, whose meaning is just as clear: add first, then return the new value. The two relate to each other exactly as i++ relates to ++i; anyone who paid their dues to Tan Haoqiang's C textbook will recognize the distinction.
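    A minimal sketch of the difference between the two (the variable names here are just illustrative, not from the original post):

#include <stdio.h>

int main( void )
{
    int count = 4;

    /* fetch-then-add: returns the value *before* the addition, like i++ */
    int before = __sync_fetch_and_add( &count, 1 );  /* before == 4, count is now 5 */

    /* add-then-fetch: returns the value *after* the addition, like ++i */
    int after = __sync_add_and_fetch( &count, 1 );   /* after == 6, count is now 6 */

    printf( "before=%d after=%d count=%d\n", before, after, count );
    return 0;
}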
    With this handy builtin we have a new solution: for multithreaded increments of a global variable, the mutex is no longer needed. The single line below does the same job as the mutex-protected increment above, and it is just as thread-safe.

__sync_fetch_and_add( &global_int, 1 );
    Here is the family portrait of these functions; the names say what each one does.

When compiling with gcc on 32-bit x86, add the option -march=i686.
type __sync_fetch_and_add (type *ptr, type value);
type __sync_fetch_and_sub (type *ptr, type value);
type __sync_fetch_and_or (type *ptr, type value);
type __sync_fetch_and_and (type *ptr, type value);
type __sync_fetch_and_xor (type *ptr, type value);
type __sync_fetch_and_nand (type *ptr, type value);
type __sync_add_and_fetch (type *ptr, type value);
type __sync_sub_and_fetch (type *ptr, type value);
type __sync_or_and_fetch (type *ptr, type value);
type __sync_and_and_fetch (type *ptr, type value);
type __sync_xor_and_fetch (type *ptr, type value);
type __sync_nand_and_fetch (type *ptr, type value);
    Note that type here cannot be just anything. First, look at the instruction that __sync_fetch_and_add disassembles to:
804889d: f0 83 05 50 a0 04 08 lock addl $0x1,0x804a050
    Notice the lock in front of addl: the instruction is encoded with a leading f0 byte, which is an instruction prefix. Richard Blum groups instruction prefixes into the four classes listed below. I have not read the whole story myself (the Intel manuals are thick), but the essence is that the lock prefix gives the instruction exclusive access to the memory region it touches. A sketch of how to reproduce a listing like this follows the prefix list.
  • Lock and repeat prefixes
  • Segment override and branch hint prefixes
  • Operand size override prefix
  • Address size override prefix
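For reference, here is a minimal sketch of how a listing like the one above can be produced. The file and function names are illustrative assumptions, not from the original post.

/* atomic_inc.c - illustrative only; the file name is an assumption.
 * Build and disassemble with, for example:
 *   gcc -march=i686 -O2 -c atomic_inc.c
 *   objdump -d atomic_inc.o
 * The disassembly of bump() should contain a "lock addl $0x1,..." line
 * similar to the one quoted above. */
int global_int = 0;

void bump( void )
{
    __sync_fetch_and_add( &global_int, 1 );
}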

     As mentioned above, lock locks the FSB, the front-side bus between the processor and RAM; while it is held, other processors (or cores) are prevented from touching that memory. This is relatively expensive, so it is only done for small operands. Think of memcpy: locking the bus while copying a large memory region would cost far too much. That is why the __sync_fetch_and_add family restricts type to int, long and long long (and their unsigned counterparts).

     The program below is adapted from Alexander Sandler's original; the credit is his. I only studied his code and changed it slightly to compare the running time of the two approaches (mutex vs. atomic builtin).
#define _GNU_SOURCE

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <linux/unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <linux/types.h>
#include <sys/time.h>   /* gettimeofday() */
#include <time.h>

#define INC_TO 1000000 // one million...

/* Read the CPU time-stamp counter: RDTSC returns the 64-bit cycle count
 * split across EDX:EAX. */
__u64 rdtsc( void )
{
    __u32 lo, hi;

    __asm__ __volatile__
    (
        "rdtsc" : "=a"(lo), "=d"(hi)
    );

    return ((__u64)hi << 32) | lo;
}

 

int global_int = 0;
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;


pid_t gettid( void )
{
    return syscall( __NR_gettid );
}

void *thread_routine( void *arg )
{
    int i;
    int proc_num = (int)(long)arg;
    __u64 begin, end;
    struct timeval tv_begin,tv_end;
    __u64 timeinterval;
    cpu_set_t set;

    CPU_ZERO( &set );
    CPU_SET( proc_num, &set );

    if (sched_setaffinity( gettid(), sizeof( cpu_set_t ), &set ))
    {
        perror( "sched_setaffinity" );
        return NULL;
    }

    begin = rdtsc();
    gettimeofday(&tv_begin,NULL);
    for (i = 0; i < INC_TO; i++)
    {
//     global_int++;
        __sync_fetch_and_add( &global_int, 1 );
    }
    gettimeofday(&tv_end,NULL);
    end = rdtsc();
    timeinterval = (tv_end.tv_sec - tv_begin.tv_sec) * 1000000
                   + (tv_end.tv_usec - tv_begin.tv_usec);
    fprintf(stderr, "proc_num :%d,__sync_fetch_and_add cost %llu CPU cycle,cost %llu us\n",
            proc_num, end - begin, timeinterval);

    return NULL;
}


void *thread_routine2( void *arg )
{
    int i;
    int proc_num = (int)(long)arg;
    __u64 begin, end;

    struct timeval tv_begin,tv_end;
    __u64 timeinterval;
    cpu_set_t set;

    CPU_ZERO( &set );
    CPU_SET( proc_num, &set );

    if (sched_setaffinity( gettid(), sizeof( cpu_set_t ), &set ))
    {
        perror( "sched_setaffinity" );
        return NULL;
    }


    begin = rdtsc();
    gettimeofday(&tv_begin,NULL);

    for(i = 0;i<INC_TO;i++)
    {
        pthread_mutex_lock(&count_lock);
        global_int++;
        pthread_mutex_unlock(&count_lock);
    }

    gettimeofday(&tv_end,NULL);
    end = rdtsc();


    timeinterval = (tv_end.tv_sec - tv_begin.tv_sec) * 1000000
                   + (tv_end.tv_usec - tv_begin.tv_usec);
    fprintf(stderr, "proc_num :%d,pthread lock cost %llu CPU cycle,cost %llu us\n",
            proc_num, end - begin, timeinterval);

 

    return NULL;
}
int main()
{
    int procs = 0;
    int i;
    pthread_t *thrs;

    // Getting number of CPUs
    procs = (int)sysconf( _SC_NPROCESSORS_ONLN );
    if (procs < 0)
    {
        perror( "sysconf" );
        return -1;
    }

    thrs = malloc( sizeof( pthread_t ) * procs );
    if (thrs == NULL)
    {
        perror( "malloc" );
        return -1;
    }

    printf( "Starting %d threads...\n", procs );

    for (i = 0; i < procs; i++)
    {
        /* thread_routine uses __sync_fetch_and_add; swap in thread_routine2
         * here to time the mutex version instead */
        if (pthread_create( &thrs[i], NULL, thread_routine,
            (void *)(long)i ))
        {
            perror( "pthread_create" );
            procs = i;
            break;
        }
    }

    for (i = 0; i < procs; i++)
        pthread_join( thrs[i], NULL );

    free( thrs );

    printf( "After doing all the math, global_int value is:              %d\n", global_int );
    printf( "Expected value is: %d\n", INC_TO * procs );

    return 0;
}
     My test results:

Starting 4 threads...
proc_num :2,no locker cost 27049544 CPU cycle,cost 12712 us
proc_num :0,no locker cost 27506750 CPU cycle,cost 12120 us
proc_num :1,no locker cost 28499000 CPU cycle,cost 13365 us
proc_num :3,no locker cost 27193093 CPU cycle,cost 12780 us
After doing all the math, global_int value is: 1169911
Expected value is: 4000000
Starting 4 threads...
proc_num :2,__sync_fetch_and_add cost 156602056 CPU cycle,cost 73603 us
proc_num :1,__sync_fetch_and_add cost 158414764 CPU cycle,cost 74456 us
proc_num :3,__sync_fetch_and_add cost 159065888 CPU cycle,cost 74763 us
proc_num :0,__sync_fetch_and_add cost 162621399 CPU cycle,cost 76426 us
After doing all the math, global_int value is: 4000000
Expected value is: 4000000

Starting 4 threads...
proc_num :1,pthread lock cost 992586450 CPU cycle,cost 466518 us
proc_num :3,pthread lock cost 1008482114 CPU cycle,cost 473998 us
proc_num :0,pthread lock cost 1018798886 CPU cycle,cost 478840 us
proc_num :2,pthread lock cost 1019083986 CPU cycle,cost 478980 us
After doing all the math, global_int value is: 4000000
Expected value is: 4000000
1. Without any locking, the result is wrong: the expected total is 4,000,000, but the program produced 1,169,911.

2. Both the mutex and the atomic increment produce the correct result.

3. On performance, __sync_fetch_and_add clearly beats the mutex: in these runs it is roughly 6-7 times faster than the pthread lock.







Multithreaded simple data type access and atomic variables

Posted on December 23, 2008, 1:55 pm, by Alexander Sandler, under Programming Articles.


Introduction

In this article I would like to continue the subject I started in my previous two posts (post 1 and post 2). The question I am trying to answer is: what is the most efficient, yet safe, way of accessing simple data type variables from two or more threads? I.e. how do you change a variable from two threads at the same time without corrupting its value?

In my first post I showed how easy it is to turn a variable's value into garbage by modifying it from two or more threads. In my second post I talked about spinlocks, a recent addition to the pthread library. Spinlocks can indeed help solve the problem, yet they are more suitable for protecting small data structures than simple data types such as int and long. Atomic variables, on the other hand, are perfect for the latter task.

The key thing about atomic variables is that once someone starts reading or writing one, nothing else can interrupt the process and come in the middle. I.e. nothing can split the access to an atomic variable into two. This is why they are called atomic.

On the practical side, atomic variables are the best solution for the problem of simultaneous access to a simple variable from two or more threads.

How atomic variables work

This is actually quite simple. The Intel x86 and x86_64 processor architectures (as well as the vast majority of other modern CPU architectures) have instructions that allow one to lock the FSB while doing a memory access. FSB stands for Front-Side Bus; this is the bus the processor uses to communicate with RAM. Locking the FSB prevents any other processor (core), and any process running on it, from accessing RAM. And this is exactly what we need to implement atomic variables.

Atomic variables have long been widely used in the kernel, but for some reason no one bothered to implement them for user-mode folks, until gcc 4.1.2.

Atomic variables size limitations

For practical reasons, the gurus at Intel did not implement FSB locking for every possible memory access. For instance, for quite some time Intel processors have allowed memcpy() and memcmp() to be implemented with a single processor instruction, but locking the FSB while copying a large memory buffer would be too expensive.

In practice you can lock the FSB while accessing 1, 2, 4 and 8 byte long integers. Almost transparently, gcc allows you to do atomic operations on int's, long's and long long's (and their unsigned counterparts).

Use cases

Incrementing a variable and knowing that no one else corrupts its value is nice, but not enough. Consider the following piece of pseudo-code.

decrement_atomic_value();
if (atomic_value() == 0)
    fire_a_gun();

Let us imagine that the value of an atomic variable is 1. What happens if two threads of execution try to execute this piece of pseudo-C simultaneously?

Back to our simulation. It is possible that thread 1 will execute line 1 and stop, while thread 2 will execute line 1 and continue executing line 2. Later thread 1 will wake up and execute line 2.

[Figure: two_threads - the two threads interleaving between the decrement and the check]

When this happens, neither of the threads will run the fire_a_gun() routine (line 3). This is obviously wrong behavior; had we protected this piece of code with a mutex or a spinlock, it would not have happened.

In case you're wondering how likely something like this is to happen, be assured that it is very likely. When I first started working with multithreaded programming I was amazed to find out that, although intuition tells us the scenario I described is unlikely, it happens overwhelmingly often.

As I mentioned, we could solve this problem by giving up on atomic variables and using a spinlock or mutex instead. Luckily, we can still use atomic variables. gcc developers thought about this particular problem and offered a solution; a sketch of how the *_and_fetch variants close this race appears after the function list below. Let's see the actual routines that operate on atomic variables.

The real thing…

There are several simple functions that do the job. First of all, there are twelve (yes, twelve) functions that do atomic add, subtraction, and logical atomic or, and, xor and nand. There are two functions for each operation: one that returns the value of the variable before changing it, and another that returns the value after changing it.

Here are the actual functions:

type __sync_fetch_and_add (type *ptr, type value);
type __sync_fetch_and_sub (type *ptr, type value);
type __sync_fetch_and_or (type *ptr, type value);
type __sync_fetch_and_and (type *ptr, type value);
type __sync_fetch_and_xor (type *ptr, type value);
type __sync_fetch_and_nand (type *ptr, type value);

These functions return the value of the variable before changing it. The following functions, on the other hand, return the value of the variable after changing it.

type __sync_add_and_fetch (type *ptr, type value);
type __sync_sub_and_fetch (type *ptr, type value);
type __sync_or_and_fetch (type *ptr, type value);
type __sync_and_and_fetch (type *ptr, type value);
type __sync_xor_and_fetch (type *ptr, type value);
type __sync_nand_and_fetch (type *ptr, type value);

type in each of the expressions can be one of the following:

  • int
  • unsigned int
  • long
  • unsigned long
  • long long
  • unsigned long long

These are so called built-in functions, meaning that you don’t have to include anything to use them.
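A hedged aside, not from the original article: the decrement-and-test race from the pseudo-code above can be closed with one of the *_and_fetch variants, because the returned value already reflects this thread's own decrement, so the decrement and the test can no longer be split apart. A minimal sketch, with illustrative names:

/* Sketch only, not from the original article. __sync_sub_and_fetch()
 * returns the value after this thread's own decrement, so exactly one
 * thread observes 0 and fires. */
#include <stdio.h>

static long atomic_value = 1;      /* illustrative initial value */

static void fire_a_gun( void )     /* stand-in for the pseudo-code's action */
{
    printf( "fired\n" );
}

void decrement_and_check( void )
{
    if (__sync_sub_and_fetch( &atomic_value, 1 ) == 0)
        fire_a_gun();
}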

Time to see some action

Back to the example I started in the first post I mentioned earlier.

To remind you, this small program spawns several threads, as many as there are CPUs in the computer. It then binds each of the threads to one of the CPUs, and finally each thread runs a loop that increments a global integer one million times.

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <linux/unistd.h>
#include <sys/syscall.h>
#include <errno.h>

#define INC_TO 1000000 // one million...

int global_int = 0;

pid_t gettid( void )
{
    return syscall( __NR_gettid );
}

void *thread_routine( void *arg )
{
    int i;
    int proc_num = (int)(long)arg;
    cpu_set_t set;

    CPU_ZERO( &set );
    CPU_SET( proc_num, &set );

    if (sched_setaffinity( gettid(), sizeof( cpu_set_t ), &set ))
    {
        perror( "sched_setaffinity" );
        return NULL;
    }

    for (i = 0; i < INC_TO; i++)
    {
//      global_int++;
        __sync_fetch_and_add( &global_int, 1 );
    }

    return NULL;
}

int main()
{
    int procs = 0;
    int i;
    pthread_t *thrs;

    // Getting number of CPUs
    procs = (int)sysconf( _SC_NPROCESSORS_ONLN );
    if (procs < 0)
    {
        perror( "sysconf" );
        return -1;
    }

    thrs = malloc( sizeof( pthread_t ) * procs );
    if (thrs == NULL)
    {
        perror( "malloc" );
        return -1;
    }

    printf( "Starting %d threads...\n", procs );

    for (i = 0; i < procs; i++)
    {
        if (pthread_create( &thrs[i], NULL, thread_routine,
            (void *)(long)i ))
        {
            perror( "pthread_create" );
            procs = i;
            break;
        }
    }

    for (i = 0; i < procs; i++)
        pthread_join( thrs[i], NULL );

    free( thrs );

    printf( "After doing all the math, global_int value is: %d\n",
        global_int );
    printf( "Expected value is: %d\n", INC_TO * procs );

    return 0;
}

Note the loop body in thread_routine() (lines 36 and 37 in the original listing): instead of simply incrementing the variable, I use the built-in function __sync_fetch_and_add(). Running this code produces the expected result, i.e. the value of global_int is 4,000,000 (the number of CPUs in the machine multiplied by 1 million; in my case this is a 4-core machine). Remember that when I ran this snippet with the plain increment left in place, the result was 1,908,090 rather than the expected 4,000,000.

Precautions

When using atomic variables, some extra precautions have to be taken. One serious problem with the atomic variable implementation in gcc is that it allows you to do atomic operations on regular variables; i.e. there is no clear distinction between atomic variables and regular variables. Nothing prevents you from incrementing the value of a variable with __sync_fetch_and_add(), as I just demonstrated, and later in the code doing the same thing with the regular ++ operator.

Obviously this can be a serious problem. Things tend to be forgotten, and it is only a matter of time until someone in your project, or even you yourself, modifies the value of the variable using regular operators instead of the atomic functions gcc provides.

To address this problem, I strongly suggest wrapping atomic functions and variables with either an ADT in C or a C++ class.
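For example, a minimal C wrapper along these lines might look like the sketch below; the type and function names are my own illustration, not an API from the article. Because the counter field is only reachable through these helpers, an accidental plain ++ becomes easy to spot in review.

/* atomic_counter.h - illustrative wrapper; names are assumptions, not from the article. */
typedef struct { volatile long v; } atomic_counter_t;

static inline long atomic_counter_inc( atomic_counter_t *c )   /* returns the new value */
{
    return __sync_add_and_fetch( &c->v, 1 );
}

static inline long atomic_counter_dec( atomic_counter_t *c )   /* returns the new value */
{
    return __sync_sub_and_fetch( &c->v, 1 );
}

static inline long atomic_counter_read( atomic_counter_t *c )  /* atomic read, no modification */
{
    return __sync_fetch_and_add( &c->v, 0 );
}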

Conclusion

This article concludes a series of articles and posts where we investigate and study the newest techniques in the world of multithreaded programming for Linux. I hope you'll find these posts and articles useful. As usual, in case you have further questions, please don't hesitate to email me at alexander.sandler@gmail.com.
