Why Is Python Faster Than C When Concatenating Two Strings?
Solution 1:
Accumulated comments (mainly from me) converted into an answer:
- What happens if you use your knowledge of the lengths of the strings and use memmove() or memcpy() instead of strcpy() and strcat()? (I note that the strcat() could be replaced with strcpy() with no difference in result; it might be interesting to check the timing.) Also, you didn't include <string.h> (or <stdio.h>), so you're missing any optimizations that <string.h> might provide!
Marcus: Yes, memmove() is faster than strcpy() and faster than Python, but why? Does memmove() do a word-width copy at a time?
- Yes; on a 64-bit machine, for nicely aligned data, it can move 64 bits at a time instead of 8 bits at a time; a 32-bit machine likely moves 32 bits at a time. It also has a simpler test to make on each iteration (a count), rather than 'is this a null byte?'.
Marcus: But memmove() is still working well even after I make L=L-13, and sizeof(s) gives out L+1024-13. My machine has sizeof(int)==4.
- The code for memmove() is highly optimized assembler, possibly inline (no function call overhead, though for 100 KiB of data, the function call overhead is minimal). The benefits come from the bigger moves and the simpler loop condition.
Marcus: So does Python use memmove() as well, or something magic?
- I've not looked at the Python source, but it is practically a certainty that it keeps track of the length of its strings (they're null-terminated, but Python always knows how long the active part of the string is). Knowing that length allows Python to use memmove() or memcpy() (the difference being that memmove() works correctly even if the source and destination overlap; memcpy() is not obliged to work correctly if they overlap). It is relatively unlikely that they've got anything faster than memmove()/memcpy() available.
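The two advantages named above (wider moves and a count-based loop test) can be sketched side by side. The names byte_copy() and word_copy() below are illustrative, not libc internals; a real memcpy() also handles unaligned heads and tails, which this sketch glosses over by assuming suitably aligned buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time copy with a strcpy-style loop: every iteration must
   both move one byte and test it for the terminating null. */
static char *byte_copy(char *dst, const char *src)
{
    char *d = dst;
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}

/* Length-based copy in word-sized chunks: the loop condition is a plain
   count, and (for aligned data) eight bytes move per iteration. */
static void *word_copy(void *dst, const void *src, size_t n)
{
    uint64_t *d = dst;
    const uint64_t *s = src;
    size_t words = n / sizeof(uint64_t);

    for (size_t i = 0; i < words; i++)  /* count test only, no byte inspection */
        d[i] = s[i];

    /* Tail: copy any leftover bytes one at a time. */
    for (size_t i = words * sizeof(uint64_t); i < n; i++)
        ((char *)dst)[i] = ((const char *)src)[i];
    return dst;
}
```

For 100 KiB of data, word_copy() executes roughly an eighth of the loop iterations of byte_copy(), and each iteration does less work in its condition; that is the shape of the advantage, even before vectorized instructions come into play.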
I modified the C code to produce more stable timings for me on my machine (Mac OS X 10.7.4, 8 GiB 1333 MHz RAM, 2.3 GHz Intel Core i7, GCC 4.7.1), and to compare strcpy() and strcat() vs. memcpy() vs. memmove(). Note that I increased the loop count from 1000 to 10000 to improve the stability of the timings, and I repeat the whole test (of all three mechanisms) 10 times. Arguably, the timing loop count should be increased by another factor of 5-10 so that the timings are over a second.
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

#define L (100*1024)

char s[L+1024];
char c[2*L+1024];

static double time_diff(struct timeval et, struct timeval st)
{
    return 1e-6 * ((et.tv_sec - st.tv_sec) * 1000000 + (et.tv_usec - st.tv_usec));
}

static int foo(void)
{
    strcpy(c, s);
    strcat(c + L, s);
    return 0;
}

static int bar(void)
{
    memcpy(c + 0, s, L);
    memcpy(c + L, s, L);
    return 0;
}

static int baz(void)
{
    memmove(c + 0, s, L);
    memmove(c + L, s, L);
    return 0;
}

static void timer(void)
{
    struct timeval st;
    struct timeval et;
    int i;

    memset(s, '1', L);
    foo();

    gettimeofday(&st, NULL);
    for (i = 0; i < 10000; i++)
        foo();
    gettimeofday(&et, NULL);
    printf("foo: %f\n", time_diff(et, st));

    gettimeofday(&st, NULL);
    for (i = 0; i < 10000; i++)
        bar();
    gettimeofday(&et, NULL);
    printf("bar: %f\n", time_diff(et, st));

    gettimeofday(&st, NULL);
    for (i = 0; i < 10000; i++)
        baz();
    gettimeofday(&et, NULL);
    printf("baz: %f\n", time_diff(et, st));
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        timer();
    return 0;
}
That gives no warnings when compiled with:
gcc -O3 -g -std=c99 -Wall -Wextra -Wmissing-prototypes -Wstrict-prototypes \
-Wold-style-definition cp100k.c -o cp100k
The timing I got was:
foo: 1.781506  bar: 0.155201  baz: 0.144501
foo: 1.276882  bar: 0.187883  baz: 0.191538
foo: 1.090962  bar: 0.179188  baz: 0.183671
foo: 1.898331  bar: 0.142374  baz: 0.140329
foo: 1.516326  bar: 0.146018  baz: 0.144458
foo: 1.245074  bar: 0.180004  baz: 0.181697
foo: 1.635782  bar: 0.136308  baz: 0.139375
foo: 1.542530  bar: 0.138344  baz: 0.136546
foo: 1.646373  bar: 0.185739  baz: 0.194672
foo: 1.284208  bar: 0.145161  baz: 0.205196
What is weird is that if I forego 'no warnings' and omit the <string.h> and <stdio.h> headers, as in the original posted code, the timings I got are:
foo: 1.432378  bar: 0.123245  baz: 0.120716
foo: 1.149614  bar: 0.186661  baz: 0.204024
foo: 1.529690  bar: 0.104873  baz: 0.105964
foo: 1.356727  bar: 0.150993  baz: 0.135393
foo: 0.945457  bar: 0.173606  baz: 0.170719
foo: 1.768005  bar: 0.136830  baz: 0.124262
foo: 1.457069  bar: 0.130019  baz: 0.126566
foo: 1.084092  bar: 0.173160  baz: 0.189040
foo: 1.742892  bar: 0.120824  baz: 0.124772
foo: 1.465636  bar: 0.136625  baz: 0.139923
Eyeballing those results, the header-less code seems to be faster than the 'cleaner' code, though I've not run a Student's t-test on the two sets of data, and the timings have very substantial variability (but I do have things like Boinc running 8 processes in the background). The effect seemed to be more pronounced in the early versions of the code, when just strcpy() and strcat() were tested. I have no explanation for that, if it is a real effect!
Followup by mvds
Since the question was closed I cannot answer properly. On a Mac doing virtually nothing, I get these timings:
(with headers)
foo: 1.694667  bar: 0.300041  baz: 0.301693
foo: 1.696361  bar: 0.305267  baz: 0.298918
foo: 1.708898  bar: 0.299006  baz: 0.299327
foo: 1.696909  bar: 0.299919  baz: 0.300499
foo: 1.696582  bar: 0.300021  baz: 0.299775
(without headers, ignoring warnings)
foo: 1.185880  bar: 0.300287  baz: 0.300483
foo: 1.120522  bar: 0.299585  baz: 0.301144
foo: 1.122017  bar: 0.299476  baz: 0.299724
foo: 1.124904  bar: 0.301635  baz: 0.300230
foo: 1.120719  bar: 0.300118  baz: 0.299673
Preprocessor output (the -E flag) shows that including the headers translates strcpy into builtin calls like:
((__builtin_object_size (c,0)!=(size_t)-1)? __builtin___strcpy_chk (c, s, __builtin_object_size (c,2>1)): __inline_strcpy_chk (c, s));
((__builtin_object_size (c+(100*1024),0)!=(size_t)-1)? __builtin___strcat_chk (c+(100*1024), s, __builtin_object_size (c+(100*1024),2>1)): __inline_strcat_chk (c+(100*1024), s));
So the libc version of strcpy outperforms the gcc builtin. (Using gdb, it is easily verified that a breakpoint on strcpy indeed doesn't break on the strcpy() call if the headers are included.)
On Linux (Debian 5.0.9, amd64), the differences seem to be negligible. The generated assembly (the -S flag) only differs in debugging information carried by the includes.
Solution 2:
I believe the reason for this is that Python strings do not rely on null termination: in Python, the string's length is stored alongside the string data, which lets it skip the implicit strlen() that strcat() performs when concatenating strings.
Add in the fact that string concatenation is implemented directly in C for Python, and that is probably the cause.
Edit: well, now that I actually look at the C code and see that it uses static buffers, I'm mystified as well, as I don't see how Python could avoid dynamic allocations which should be much slower...
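The length-tracking idea from both answers can be sketched concretely in C. The pstr type and functions below are hypothetical illustrations, not CPython's actual string implementation: the point is that once both lengths are known, concatenation is one allocation plus two memcpy() calls, with no strlen()/strcat() scan (error handling for malloc() failure is omitted for brevity).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A Python-like string: length stored alongside the (null-terminated)
   data. Illustrative only; not CPython's real object layout. */
typedef struct {
    size_t len;
    char  *buf;   /* invariant: buf[len] == '\0' */
} pstr;

static pstr pstr_from(const char *s)
{
    pstr p;
    p.len = strlen(s);            /* the scan is paid once, at creation */
    p.buf = malloc(p.len + 1);
    memcpy(p.buf, s, p.len + 1);  /* copy the data and the null */
    return p;
}

/* Concatenation never searches for a null byte: both lengths are known,
   so this is exactly the memcpy()-based pattern that bar() times above. */
static pstr pstr_concat(pstr a, pstr b)
{
    pstr r;
    r.len = a.len + b.len;
    r.buf = malloc(r.len + 1);
    memcpy(r.buf, a.buf, a.len);
    memcpy(r.buf + a.len, b.buf, b.len);
    r.buf[r.len] = '\0';
    return r;
}
```

Note that this sketch does include a dynamic allocation per concatenation, which is the cost the "Edit" above is puzzled about; the allocation is evidently cheap relative to scanning 100 KiB for a null byte.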