Wim Lewis / nettle
Commit 16d2a186, authored Feb 19, 2013 by Niels Möller
ARM memxor: Delay push of registers. Accidentally slowed down memxor3.
parent 993ae2d6
Showing 2 changed files with 36 additions and 27 deletions
ChangeLog +4 -4
armv7/memxor.asm +32 -23
ChangeLog
 2013-02-19 Niels Möller <nisse@lysator.liu.se>
-    * armv7/memxor.asm (memxor): Software pipelining for the aligned case.
+    * armv7/memxor.asm (memxor): Software pipelining for the aligned
+    case. Runs at 6 cycles (0.5 cycles per byte). Delayed push of
+    registers until we know how many registers we need.
     (memxor3): Use 3-way unrolling also for aligned memxor3.
-    Both loops benchmarked at 7 cycles (0.58 cycles per byte), but
-    memxor3 seems to have a strange dependency on instruction
-    alignment.
+    Runs at 8 cycles (0.67 cycles per byte)
 2013-02-12 Niels Möller <nisse@lysator.liu.se>
...
...
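The cycle counts quoted in this ChangeLog entry convert to per-byte figures with simple arithmetic. The following is a minimal sketch in C, assuming each loop iteration processes three 32-bit words (12 bytes), which is what the quoted ratios imply. The helper name `cycles_per_byte` is hypothetical and not part of nettle.

```c
#include <assert.h>

/* Hypothetical helper: convert a per-iteration cycle count into cycles
   per byte, assuming every iteration handles `words_per_iter` 32-bit
   words (4 bytes each). */
static double
cycles_per_byte(unsigned cycles_per_iter, unsigned words_per_iter)
{
  assert(words_per_iter > 0);
  return (double) cycles_per_iter / (4.0 * words_per_iter);
}
```

Under that assumption, 6 cycles gives 6/12 = 0.5 cycles per byte, 7 cycles gives roughly 0.58, and 8 cycles gives roughly 0.67, matching the figures in the entry.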
armv7/memxor.asm
...
...
@@ -30,7 +30,7 @@ define(<DST>, <r0>)
 define(<SRC>, <r1>)
 define(<N>, <r2>)
 define(<CNT>, <r6>)
-define(<TNC>, <r7>)
+define(<TNC>, <r12>)
     .syntax unified
...
...
@@ -43,10 +43,7 @@ define(<TNC>, <r7>)
     .align 4
 PROLOGUE(memxor)
     cmp N, #0
-    beq .Lmemxor_ret
-    C FIXME: Delay push until we know how many registers we need.
-    push {r4,r5,r6,r7,r8,r10,r11,r14} C lr is the link register
+    beq .Lmemxor_done
     cmp N, #7
     bcs .Lmemxor_large
...
...
@@ -54,21 +51,19 @@ PROLOGUE(memxor)
     C Simple byte loop
 .Lmemxor_bytes:
     ldrb r3, [SRC], #+1
-    ldrb r4, [DST]
-    eor r3, r4
+    ldrb r12, [DST]
+    eor r3, r12
     strb r3, [DST], #+1
     subs N, #1
     bne .Lmemxor_bytes
 .Lmemxor_done:
-    pop {r4,r5,r6,r7,r8,r10,r11,r14}
-.Lmemxor_ret:
     bx lr
 .Lmemxor_align_loop:
     ldrb r3, [SRC], #+1
-    ldrb r4, [DST]
-    eor r3, r4
+    ldrb r12, [DST]
+    eor r3, r12
     strb r3, [DST], #+1
     sub N, #1
...
...
@@ -79,7 +74,7 @@ PROLOGUE(memxor)
     C We have at least 4 bytes left to do here.
     sub N, #4
-    ands CNT, SRC, #3
+    ands r3, SRC, #3
     beq .Lmemxor_same
     C Different alignment case.
...
...
@@ -93,7 +88,9 @@ PROLOGUE(memxor)
     C With little-endian, we need to do
     C DST[i] ^= (SRC[i] >> CNT) ^ (SRC[i+1] << TNC)
-    lsl CNT, #3
+    push {r4,r5,r6}
+    lsl CNT, r3, #3
     bic SRC, #3
     rsb TNC, CNT, #32
...
...
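The shift-and-combine step described in the comment of the preceding hunk can be sketched in C. This is an illustration under stated assumptions, not the code from this commit: the function and helper names (`xor_words_misaligned`, `load_le32`, `store_le32`) are hypothetical, words are loaded byte by byte so the sketch behaves the same on any host, and the caller is assumed to guarantee that `nwords + 1` aligned source words are readable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian word, independent of host byte order. */
static uint32_t
load_le32(const uint8_t *p)
{
  return (uint32_t) p[0] | ((uint32_t) p[1] << 8)
    | ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Write a 32-bit little-endian word. */
static void
store_le32(uint8_t *p, uint32_t w)
{
  p[0] = (uint8_t) w;
  p[1] = (uint8_t) (w >> 8);
  p[2] = (uint8_t) (w >> 16);
  p[3] = (uint8_t) (w >> 24);
}

/* XOR nwords words into dst. The real source data starts `offset`
   bytes (1..3) after the aligned address src_aligned. Offset 0 is
   excluded: it would need a shift by 32, and the assembly handles it
   on a separate path (.Lmemxor_same). */
static void
xor_words_misaligned(uint8_t *dst, const uint8_t *src_aligned,
                     unsigned offset, size_t nwords)
{
  assert(offset >= 1 && offset <= 3);
  unsigned cnt = 8 * offset;   /* corresponds to: lsl CNT, r3, #3  */
  unsigned tnc = 32 - cnt;     /* corresponds to: rsb TNC, CNT, #32 */
  uint32_t lo = load_le32(src_aligned);

  for (size_t i = 0; i < nwords; i++)
    {
      uint32_t hi = load_le32(src_aligned + 4 * (i + 1));
      /* DST[i] ^= (SRC[i] >> CNT) ^ (SRC[i+1] << TNC) */
      uint32_t s = (lo >> cnt) ^ (hi << tnc);
      store_le32(dst + 4 * i, load_le32(dst + 4 * i) ^ s);
      lo = hi;
    }
}
```

Only aligned source words are ever read, and each one is read once: the high part of one word and the low part of the next are combined to reproduce the misaligned word.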
@@ -120,12 +117,15 @@ PROLOGUE(memxor)
     subs N, #8
     bcs .Lmemxor_word_loop
     adds N, #8
-    beq .Lmemxor_done
+    beq .Lmemxor_odd_done
     C We have TNC/8 left-over bytes in r4, high end
     lsr r4, CNT
     ldr r3, [DST]
     eor r3, r4
+    pop {r4,r5,r6}
     C Store bytes, one by one.
 .Lmemxor_leftover:
     strb r3, [DST], #+1
...
...
@@ -134,10 +134,14 @@ PROLOGUE(memxor)
     subs TNC, #8
     lsr r3, #8
     bne .Lmemxor_leftover
     b .Lmemxor_bytes
+.Lmemxor_odd_done:
+    pop {r4,r5,r6}
+    bx lr
 .Lmemxor_same:
+    push {r4,r5,r6,r7,r8,r10,r11,r14} C lr is the link register
     subs N, #8
     bcc .Lmemxor_same_end
...
...
@@ -154,8 +158,9 @@ PROLOGUE(memxor)
     ldmia r14!, {r6, r7, r8}
     bcc .Lmemxor_same_wind_down
-    C 7 cycles per iteration, 0.58 cycles/byte
-    C Loopmixer could perhaps get it down to 6 cycles.
+    C 6 cycles per iteration, 0.50 cycles/byte. For this speed,
+    C loop starts at offset 0x11c in the object file.
 .Lmemxor_same_loop:
     C r10-r12 contains values to be stored at DST
     C r6-r8 contains values read from r14, in advance
...
...
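The loop comments in the preceding hunk describe software pipelining: source words for the next iteration are read in advance, while results for the current iteration are stored. The loop body itself is elided in this diff, so the following C sketch only illustrates the general technique with three words per iteration. It does not reproduce the actual register schedule, and the function name `xor_words_pipelined` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR nblocks blocks of three 32-bit words from src into dst.
   Source loads run one iteration ahead of the stores. */
static void
xor_words_pipelined(uint32_t *dst, const uint32_t *src, size_t nblocks)
{
  assert(dst != NULL && src != NULL);
  if (nblocks == 0)
    return;

  /* Prologue: read the first block in advance. */
  uint32_t s0 = src[0], s1 = src[1], s2 = src[2];
  src += 3;

  while (--nblocks > 0)
    {
      /* Combine the words loaded in the previous step. */
      uint32_t d0 = dst[0] ^ s0;
      uint32_t d1 = dst[1] ^ s1;
      uint32_t d2 = dst[2] ^ s2;

      /* Issue the loads for the next iteration before storing. */
      s0 = src[0]; s1 = src[1]; s2 = src[2];
      src += 3;

      dst[0] = d0; dst[1] = d1; dst[2] = d2;
      dst += 3;
    }

  /* Wind-down: the last block was already loaded, only combine it. */
  dst[0] ^= s0; dst[1] ^= s1; dst[2] ^= s2;
}
```

The prologue and wind-down are the price of pipelining: the first loads and the last stores fall outside the steady-state loop, which is why the assembly has separate entry and `.Lmemxor_same_wind_down` paths.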
@@ -188,16 +193,18 @@ PROLOGUE(memxor)
     eor r3, r6
     eor r4, r7
     stmia DST!, {r3, r4}
+    pop {r4,r5,r6,r7,r8,r10,r11,r14}
     beq .Lmemxor_done
     b .Lmemxor_bytes
 .Lmemxor_same_lt_8:
+    pop {r4,r5,r6,r7,r8,r10,r11,r14}
     adds N, #4
     bcc .Lmemxor_same_lt_4
     ldr r3, [SRC], #+4
-    ldr r4, [DST]
-    eor r3, r4
+    ldr r12, [DST]
+    eor r3, r12
     str r3, [DST], #+4
     beq .Lmemxor_done
     b .Lmemxor_bytes
...
...
@@ -342,10 +349,12 @@ PROLOGUE(memxor3)
     subs N, #8
     bcc .Lmemxor3_aligned_word_end
-    C This loop runs at 7 cycles per iteration, but it seems to
-    C have a strange alignment requirement. For this speed, the
-    C loop started at offset 0x2ac in the object file, and all
-    C other offsets made it slower.
+    C This loop runs at 8 cycles per iteration. It has been
+    C observed running at only 7 cycles, for this speed, the loop
+    C started at offset 0x2ac in the object file.
+    C FIXME: consider software pipelining, similarly to the memxor
+    C loop.
 .Lmemxor3_aligned_word_loop:
     ldmdb AP!, {r4,r5,r6}
...
...