Unauthorized distribution prohibited. Link to this page, do not copy it.
I wrote this document after receiving large amounts of email from people who would like to write an emulator of one computer or another but do not know where to start. Any opinions and advice contained in the following text are mine alone and should not be taken as absolute truth. The document mainly covers so-called "interpreting" emulators, as opposed to "compiling" ones, because I do not have much experience with recompilation techniques. It does have a pointer or two to places where you can find information on these techniques.
If you think that this document is missing something or want to make a correction, feel free to email me your comments. I do not reply to flames, idiocy, or requests for ROM images though. I'm badly missing some important FTP/WWW addresses in the resources list in this document, so if you know any worth putting there, tell me about them. The same goes for any frequently asked questions that are not in this document.
This document has been translated into Japanese by Bero. There is also a Chinese translation available, courtesy of Jean-Yuan Chen, and a French translation made by Maxime Vernier. An older French translation by Guillaume Tuloup may or may not be available at the moment. A Spanish translation of the HOWTO has been made by Santiago Romero.
It is necessary to note that you can emulate any computer system, even if it is very complex (such as the Commodore Amiga, for example). The performance of such an emulation may be very low though.
while(CPUIsRunning)
{
  Fetch OpCode
  Interpret OpCode
}
Virtues of this model include ease of debugging, portability, and ease of synchronization (you can simply count clock cycles passed and tie the rest of your emulation to this cycle count).
A single, big, and obvious weakness is the low performance. The interpretation takes a lot of CPU time and you may require a pretty fast computer to run your code at a decent speed.
+ Generally allows you to produce faster code.
+ The emulating CPU registers can be used to directly store the registers of the emulated CPU.
+ Many opcodes can be emulated with similar opcodes of the emulating CPU.
- The code is not portable, i.e. it cannot be run on a computer with a different architecture.
- It is difficult to debug and maintain the code.
+ The code can be made portable so that it works on different computers and operating systems.
+ It is relatively easy to debug and maintain the code.
+ Different hypotheses about how the real hardware works can be tested quickly.
- C is generally slower than pure assembly code.
Good knowledge of the chosen language is an absolute necessity for writing a working emulator, as it is quite a complex project, and your code should be optimized to run as fast as possible. Computer emulation is definitely not one of the projects on which you learn a programming language.
comp.sys.msx       MSX/MSX2/MSX2+/TurboR computers
comp.sys.sinclair  Sinclair ZX80/ZX81/ZXSpectrum/QL
comp.sys.apple2    Apple ][
etc.
Please check the appropriate FAQs before posting to these newsgroups.
For those who want to write their own CPU emulation core or are interested
in how it works, I provide a skeleton of a typical CPU emulator in C
below. In a real emulator, you may want to skip some parts of it and add
some others of your own.
Counter=InterruptPeriod;
PC=InitialPC;
for(;;)
{
  OpCode=Memory[PC++];
  Counter-=Cycles[OpCode];
  switch(OpCode)
  {
    case OpCode1:
    case OpCode2:
    ...
  }
  if(Counter<=0)
  {
    /* Check for interrupts and do other */
    /* cyclic tasks here */
    ...
    Counter+=InterruptPeriod;
    if(ExitRequired) break;
  }
}
First, we assign initial values to the CPU cycle counter (Counter) and the
program counter (PC):
Counter=InterruptPeriod;
PC=InitialPC;
The Counter contains the number of CPU cycles left until the next expected
interrupt. Note that an interrupt does not necessarily have to occur when
this counter expires: you can use it for many other purposes, such as
synchronizing timers or updating scanlines on the screen. More on this
later. The PC contains the memory address from which our emulated CPU will
read its next opcode.
After initial values are assigned, we start the main loop:
for(;;)
{
Note that this loop can also be implemented as
while(CPUIsRunning)
{
where CPUIsRunning is a boolean variable. This has certain advantages, as
you can terminate the loop at any moment by setting CPUIsRunning=0.
Unfortunately, checking this variable on every pass costs CPU time and
should be avoided if possible. Also, do not implement this loop as
while(1)
{
because in this case, some compilers (particularly older or non-optimizing
ones) will generate code checking whether 1 is true or not. You certainly
don't want the compiler to do this unnecessary work on every pass of the
loop.
Now, when we are in the loop, the first thing is to read the next
opcode, and modify the program counter:
OpCode=Memory[PC++];
Note that while this is the simplest and fastest way to read from
emulated memory, it is not always feasible. A more universal
way to access memory is covered later in this
document.
After the opcode is fetched, we decrease the CPU cycle counter by a
number of cycles required for this opcode:
Counter-=Cycles[OpCode];
The Cycles[] table should contain the number of CPU cycles for each
opcode. Beware that some opcodes (such as conditional jumps or subroutine
calls) may take a different number of cycles depending on their arguments.
This can be adjusted later in the code though.
Now comes the time to interpret the opcode and execute it:
switch(OpCode)
{
It is a common misconception that the switch() construct is inefficient,
as it compiles into a chain of if() ... else if() ... statements. While
this is true for constructs with a small number of cases, large
constructs (100-200 or more cases) with densely packed case values usually
compile into a jump table, which makes them quite efficient.
There are two alternative ways to interpret the opcodes. The first is to
make a table of functions and call the appropriate one. This method appears
to be less efficient than a switch(), as you get the overhead from function
calls. The second method would be to make a table of labels and use the
goto statement. While this method is slightly faster than a switch(), it
will only work on compilers supporting "precomputed labels" (GNU C calls
this extension "labels as values"; it is also known as computed goto).
Other compilers will not allow you to create an array of label addresses.
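As an illustration of the first alternative, here is a minimal sketch of dispatching through a table of functions. The CPUState structure, the three toy handlers, and RunTable() are hypothetical names invented for this example; a real table would have one entry per opcode.

```c
#include <stdint.h>

/* Hypothetical CPU state, for illustration only */
typedef struct { uint8_t A; uint16_t PC; } CPUState;

/* Every opcode handler has the same signature */
typedef void (*OpHandler)(CPUState *);

static void OpNop(CPUState *R) { (void)R; }
static void OpInc(CPUState *R) { R->A++; }
static void OpDec(CPUState *R) { R->A--; }

/* Toy table with four entries; a real CPU core would have 256 */
static const OpHandler Handlers[4] = { OpNop, OpInc, OpDec, OpNop };

/* Execute Count opcodes from Code by calling through the table */
static void RunTable(CPUState *R, const uint8_t *Code, int Count)
{
  while(Count-- > 0)
  {
    uint8_t OpCode = Code[R->PC++];
    Handlers[OpCode & 3](R);  /* Mask keeps the toy index in range */
  }
}
```

Each opcode costs an indirect call here, which is the overhead mentioned above; whether it matters in practice depends on the compiler and the host CPU.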
After we have successfully interpreted and executed an opcode, there comes
a time to check whether we need any interrupts. At this moment, you can
also perform any tasks which need to be synchronized with the system clock:
if(Counter<=0)
{
/* Check for interrupts and do other hardware emulation here */
...
Counter+=InterruptPeriod;
if(ExitRequired) break;
}
These cyclic tasks are covered later in this
document.
Note that we do not simply assign Counter=InterruptPeriod, but do a
Counter+=InterruptPeriod: this makes cycle counting more precise, as there
may be some negative number of cycles in the Counter.
Also, look at the
if(ExitRequired) break;
line. As it is too costly to check for an exit on every pass of the loop,
we do it only when the Counter expires: this will still exit the emulation
when you set ExitRequired=1, but it won't take as much CPU time.
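The pieces discussed above can be put together into a small compilable sketch. The three-opcode instruction set (OP_NOP, OP_INC, OP_HALT), the cycle counts, the period of 10 cycles, and the RunCPU() function are all hypothetical, made up only to show the structure of the loop; a real CPU core would have a full opcode table.

```c
#include <stdint.h>

/* Hypothetical toy instruction set, for illustration only */
enum { OP_NOP = 0x00, OP_INC = 0x01, OP_HALT = 0x02 };

#define INTERRUPT_PERIOD 10  /* Assumed cycles between periodic checks */

static uint8_t Memory[65536];               /* Flat emulated memory   */
static const int Cycles[256] = { 4, 4, 4 }; /* Assumed cycle counts   */

static int A;             /* Toy accumulator                          */
static int Periods;       /* Number of times the Counter has expired  */
static int ExitRequired;  /* Checked only when the Counter expires    */

/* Run from InitialPC until an exit is requested; returns final PC */
static uint16_t RunCPU(uint16_t InitialPC)
{
  int Counter = INTERRUPT_PERIOD;
  uint16_t PC = InitialPC;
  uint8_t OpCode;

  A = 0; Periods = 0; ExitRequired = 0;

  for(;;)
  {
    OpCode = Memory[PC++];
    Counter -= Cycles[OpCode];

    switch(OpCode)
    {
      case OP_NOP:  break;
      case OP_INC:  A++; break;
      /* Stay on the HALT opcode until the Counter runs out */
      case OP_HALT: ExitRequired = 1; PC--; break;
      /* Unknown opcode: force the Counter to expire right away */
      default:      ExitRequired = 1; Counter = 0; break;
    }

    if(Counter <= 0)
    {
      /* Interrupts and other cyclic tasks would be handled here */
      Periods++;
      Counter += INTERRUPT_PERIOD;
      if(ExitRequired) break;
    }
  }

  return PC;
}
```

For example, with Memory[0], Memory[1], and Memory[2] holding OP_INC, OP_INC, and OP_HALT, calling RunCPU(0) leaves A equal to 2 and exits at the first expiration of the counter.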
Data=Memory[Address1]; /* Read from Address1 */
Memory[Address2]=Data; /* Write to Address2 */
Such simple memory access is not always possible though, for the following reasons:
Data=ReadMemory(Address1);  /* Read from Address1 */
WriteMemory(Address2,Data); /* Write to Address2 */
All special processing such as page access, mirroring, I/O handling, etc., is done inside these functions.
ReadMemory() and WriteMemory() usually put a lot of overhead on the
emulation because they get called very frequently. Therefore, they must be
made as efficient as possible. Here is an example of these functions
written to access a paged address space:
static inline byte ReadMemory(register word Address)
{
return(MemoryPage[Address>>13][Address&0x1FFF]);
}
static inline void WriteMemory(register word Address,register byte Value)
{
MemoryPage[Address>>13][Address&0x1FFF]=Value;
}
Notice the inline keyword. It tells the compiler to embed the function
into the code instead of making calls to it. If your compiler does not
support inline or _inline, try making the function static: some compilers
(WatcomC, for example) will optimize short static functions by inlining
them.
Also, keep in mind that in most cases ReadMemory() is called several times
more frequently than WriteMemory(). Therefore, it is worth implementing
most of the code in WriteMemory(), leaving ReadMemory() as short and simple
as possible.
ReadMemory(), it is usually not desirable, as ReadMemory() gets called
much more frequently than WriteMemory(). A more efficient way would be to
implement memory mirroring in the WriteMemory() function.
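A minimal sketch of mirroring done on the write side might look as follows. The memory layout is an assumption chosen for this example (2kB of RAM at 0x0000 repeated four times up to 0x1FFF), and RAM[] and InitMemory() are hypothetical names; only ReadMemory(), WriteMemory(), and MemoryPage[] come from the text above.

```c
typedef unsigned char  byte;
typedef unsigned short word;

static byte RAM[0x10000];    /* Backing store for the 64kB space */
static byte *MemoryPage[8];  /* Eight pages of 8kB each          */

/* Point every page at its slice of the backing store */
static void InitMemory(void)
{
  int J;
  for(J = 0; J < 8; J++) MemoryPage[J] = RAM + J * 0x2000;
}

/* Reads stay as short as possible: no mirroring logic here */
static inline byte ReadMemory(word Address)
{
  return MemoryPage[Address >> 13][Address & 0x1FFF];
}

/* Writes to the assumed mirrored area update all four copies */
static inline void WriteMemory(word Address, byte Value)
{
  if(Address < 0x2000)
  {
    word Base = Address & 0x07FF;  /* Offset inside the 2kB RAM */
    int J;
    for(J = 0; J < 4; J++) MemoryPage[0][Base + J * 0x0800] = Value;
  }
  else MemoryPage[Address >> 13][Address & 0x1FFF] = Value;
}
```

The cost of keeping the mirrors consistent is paid on every write to the mirrored area, which is acceptable only because writes are the less frequent operation.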
In order to emulate such tasks, you should tie them to the appropriate number of CPU cycles. For example, if the CPU is supposed to run at 2.5MHz and the display uses a 50Hz refresh frequency (standard for PAL video), the VBlank interrupt will have to occur every
2500000/50 = 50000 CPU cycles
Now, if we assume that the entire screen (including VBlank) is 256 scanlines tall and 212 of them are actually shown on the display (i.e. the other 44 fall into VBlank), we get that your emulation must refresh a scanline every
50000/256 ~= 195 CPU cycles
After that, you should generate a VBlank interrupt and then do nothing until we are done with VBlank, i.e. for
(256-212)*50000/256 = 44*50000/256 ~= 8594 CPU cycles
Carefully calculate the number of CPU cycles needed for each task, then use their greatest common divisor for InterruptPeriod and tie all other tasks to it (they do not necessarily have to execute on every expiration of the Counter).
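The arithmetic above can be turned into a small sketch in which InterruptPeriod equals the length of one scanline. The constants come from the example in the text; the NextScanLine() function and the ScanLine and LinesDrawn variables are hypothetical, with LinesDrawn standing in for real scanline rendering.

```c
#define CPU_CLOCK      2500000  /* 2.5MHz CPU                     */
#define REFRESH_RATE   50       /* 50Hz PAL display               */
#define TOTAL_LINES    256      /* Scanlines including VBlank     */
#define VISIBLE_LINES  212      /* Scanlines actually displayed   */

#define CYCLES_PER_FRAME (CPU_CLOCK / REFRESH_RATE)
#define INTERRUPT_PERIOD (CYCLES_PER_FRAME / TOTAL_LINES)

static int ScanLine;    /* Current scanline, 0..TOTAL_LINES-1     */
static int LinesDrawn;  /* Stand-in for real scanline rendering   */

/* Call this each time the Counter expires.              */
/* Returns 1 when the VBlank interrupt has to be raised. */
static int NextScanLine(void)
{
  int IRQ = 0;

  /* Visible scanlines are refreshed, VBlank ones are skipped */
  if(ScanLine < VISIBLE_LINES) LinesDrawn++;

  ScanLine++;
  if(ScanLine == VISIBLE_LINES) IRQ = 1;             /* Enter VBlank */
  else if(ScanLine == TOTAL_LINES) ScanLine = 0;     /* New frame    */

  return IRQ;
}
```

Note that integer division rounds the scanline length down, so a frame of 256 such periods is slightly shorter than 50000 cycles; an emulator that needs exact timing has to account for the remainder.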
Watcom C++    -oneatx -zp4 -5r -fp3
GNU C++       -O3 -fomit-frame-pointer
Borland C++
If you find a better set of options for one of these compilers or a different compiler, please let me know about it.
GPROF immediately comes to mind) may reveal a lot of wonderful things you
have never suspected before. You may find that seemingly insignificant
pieces of code are executed much more frequently than the rest of it and
slow the entire program down. Optimizing these pieces of code or rewriting
them in assembly language will boost the performance.
int ones as opposed to short or long. This will reduce the amount of code
the compiler generates to convert between different integer lengths. It
may also reduce the memory access time, as some CPUs work fastest when
reading/writing data of the base size aligned to the base size address
boundaries.
register (most new compilers can automatically put variables into
registers though). This makes more sense for CPUs with many general-purpose
registers (PowerPC) than for ones with a few dedicated registers (Intel
80x86).
J/128==J>>7). They execute faster on most CPUs. Also, use bitwise AND to
obtain the modulo in such cases (J%128==J&0x7F). Note that both identities
only hold for non-negative J.
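Since the shift and mask identities are only valid for non-negative values, a small sketch using unsigned operands is shown here. The helper names Div128() and Mod128() are made up for this illustration.

```c
/* Divide by 128 using a right shift (128 == 1<<7) */
static unsigned int Div128(unsigned int J) { return J >> 7; }

/* Take modulo 128 using a bitwise AND with 128-1 */
static unsigned int Mod128(unsigned int J) { return J & 0x7F; }
```

Most optimizing compilers perform this substitution on their own for unsigned operands, so the manual form matters mainly with older compilers or with signed variables known to be non-negative.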
0x12345678 on such a CPU, the memory will look like this:
 0  1  2  3
+--+--+--+--+
|12|34|56|78|
+--+--+--+--+
 0  1  2  3
+--+--+--+--+
|78|56|34|12|
+--+--+--+--+
When writing an emulator, you have to be aware of the endianness of both your emulated and emulating CPUs. Let's say that you want to emulate a Z80 CPU, which is low-endian (little-endian). That is, the Z80 stores its 16-bit words with the lower byte first. If you use a low-endian CPU (for example, Intel 80x86) for this, everything happens naturally. If you use a high-endian (big-endian) CPU (PowerPC) though, there is suddenly a problem with placing 16-bit Z80 data into memory. Even worse, if your program must work on both architectures, you need some way to sense the endianness.
One way to handle the endianness problem is given below:
typedef union
{
short W; /* Word access */
struct /* Byte access... */
{
#ifdef LOW_ENDIAN
byte l,h; /* ...in low-endian architecture */
#else
byte h,l; /* ...in high-endian architecture */
#endif
} B;
} word;
As you see, a word can be accessed as a whole using W. Every time your
emulation needs to access it as separate bytes though, you use B.l and
B.h, which preserves the order.
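As a usage sketch, the union can hold a Z80-style register pair that is updated either as a whole or byte by byte. The HL variable and the LoadHL() and IncHL() helpers are hypothetical. Two assumptions differ from the definition above: W is declared unsigned short to avoid signed overflow, and LOW_ENDIAN is derived from the __BYTE_ORDER__ macro predefined by GCC and Clang (with other compilers it has to be set by hand).

```c
typedef unsigned char byte;

/* Assumption: GCC/Clang predefine __BYTE_ORDER__; elsewhere, */
/* define LOW_ENDIAN manually on low-endian hosts.            */
#if !defined(LOW_ENDIAN) && defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define LOW_ENDIAN
#endif
#endif

typedef union
{
  unsigned short W;  /* Word access (assumed to be 16 bits wide) */
  struct             /* Byte access...                           */
  {
#ifdef LOW_ENDIAN
    byte l,h;        /* ...in low-endian architecture  */
#else
    byte h,l;        /* ...in high-endian architecture */
#endif
  } B;
} word;

/* Hypothetical register pair, in the style of the Z80 HL */
static word HL;

/* Load the two halves separately, as LD H,n and LD L,n would */
static void LoadHL(byte High, byte Low) { HL.B.h = High; HL.B.l = Low; }

/* Increment the pair as a whole, as INC HL would */
static void IncHL(void) { HL.W++; }
```

With this layout, loading 0x12 and 0xFF and then incrementing the pair carries from the low byte into the high byte without any explicit shifting.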
If your program is going to be compiled on different platforms, you may
want to test that it was compiled with the correct endianness flag before
executing anything really important. Here is one way to perform such
a test:
int *T;
T=(int *)"\01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";
/* The byte at the lowest address is 1, so reading 1 means low byte first */
if(*T==1) printf("This machine is low-endian.\n");
else printf("This machine is high-endian.\n");
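The same check can also be written without casting a string literal to an int pointer, which may be a safer choice on hosts with strict alignment rules. This is only a sketch; the HostIsLowEndian() function is a hypothetical name.

```c
#include <string.h>

/* Returns 1 on a low-endian (little-endian) host, 0 otherwise */
static int HostIsLowEndian(void)
{
  unsigned int T = 1;
  unsigned char First;

  /* Copy the byte stored at the lowest address of T */
  memcpy(&First, &T, 1);

  /* On a low-endian host the least significant byte comes first */
  return First == 1;
}
```

The result can be compared against the compile-time flag at startup, so that a binary built with the wrong setting refuses to run instead of misbehaving later.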
A typical emulator should repeat the original system design by implementing each subsystem's functions in a separate module. First, this makes debugging easier, as all bugs are localized in the modules. Second, the modular architecture allows you to reuse modules in other emulators. Computer hardware is quite standardized: you can expect to find the same CPU or video chip in many different computer models. It is much easier to emulate the chip once than to implement it over and over for each computer using this chip.