Commit 470ed77b authored by Per Cederqvist

Document the database process. (Bug 144, partially)

* doc/lyskomd.texi (The Database): Translated old Swedish text
that describes how the database is implemented to English.
Updated the description to match the current implementation.
parent 2a6fc07b
2006-07-31 Per Cederqvist <>
Document the database process. (Bug 144, partially)
* doc/lyskomd.texi (The Database): Translated old Swedish text
that describes how the database is implemented to English.
Updated the description to match the current implementation.
2006-07-27 Per Cederqvist <>
Log changed names.
@@ -1774,191 +1774,233 @@ information on how to restore the database.
@node The Database
@section The Database
This section is not translated to English yet. See a comment in the
@file{lyskomd.texi} for the raw Swedish text.
@subsection History of the database
When implementing the LysKOM system, we planned to write a custom
database. However, that module took a long time to write. In the
meantime, a very simple RAM-based database was created, and work on
the real database more or less halted. We added the ability to save
the RAM-based database when the server was stopped, and to read the
old database when the server started.

Later, we added code that periodically wrote the entire database to
disk. As the popularity of LysKOM increased, the database grew ever
larger, and it took longer and longer to save the database. The
database also needed a huge amount of RAM (or, in practice, swap
space, which made the process slow).
@subsection Current database system
The current database is not saved all at once. Instead, a small part
of it is saved, and then the server processes a few protocol A
requests, saves a bit more of the database, etc.
The entire database is no longer kept in RAM. Only recently used
objects, and objects that have been altered but not yet saved to disk,
are kept in RAM. The other objects reside on disk, and are read into
RAM when needed. (An index of all existing objects is kept in RAM at
all times.)
To keep the saved database consistent, we make sure that
the saved data is a snapshot of the database at a certain point in
time.
@subsection The details
The server has one or two copies of the database open at any time:
@itemize @bullet
@item @dfn{File A} is the last complete file. It contains a snapshot
of the database, which typically is several minutes old.
@item @dfn{File B} is the file that is currently being written. It
contains a newer snapshot than file A, but it is not (yet) complete.
@end itemize
@dfn{File Z} also exists. It is an older complete file, which is
known to be copiable. If, during the creation of File B, it is found
that File A is broken, the administrator can manually revert to File
Z. (This should only be necessary if there is a hardware or OS
failure, or a bug in lyskomd.)
There are currently three kinds of objects that are saved to disk:
Text_stat, Person and Conference. The text below uses Person objects
to explain the process; the other objects are saved in an analogous
way.

The @file{cache-node.h} and @file{simple-cache.c} files contain the
implementation of the database. The @code{Cache_node} structure is
used to keep metadata about a single object.
@example
typedef struct cache_node @{
struct @{
unsigned int exists : 1;
unsigned int dirty : 1; /* Is *ptr modified? */
@} s;
void *snap_shot; /* Dirty data to be written to file B. */
/* (Dirty relative to file A). */
void *ptr; /* In-core data. */
long pos; /* Position of element in file A. */
long size; /* Size on disk. */
long pos_b; /* Position of element in file B. */
long size_b; /* Size in file B. */
struct cache_node *prev; /* Points towards most recently used. */
struct cache_node *next; /* Points towards least recently used. */
int lock_cnt;
@} Cache_node;
@end example
@subsubsection Startup
When the server starts, it scans through the data file (file A) and
creates @code{Cache_node} objects for all existing objects.
@code{s.exists} is set to @code{TRUE}, and @code{pos} and @code{size}
are set to the position and size of the object in the file. All other
fields are set to @code{FALSE}, @code{NULL} or @code{0}.
@subsubsection Retrieval
When something wants to read an object, @code{cached_get_person_stat}
operates like this:
@example
if (!node->s.exists) @{
    Nonexisting person (maybe recently deleted); return NULL.
@} else if (node->ptr) @{
    Put the node first in LRU-list.
    return node->ptr;
@} else if (node->snap_shot) @{
    Perform a deep copy from snap_shot to ptr.
    Put the node first in LRU-list.
    return node->ptr;
@} else @{
    Read the object from file A.
    Set ptr to the object.
    Put the node first in LRU-list.
    return node->ptr;
@}
@end example
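
The retrieval rules above can be sketched in plain C. This is only an illustration under simplifying assumptions, not the code in @file{simple-cache.c}: @code{Person}, @code{Node}, @code{disk_id}, @code{lru_touch}, @code{read_from_file_a} and @code{get_person} are hypothetical names, the object is a flat struct (so assignment is a deep copy), and reading from file A is replaced by a stub.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the real lyskomd types. */
typedef struct { int id; } Person;

typedef struct node {
    unsigned exists : 1;
    Person *snap_shot;          /* snapshot waiting to be written to file B */
    Person *ptr;                /* in-core data */
    int disk_id;                /* pretend contents of this object in file A */
    struct node *prev, *next;   /* LRU links; prev is more recently used */
} Node;

static Node *lru_head = NULL;   /* most recently used node */

/* Move (or insert) a node to the front of the LRU list. */
static void lru_touch(Node *n)
{
    if (lru_head == n)
        return;
    /* Unlink first if the node is already somewhere in the list. */
    if (n->prev) n->prev->next = n->next;
    if (n->next) n->next->prev = n->prev;
    n->prev = NULL;
    n->next = lru_head;
    if (lru_head) lru_head->prev = n;
    lru_head = n;
}

/* Stub standing in for reading and parsing the object from file A. */
static Person *read_from_file_a(const Node *n)
{
    Person *p = malloc(sizeof *p);
    assert(p != NULL);
    p->id = n->disk_id;
    return p;
}

static Person *get_person(Node *n)
{
    if (!n->exists)
        return NULL;                  /* never existed, or deleted */
    if (n->ptr == NULL) {
        if (n->snap_shot != NULL) {
            /* The snapshot must stay untouched, so hand out a copy. */
            n->ptr = malloc(sizeof *n->ptr);
            assert(n->ptr != NULL);
            *n->ptr = *n->snap_shot;  /* flat struct: this is a deep copy */
        } else {
            n->ptr = read_from_file_a(n);
        }
    }
    lru_touch(n);
    return n->ptr;
}
```

Copying from the snapshot instead of returning it directly matters: the caller may modify the returned object, while the snapshot has to stay frozen until it has been written to file B.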
@subsubsection Modifications
Whenever a structure is changed, a function such as
@code{mark_person_as_changed} is called. This sets
@code{node->s.dirty}.
@subsubsection Creation
When an object is created, @code{node->s.exists} and
@code{node->s.dirty} are both set. @code{node->ptr} points to the new
object, which is also inserted in the LRU-list. All other fields are
@code{NULL}, @code{FALSE} or @code{0}.
@subsubsection Deletion
When an object is deleted, @code{node->s.exists} is cleared. If
@code{node->ptr} points to an object, that object is deleted and
@code{node->ptr} is set to @code{NULL}. It is an error to attempt to
delete an object where @code{node->lock_cnt > 0}.
@code{node->snap_shot} is not altered; if the object existed when the
snapshot was taken, it will be written to file B to ensure that the
saved database is consistent.
@subsubsection Throw out unused objects
To keep the lyskomd process small, unused objects are discarded from
the in-core cache. Dirty elements are never discarded. The LRU-list
is used to ensure that the most recently used objects are retained in
RAM. Only those objects where @code{node->lock_cnt == 0} and
@code{!node->s.dirty} and @code{node->ptr != NULL} can be discarded.
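
A minimal sketch of such an eviction pass, assuming a doubly linked LRU list with explicit head and tail pointers. The names @code{Node}, @code{Lru}, @code{lru_push_front}, @code{lru_remove}, @code{discardable} and @code{evict} are made up for this illustration and are not taken from lyskomd.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified cache node. */
typedef struct node {
    unsigned dirty : 1;
    int lock_cnt;
    void *ptr;              /* in-core data, or NULL */
    struct node *prev;      /* towards most recently used */
    struct node *next;      /* towards least recently used */
} Node;

typedef struct { Node *head, *tail; } Lru;

static void lru_push_front(Lru *l, Node *n)
{
    n->prev = NULL;
    n->next = l->head;
    if (l->head) l->head->prev = n; else l->tail = n;
    l->head = n;
}

static void lru_remove(Lru *l, Node *n)
{
    if (n->prev) n->prev->next = n->next; else l->head = n->next;
    if (n->next) n->next->prev = n->prev; else l->tail = n->prev;
    n->prev = n->next = NULL;
}

/* Only clean, unlocked objects that are actually in core may go. */
static int discardable(const Node *n)
{
    return n->lock_cnt == 0 && !n->dirty && n->ptr != NULL;
}

/* Free up to `want` objects, starting with the least recently used.
   Returns the number of objects that were freed. */
static int evict(Lru *l, int want)
{
    int freed = 0;
    Node *n = l->tail;

    while (n != NULL && freed < want) {
        Node *more_recent = n->prev;    /* save before unlinking n */
        if (discardable(n)) {
            free(n->ptr);
            n->ptr = NULL;
            lru_remove(l, n);
            freed++;
        }
        n = more_recent;
    }
    return freed;
}
```

Dirty and locked nodes are simply skipped, so they stay in the list and in RAM until a later sync or unlock makes them eligible.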
@subsubsection Locking
Some objects are locked in RAM. (This may actually be a mistake,
performance-wise, but it is hard to fix, since pointers to those
objects may be kept by parts of lyskomd.) When an object is locked,
@code{node->lock_cnt} is increased. When it is later unlocked, the
count is decreased.
@subsubsection Pre-sync
The actual saving of the database to file B is implemented in three
steps. The first step, called the pre-sync, is made as an atomic
operation. This step is a sweep over all the Cache_node objects.
For each node with @code{node->s.dirty} set, the following is
performed:

@example
if (node->lock_cnt == 0) @{
    node->snap_shot = node->ptr;   (Pointer assignment, not copying.)
    node->ptr = NULL;
    Remove the node from the LRU-list.
@} else @{
    node->snap_shot = deep_copy(node->ptr);
@}
node->s.dirty = FALSE;
@end example
Additionally, for all nodes, @code{node->b_exists} is set to
@code{node->s.exists}.
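
The pre-sync sweep can be sketched as follows. This is a simplified illustration, not the lyskomd source: @code{Person}, @code{Node} and @code{pre_sync} are hypothetical names, the object is a flat struct so a struct assignment serves as the deep copy, and LRU membership is reduced to an @code{in_lru} flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the real types. */
typedef struct { int id; } Person;

typedef struct {
    unsigned exists : 1;
    unsigned b_exists : 1;  /* did the object exist when the snapshot was taken? */
    unsigned dirty : 1;
    unsigned in_lru : 1;    /* simplification: stands in for LRU-list membership */
    int lock_cnt;
    Person *snap_shot;
    Person *ptr;
} Node;

/* Freeze the current state of every node.  Runs as one atomic step,
   so it must be cheap: unlocked dirty objects are handed over by
   pointer, and only locked ones are copied. */
static void pre_sync(Node *nodes, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        Node *n = &nodes[i];

        if (n->dirty) {
            if (n->lock_cnt == 0) {
                n->snap_shot = n->ptr;      /* pointer assignment, no copy */
                n->ptr = NULL;
                n->in_lru = 0;
            } else {
                /* Someone holds a pointer to *ptr, so it must stay. */
                n->snap_shot = malloc(sizeof *n->snap_shot);
                assert(n->snap_shot != NULL);
                *n->snap_shot = *n->ptr;    /* flat struct: deep copy */
            }
            n->dirty = 0;
        }
        n->b_exists = n->exists;
    }
}
```

After the sweep every node is clean relative to its snapshot, so later modifications set the dirty flag again without disturbing the data that is on its way to file B.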
@subsubsection Sync
Step two is performed in many small steps, between servicing requests
from clients. For each node that is processed, this is performed:
@example
if (node->b_exists == FALSE) @{
    /* do nothing */
@} else if (node->snap_shot != NULL) @{
    set node->pos_b
    serialize the object pointed to by node->snap_shot to file B
    set node->size_b
@} else if (node->s.dirty == FALSE && node->ptr != NULL) @{
    set node->pos_b
    serialize the object pointed to by node->ptr to file B
    set node->size_b
@} else @{
    set node->pos_b
    copy node->size bytes from node->pos in file A to file B
    node->size_b = node->size
    /* No serialization is necessary. We can perform a simple
       verbatim copy of the block specified by pos and size. */
@}
@end example
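
The per-node sync step can be sketched in C as below. To keep the example self-contained, files A and B are modelled as in-memory buffers (@code{MemFile}) and objects are serialized with a made-up one-line text format; @code{MemFile}, @code{Person}, @code{Node}, @code{serialize} and @code{sync_node} are all assumptions of this sketch, not lyskomd names or formats.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* In-memory stand-in for a database file. */
typedef struct { char data[256]; long len; } MemFile;

typedef struct { int id; } Person;

typedef struct {
    unsigned b_exists : 1;
    unsigned dirty : 1;
    Person *snap_shot;
    Person *ptr;
    long pos, size;         /* location in file A */
    long pos_b, size_b;     /* location in file B */
} Node;

/* Append a made-up text form of the object; returns the bytes written. */
static long serialize(MemFile *f, const Person *p)
{
    size_t room = sizeof f->data - (size_t)f->len;
    int n = snprintf(f->data + f->len, room, "P%d\n", p->id);

    assert(n > 0 && (size_t)n < room);
    f->len += n;
    return n;
}

/* Write one node to file B, choosing the cheapest correct source. */
static void sync_node(Node *n, const MemFile *a, MemFile *b)
{
    if (!n->b_exists)
        return;                             /* not part of the snapshot */

    n->pos_b = b->len;
    if (n->snap_shot != NULL) {
        n->size_b = serialize(b, n->snap_shot);
    } else if (!n->dirty && n->ptr != NULL) {
        n->size_b = serialize(b, n->ptr);   /* in core and unchanged */
    } else {
        /* Unchanged since file A was written: copy the bytes verbatim. */
        assert((size_t)(b->len + n->size) <= sizeof b->data);
        memcpy(b->data + b->len, a->data + n->pos, (size_t)n->size);
        b->len += n->size;
        n->size_b = n->size;
    }
}
```

The verbatim copy in the last branch is the cheap path: an object that has not been touched since file A was written never needs to be parsed or serialized again.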
@subsubsection Post-sync
Once the entire file B has been written, the final post-sync step is
taken. All Cache_node objects are scanned:
@example
node->pos = node->pos_b;
node->file_b = FALSE;
node->snap_shot = NULL;
@end example
Both file A and B are closed. File B is renamed to file A. The new
file A is opened for reading. And we are now ready to perform a
pre-sync operation again.
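
A sketch of the post-sync sweep over the nodes, under stated assumptions: @code{Node} and @code{post_sync} are hypothetical names, copying @code{size_b} into @code{size} and freeing the snapshot are assumptions of this sketch (the original Swedish notes do list @code{free(snap_shot)}), and the @code{file_b} flag is left out because its definition is not shown here. Closing and renaming the files is not included.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified cache node. */
typedef struct {
    void *snap_shot;        /* data that has now been written to file B */
    long pos, size;         /* location in file A */
    long pos_b, size_b;     /* location in file B */
} Node;

/* File B is complete and is about to become the new file A, so the
   B coordinates become the A coordinates and snapshots are dropped. */
static void post_sync(Node *nodes, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        Node *n = &nodes[i];

        n->pos = n->pos_b;
        n->size = n->size_b;    /* assumption: size follows pos */
        free(n->snap_shot);     /* free(NULL) is a harmless no-op */
        n->snap_shot = NULL;
    }
}
```

Once this sweep is done, no node refers to the old file A any more, which is what makes it safe to replace that file with file B.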
@subsubsection Remarks
@c FIXME: ramkomd is dead! Long live LysKOM!
@c FIXME: Together with Inge I have come up with a way to both bring the
@c FIXME: waiting time during syncs down to <1 second and bring the
@c FIXME: size of the server process down to more reasonable levels. This solution
@c FIXME: does, however, suffer from the big problem that it requires twice as much
@c FIXME: disk space as it really needs. So does ramkomd, so it
@c FIXME: is no worse in that respect. However, this is only a temporary
@c FIXME: solution while waiting for ldb.
@c FIXME: Why save everything at once? Why not save a small part of the file
@c FIXME: at a time, and carry out a few atomic calls between each small
@c FIXME: partial sync? That is roughly what I was thinking when I came up with
@c FIXME: the following scheme for doing the whole thing better than it is now.
@c FIXME: The database that is stored on file is a snapshot of what
@c FIXME: exists in LysKOM. That is how it is in ramkomd; that is how it will be in diskomd.
@c FIXME: (A better name, anyone? I think lyskomd is reserved for the version
@c FIXME: that has a real cache&ldb.) In ramkomd everything is written to disk
@c FIXME: at the same time. In diskomd one only remembers what is to be saved, and saves
@c FIXME: only a bit at a time.
@c FIXME: In ramkomd everything is in RAM (in theory. In practice most of it
@c FIXME: is swapped out - something that is noticeable every time it is time to sync!).
@c FIXME: In diskomd most of it is on disk. In memory there is partly what has
@c FIXME: been used recently, partly what has been changed and not yet synced. Diskomd
@c FIXME: always has at least one, often two, database files open:
@c FIXME: File A Latest complete file.
@c FIXME: File B File under construction.
@c FIXME FIXME: File Z Second latest complete file
@c FIXME FIXME: (this one could be copied once upon a
@c FIXME FIXME: time, even if A cannot be copied
@c FIXME FIXME: right now.)
@c FIXME: (In addition the text mass file, just like ramkomd nowadays.)
@c FIXME: Now to the details:
@c FIXME: There are three types of objects that are affected by this change:
@c FIXME: Text_stat, Person and Conference. I use Person as an example
@c FIXME: below.
@c FIXME: In ram-cache.c there is an array
@c FIXME: Person *pers_arr[ MAX_CONF ];
@c FIXME: It is replaced by
@c FIXME: Cache_node *pers_arr[ MAX_CONF ];
@c FIXME: typedef struct cache_node @{
@c FIXME: Bool exists;
@c FIXME: Bool exists_b;
@c FIXME: Bool dirty; /* Is *ptr modified? */
@c FIXME: void *snap_shot;
@c FIXME: void *ptr;
@c FIXME: off_t pos;
@c FIXME: off_t pos_b;
@c FIXME: struct cache_node *lru_link;
@c FIXME: int lock_cnt;
@c FIXME: @} Cache_node;
@c FIXME: When the server is started it scans through the data file (file A) and sets
@c FIXME: the field exists to TRUE/FALSE and pos to point to the start of the
@c FIXME: place in the file where the data is located. Other fields are set to FALSE/NULL/0.
@c FIXME: When higher-level routines want to read a person the following happens:
@c FIXME: !exists Return NULL
@c FIXME: ptr != NULL Put the node first in lru_link. Return ptr.
@c FIXME: snap_shot != NULL Copy snap_shot to ptr. Put the node first in
@c FIXME: lru_link. Return ptr.
@c FIXME: otherwise Read from file A, set ptr to the struct that was
@c FIXME: read, put the node first in lru_link,
@c FIXME: return ptr.
@c FIXME: When something has been changed the dirty flag is set to TRUE.
@c FIXME: Set exists=TRUE, dirty=TRUE, ptr and lru.
@c FIXME: Set exists=FALSE. ptr=NULL. Probably an error if lock_cnt != 0.
@c FIXME: To keep diskomd from growing too large, things are thrown out of the cache.
@c FIXME: Algorithm: take the first element in lru_list. If dirty==FALSE and
@c FIXME: ptr!=NULL and lock_cnt==0 then free ptr. Repeat until the number of nodes
@c FIXME: with ptr!=NULL and dirty==FALSE is less than the number of "clean"
@c FIXME: elements one wants to keep in memory. (Dirty elements are never
@c FIXME: thrown out.)
@c FIXME: Increase lock_cnt.
@c FIXME: Decrease lock_cnt.
@c FIXME: The saving to file takes place in three steps. First one sweeps over all
@c FIXME: Cache_nodes. For all that have dirty=TRUE one does the following:
@c FIXME: if ( lock_cnt == 0 ) @{
@c FIXME: snap_shot = ptr; (Pointer assignment, not copying.)
@c FIXME: ptr = NULL;
@c FIXME: Remove ptr from the lru chain.
@c FIXME: @} else @{
@c FIXME: snap_shot = copy(ptr);
@c FIXME: @}
@c FIXME: dirty = FALSE;
@c FIXME: For _all_ nodes the following is also done:
@c FIXME: b_exists==exists;
@c FIXME: Step two is carried out a small piece at a time. For example,
@c FIXME: one could save one person after each atomic call, or so.
@c FIXME: b_exists==FALSE? Set pos_b. Write "@@\n" to file B.
@c FIXME: Is snap_shot != NULL? Set pos_b. Write the contents of snap_shot
@c FIXME: to file B.
@c FIXME: dirty==FALSE && ptr!=NULL Write the contents of ptr to file B.
@c FIXME: otherwise: Copy from file A to file B. (Since one
@c FIXME: knows both where the block starts and ends one
@c FIXME: can copy the block without caring about what
@c FIXME: is in it -> very little CPU is used).
@c FIXME: When all Persons have been handled as in SYNC above it is time for the
@c FIXME: third step. Then one goes through all Cache_nodes and does the following:
@c FIXME: pos = pos_b;
@c FIXME: file_b = FALSE;
@c FIXME: free(snap_shot);
@c FIXME: snap_shot = NULL;
@c FIXME: File B, which was previously open for writing, is instead opened for
@c FIXME: reading as file A.
@c FIXME: The contents of snap_shot are always "dirty" compared to the contents of
@c FIXME: file A. What snap_shot points to is never in the lru chain.
The object pointed to by snap_shot is always dirty compared to the
contents of file A. That object is never present on the LRU-list.
@subsection The texts
The above structure is only used for the metadata in the database
(such as the text status, conference status, etc.). The texts
themselves are saved in a separate file. New texts are appended to
the tail of the file. Old texts are never deleted. Never? Well, you
can delete them by shutting down lyskomd, running "dbck -g", and
starting lyskomd again. You must do this periodically
(@pxref{Administration}), or the text file will keep growing.
@node Adding Configuration Parameters
@section Adding Configuration Parameters
@@ -3247,7 +3289,7 @@ in the deletion record is the ID of the object being deleted.
@node Version 2
@section Data File Version 2
Version 2 is used by lyskomd version 2.0.
Version 2 is used by lyskomd version 2.0.0 and newer.
The structure of the data file is similar to version 1. The header has
been extended with a timestamp containing the time when the database file