Improve README

ed63ab97 · Per Cederqvist · 573844db · ed63ab97
Commit ed63ab97 authored 6 months ago by Per Cederqvist
--- a/README
+++ b/README
 Testing the remote installation of rdiff-backup:
    ssh somehost -a -k -x -i /root/.ssh/backupkey /opt/LYSrdiff/bin/rdiff-backup --version
+Setting up a new system
+=======================
+Where to back up data
+---------------------
+The lysrdiff system assumes that you back up data to a set of disks.
+Each disk can be split in several partitions.  Additionally, there can
+be several copies of each disk.  Disks are given a number starting
+with 1.  There is always a copy named "perm" which is permanently
+mounted.  Other copies, named A, B, C and so on, can be unmounted for
+off-site backups.  Partitions are given a number starting with 0.
+The backup jobs are distributed among the partitions.
+Putting this together, a system may have 2 disks, each with 2
+partitions, and 3 copies named "perm", "A" and "B".  These disks would
+then be named like this:
+    1/perm/0
+    1/perm/1
+    1/A/0
+    1/A/1
+    1/B/0
+    1/B/1
+    2/perm/0
+    2/perm/1
+    2/A/0
+    2/A/1
+    2/B/0
+    2/B/1
+When a disk is present, it should be mounted on
+e.g. /lysrdiff/1/perm/0.
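The naming scheme above can be sketched as a small shell function that
prints the expected mountpoint of every partition.  This is only an
illustration: list_mountpoints is a hypothetical helper, not part of
lysrdiff, and the counts and copy names are taken from the example
above.

```shell
#!/bin/sh
# Hypothetical helper (not part of lysrdiff): print the expected
# mountpoint of every partition of every copy of every disk.
# Usage: list_mountpoints NDISKS "COPIES" NPARTS
list_mountpoints() {
    ndisks=$1 copies=$2 nparts=$3
    disk=1                          # disks are numbered from 1
    while [ "$disk" -le "$ndisks" ]; do
        for copy in $copies; do     # unquoted on purpose: split on spaces
            part=0                  # partitions are numbered from 0
            while [ "$part" -lt "$nparts" ]; do
                echo "/lysrdiff/$disk/$copy/$part"
                part=$((part + 1))
            done
        done
        disk=$((disk + 1))
    done
}

# The example system: 2 disks, copies "perm", "A" and "B", 2 partitions.
list_mountpoints 2 "perm A B" 2
```

For the example system this prints twelve paths, in the same order as
the list above.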
+The lysrdiff-label-disk script labels a disk (or, rather, a filesystem
+on a partition).  Labeling a disk writes the partition name to
+/lysrdiff/$disk/$copy/$part/lysrdiff.id; this ID is checked to ensure
+the disks have not been mounted on the wrong mountpoint.
+A README file is also created, and a few spacefiller files that can be
+removed in an emergency if the disk becomes full.  (rdiff-backup can
+get in a state where it can't remove data because the disk is full, so
+it is good to have some spare data to remove.)
+The file /opt/LYSrdiff/var/newtasks should name the partition where
+new backup jobs should be stored by default, in the "$disk/$part"
+format.  For a new system, you would likely start by doing:
+    echo 1/0 > /opt/LYSrdiff/var/newtasks
+The /lysrdiff/$disk/perm/$part/lysrdiff/tasks file contains a list of
+all backup jobs that are stored on that particular partition.  It has
+the same format as /opt/LYSrdiff/var/tasks, which is defined in the
+next section.
+What to back up
+---------------
+The scripts fetch-backup-work and fetch-work-pcfritz are two
+site-specific scripts that dynamically figure out what to back up.
+For a new site, the first task is to write such a script.  The end
+result of running the script is the file /opt/LYSrdiff/var/tasks.
+This file specifies all the backup tasks.
+Each task is defined by a single line with four white-space separated
+fields:
+- category: often a host name, but it may also be a "virtual" category
+  such as "homes" for all home directories or "mail" for mailspools.
+- subcategory: for the "homes" or "mail" category, this would
+  typically be a username.  It can be any string, really: the
+  important thing is that the combination of category and subcategory
+  is unique.
+- host: the host name of the host that holds the data.  It must be
+  possible to log in as root via ssh using the /root/.ssh/backupkey
+  identity.
+- directory: the directory on host that should be backed up.
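When writing a new *fetch* script it can help to sanity-check its
output.  The sketch below tests the two properties stated above: four
white-space separated fields per line, and a unique combination of
category and subcategory.  check_tasks is a hypothetical helper, not
part of lysrdiff, and the sample hosts and paths are made up.  It
assumes that every line is a task, so blank lines are reported too.

```shell
#!/bin/sh
# Hypothetical helper (not part of lysrdiff).
# check_tasks FILE: exit non-zero if any line does not have exactly four
# fields, or if a category/subcategory combination occurs twice.
check_tasks() {
    awk '
        NF != 4 {
            printf "line %d: expected 4 fields, got %d\n", NR, NF
            bad = 1
            next
        }
        seen[$1 " " $2]++ {
            printf "line %d: duplicate category/subcategory %s %s\n", NR, $1, $2
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example with made-up tasks (hosts and paths are hypothetical).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
homes alice fileserver.example.org /home/alice
mail alice mailhost.example.org /var/mail/alice
EOF
check_tasks "$tmp" && echo "tasks file looks well-formed"
rm -f "$tmp"
```

On a real system you would point it at the generated file, for
instance "check_tasks /opt/LYSrdiff/var/tasks".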
+Distributing tasks to partitions
+--------------------------------
+The distribute-tasks script, when given the -i flag, reads the
+/opt/LYSrdiff/var/tasks file and all the
+/lysrdiff/$disk/perm/$part/lysrdiff/tasks files.  It finds any new
+backup job and assigns it to the partition specified by the "newtasks"
+file (see above).  It also detects if a backup job no longer exists,
+and removes it from the partition task list.
+Whenever you use a *fetch* script to figure out what to back up, you
+should run "distribute-tasks -i $disk/$part" to assign the jobs to a
+partition.
+Without the "-i" flag, the script just reorders the task files on the
+specified partitions, to ensure a fair distribution of backup times.
+Backups
+=======
+Performing a backup
+-------------------
+backup-all is the normal way to perform a backup.  Name the disk and
+partition on the command line:
+    backup-all 1/0
+You can also use the --failed, --retry, --new or --continue options
+(see --help).
+The recommended setup is to run backup-all from cron, once per day.
+If you have a situation where backups take more than 24 hours, you may
+wish to instead use backup-repeatedly.  It will start a new backup
+cycle as soon as the previous one ended.  But, really, you should find
+a better solution.  Perhaps you can set up multiple LYSrdiff systems
+that back up part of your data, so that each system can finish in less
+than 24 hours?
+Interrupting a backup
+---------------------
+To avoid data loss, you should avoid interrupting rdiff-backup, which
+is what is used to back up each task.  But LYSrdiff provides a few ways
+to stop a backup in a controlled fashion.
+** Holding **
+Create /opt/LYSrdiff/etc/hold to temporarily pause a backup.  If that
+file exists when backup-all is about to start a backup, it will print
+a message and wait until the file disappears.  This can be useful if
+you want to unmount a copy of a disk for transport to offsite storage.
+** Stopping **
+Create /opt/LYSrdiff/etc/stop to cause backup-all to exit cleanly once
+the current backup job has finished.
+** Finishing **
+Create /opt/LYSrdiff/etc/finish to cause backup-repeatedly to exit
+cleanly once the current backup cycle has finished.
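The three control files can be wrapped in small shell functions, as in
the sketch below.  The function names and the LYSRDIFF_ETC variable are
hypothetical, and the sketch assumes that an empty file is enough, that
is, that only the existence of the file matters.

```shell
#!/bin/sh
# Hypothetical wrappers (not part of lysrdiff).  The directory can be
# overridden through LYSRDIFF_ETC, for example when experimenting.
LYSRDIFF_ETC=${LYSRDIFF_ETC:-/opt/LYSrdiff/etc}

# Pause: backup-all waits before starting the next backup.
hold_backups()   { touch "$LYSRDIFF_ETC/hold"; }

# Resume: backup-all continues once the hold file disappears.
resume_backups() { rm -f "$LYSRDIFF_ETC/hold"; }

# Ask backup-all to exit after the current backup job.
stop_backups()   { touch "$LYSRDIFF_ETC/stop"; }

# Ask backup-repeatedly to exit after the current backup cycle.
finish_backups() { touch "$LYSRDIFF_ETC/finish"; }
```

A typical use is to call hold_backups before unmounting a copy for
off-site transport, and resume_backups once its replacement is mounted.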
+Backup status
+-------------
+The "lysrdiff-status" program reports on the status of the backup
+system.  First, it runs "df -h" on all mounted backup partitions so
+you get an overview of how full the partitions are.  Then, it provides
+a one-line summary of the backup jobs of each partition.  This can
+look like this:
+    1/0: Tasks: 324 Fresh: 320 1day: 2 Stale: 1 Tot: 323 Warn: 1 Err: 1
+This means that there are 324 tasks assigned to disk 1, partition 0.
+320 backups completed in the last 24 hours, 2 are between 24 and 48
+hours old, and 1 backup is older than that.  In total, there are 323
+backup tasks that have completed.  (Since there are 324 tasks, that
+means that 1 task never completed -- hopefully because it was just
+assigned to the disk, but perhaps because there is a problem that
+prevents it from being completed.)
+1 backup job produced a warning, which might mean that 1 or more of
+the files were not properly backed up.  1 backup job failed
+completely, meaning no files were backed up.
+The command then lists all backup jobs that resulted in a warning,
+followed by all jobs that failed, followed by the 5 oldest backups,
+the oldest backup per partition, and the newest backup per partition.
+Finally, it lists all backups that are currently in progress.
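The arithmetic behind the one-line summary (Tasks minus Tot gives the
number of tasks that never completed) can be sketched with awk.
never_completed is a hypothetical helper, not part of lysrdiff, and it
assumes that the summary line has exactly the layout of the example
line shown earlier in this section.

```shell
#!/bin/sh
# Hypothetical helper (not part of lysrdiff).
# never_completed: read summary lines on stdin and print, per partition,
# how many tasks have no completed backup (Tasks - Tot).
never_completed() {
    awk '
        $2 == "Tasks:" {
            tasks = tot = ""
            # After the partition name, fields come in "Label: value" pairs.
            for (i = 2; i < NF; i += 2) {
                if ($i == "Tasks:") tasks = $(i + 1)
                if ($i == "Tot:")   tot   = $(i + 1)
            }
            if (tasks != "" && tot != "")
                printf "%s %d\n", $1, tasks - tot
        }
    '
}

# The example line from this section.
echo "1/0: Tasks: 324 Fresh: 320 1day: 2 Stale: 1 Tot: 323 Warn: 1 Err: 1" |
    never_completed
```

For the example line this prints "1/0: 1", matching the single task
that never completed.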
+Progress console
+----------------
+lysrdiff-monitord.py is an optional monitoring daemon.  If it is
+running, the backup job will report progress information to it.  This
+can be viewed by running "telnet localhost 9934".  While backups are
+running, the connection shows their progress, which can be a nice way
+to see what is going on.
+Adding more sources
+===================
+To start backing up a new system, or a new directory on an old system,
+you have to run your *fetch* script again.  Perhaps you also need to edit
+it first.
+After running the *fetch* script, you need to assign the new task(s)
+to a lysrdiff backup partition.  First, check that
+/opt/LYSrdiff/var/newtasks mentions the proper partition.  Then, run
+"distribute-tasks -i $disk/$part" to assign the task to that partition.