DIASER beta-2 - Technical Manual v 1.0.9

Damian L Brasher - 02/01/2011

This work by Damian L Brasher (http://www.diaser.org.uk/manual.html) is
licensed under a Creative Commons Attribution-Share Alike 2.0 UK: England &
Wales License (http://creativecommons.org/licenses/by-sa/2.0/uk/).



Index

1 Introduction
1.1 Feature overview
2 Explanation of the overall design
2.1 Design philosophy
2.2 The storage architecture
2.3 Integrated approach
2.4 Limitations
2.5 Why Linux?
3 The package and contents
3.1 Downloading and unpacking
3.2 Main source file
3.3 Configuration files
3.4 Example backup software configuration
3.5 Licence
3.6 Documentation
4 Requirements
4.1 Hardware
4.2 Software
4.3 Skills
5 Primary scripts
5.1 diaser
5.2 tab_$.pl
5.3 hvautoc_$.pl
5.4 fill_diaser.pl
6 Explanation of features
6.1 Geographical distribution
6.2 Security
6.3 SE Linux and AppArmor
6.4 Upgrade and modify
6.5 Filling or loading
6.6 Non distinct binary volumes
6.7 Logging
6.8 Archive retrieval
6.9 Data and node migration
6.10 Reporting and monitoring
6.11 Multiple instances
6.12 Extending operation
6.13 Pruning old volumes
6.14 Time zone compensation and leap years
6.15 Digital volume check-sum or stamp
6.16 Complete removal
7 Configuration
7.1 diaser.conf
7.2 Number of years of expected operation
7.3 First year of operation
7.4 Start time of phases
7.5 Node IP addresses
7.6 OpenSSH ports
7.7 Dry run mode
7.8 Lowest maximum bandwidth (LMB)
7.9 Time zone compensation
7.10 Working diaser account name
7.11 Time out
7.12 Home directories
7.13 Fill start time
7.14 Volume directory
7.15 Differential or constant name prefix
7.16 Collect Full volume or not
7.17 Collect Full volume on which day
7.18 Full volume prefix
7.19 More than one configuration file
8 Installation
9 Command Line Options
9.1 --help
9.2 --bandwidth
9.3 --configure
9.4 --extend
9.5 --install
9.6 --list
9.7 --lock
9.8 --logs
9.9 --migrate
9.10 --modify
9.11 --pause
9.12 --recreate
9.13 --remove
9.14 --resume
9.15 --retrieve
9.16 --stats
9.17 --stop
9.18 --upgrade
9.19 --version
10 Operation
10.1 Stop
10.2 Pause
10.3 Resume
10.4 Hard Lock
10.5 Migrate node
11 The Code
11.1 Why Perl?
11.2 Style
11.3 Modules
11.4 Error handling
11.5 Contribute
12 On-line resources
12.1 Website
12.2 SourceForge
12.3 Mailing list
12.4 DIAP/LTASP and early project memory
APPENDIX
A Tables and calculations
B Glossary of terms
C Appliances

1 Introduction

DIASER is for long-term digital archive storage. It securely:

1) Accumulates
2) Geo-Duplicates
3) Manages

<img src="diaser_overview.jpg" alt="diaser overview" width="502"
height="518">

DIASER has been created to meet the mid-range and below, long-term
archiving requirements of the SME: a data vault application. Where tape
has been deployed in the past, DIASER now offers an alternative solution
designed to be more robust and manageable in the long term than simple
NAS devices or disk-based storage alone. This manual is designed to
assist the systems administrator by providing a detailed technical
overview of the system and its component parts, guidance on planning
deployment, installation and storage space calculations, an overview of
the code base and pointers to other available resources.

1.1 Feature overview


Engineered storage architecture
  - for high performance, quality and reliability
Exists and operates in dedicated user accounts
  - self-contained, sealed and easy to migrate environment
Flat, human readable storage structure
  - to ensure data is retrievable and code is comprehensible
Highly resilient and robust
  - to minimise the risk of data loss over many years
Large volume capacity (TBs)
  - extremely low cost storage
Low operational and maintenance overheads
  - reduced cost of ownership
Manage independently from a Perl enabled workstation
  - for a highly manageable solution
Manage long-term archives
  - software that will guarantee retrieval over time
Migratable nodes
  - replace hardware without changing the backup software infrastructure
Multiple configuration files for multiple installations
  - to simplify multiple installation management
Perl installer and configurator
  - for a stable and mature cross-platform environment
Powered by rsync and OpenSSH
  - to utilise the powerful rsync data transfer algorithm
Repair tool
  - allowing broken nodes to be rebuilt
Scalable
  - grows with your disk and network capacity
Secure design
  - to prevent data compromises and minimise the risk of
    vulnerabilities
Simple configuration file and format
  - to ease installation and maintenance overheads
Standards compliant
  - allowing tight systems integration and interoperability
Stats and analysis tools built-in
  - assisting the deployment manager and administrators
Straightforward upgrade procedure
  - to allow new features, enhancements and fixes to be deployed
    quickly
Use commodity disks for robust storage
  - reduce long-term storage costs
UTC Time Zone compensation mechanism
  - nodes can exist across time zones
Works with existing backup infrastructures
  - seamlessly integrate without duplicating deployment costs
3 replicating storage nodes
  - for optimal performance vs maximum data redundancy


Cloud-based computing has taken off over the last few years. DIASER is an
ideal application for cloud computing deployment as well as an archiving
framework solution. Once implemented, the system is invisible to users but
allows them to do more. Cloud computing is a popular term and a useful way of
communicating a complex collection of technologies. The use of virtual
machines in a distributed environment has many advantages. The problems many
people foresee with cloud computing are lock-in, loss of control of data and
increased cost of services. DIASER allows an organisation to build private
storage clouds using existing resources, as you will see in this technical
manual. The result is control over your long-term cloud-based storage, in
terms of administration and resources, as soon as the system is deployed and
beyond. This means that data can be migrated when you want, without
penalties from a third-party provider.

With security in mind at all times, DIASER is based on a carefully designed,
robust storage architecture called LTASP, the Long Term Archive Storage
Protocol. This means consistency is ensured now and in the future. The design
phase involved four years of careful evaluation and testing. DIASER is open
source software released under the GPL v3 licence, so users can enjoy the
benefits of the open development methodology. Simplicity of design and reuse
of code and readily available resources are key to the power of this system.
A strong design philosophy has been cultivated and adhered to for the benefit
of all users. DIASER is written by a systems administrator for systems
administrators, but the potential benefits to an SME, its IT manager, CEO and
committee have been the highest priority throughout all stages of the design
process. The DIASER implementation is targeted primarily at education, hence
the name Distributed Internet Archive System for Educational Repositories;
however, the system can be downloaded and deployed by any SME. DIASER is
designed to be extremely future proof. As an open source product it minimises
the risks associated with vendor lock-in and data retrieval.

More features are planned for the future and the most current development 
road-map can be viewed here:
http://diaser.svn.sourceforge.net/viewvc/diaser/ROADMAP_DEV.

2 Explanation of the overall design

2.1 Design philosophy 

Archiving and backup are both art and science. Over the years I have been a
systems administrator a philosophy has evolved, and I have applied it to the
design of LTASP and DIASER:

Maximise: 
Storage capacity, availability of data, data restoration and recovery speed,
scalability, modularity, cross-platform deployment, resilience and
robustness.
Minimise:
Operating bandwidth overhead, impact of network outages, management overheads,
support costs.
Simplify: 
Development cycle, deployment, data recovery, operation, integration with
existing systems. 

2.2 The storage architecture 

Maintaining archives over a number of years requires organisation. For this
reason DIASER builds a set of slots (directories) on each node in advance,
each corresponding to a date. A slot roughly equates to a single tape. The
slots are created at installation rather than generated as required, for a
number of reasons. The system operates across networks, and network
connections can perform variably or be down completely, so creating a year or
more of slots upon installation ensures that the directories are named, and
therefore dated, correctly. If data is not copied correctly we can then
identify the failure even without log data: log data may or may not be
created on a node, but an empty slot is indicative of a copy or network
failure. Computers are not the best time keepers when left to their own
devices. If the storage structure is created when all nodes are known to be
synchronised, then its accuracy is ensured. If slots were created on the fly
and node time was not synchronised for any reason (BIOS changes, other
software changing the time inadvertently and so on), inaccuracies could
occur. The structure is human readable too; simply put, empty slots are
easier to read and parse than missing slots.

Storing old archives in a well-defined data storage structure is very
important. DIASER can therefore be deployed with a start year in the past,
e.g. 2007 onwards. The system can then be manually filled with old data, like
a filing cabinet, and default automatic operation simply continues.
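The pre-built slot structure described above can be sketched in shell as
follows. This is a minimal illustration and not the code DIASER ships: the
function names make_slots and days_in_month are hypothetical, and the layout
(year/mthN holding Full01, Full02 and d0 up to the last day of the month) is
inferred from the paths quoted elsewhere in this manual.

```shell
# Hypothetical helper: number of days in MONTH of YEAR (Gregorian rules).
days_in_month() (
    year=$1 month=$2
    case $month in
        4|6|9|11) echo 30 ;;
        2)
            if [ $((year % 4)) -eq 0 ] && [ $((year % 100)) -ne 0 ] ||
               [ $((year % 400)) -eq 0 ]; then
                echo 29
            else
                echo 28
            fi ;;
        *) echo 31 ;;
    esac
)

# Hypothetical helper: pre-create NUM_YEARS of dated slots under ROOT,
# starting at START_YEAR. The directory names are assumptions based on
# the example paths in this manual.
make_slots() (
    root=$1 start_year=$2 num_years=$3
    year=$start_year
    while [ "$year" -lt $((start_year + num_years)) ]; do
        month=1
        while [ "$month" -le 12 ]; do
            dir=$root/$year/mth$month
            mkdir -p "$dir/Full01" "$dir/Full02" || exit 1
            last=$(days_in_month "$year" "$month")
            day=0
            while [ "$day" -le "$last" ]; do
                mkdir -p "$dir/d$day" || exit 1
                day=$((day + 1))
            done
            month=$((month + 1))
        done
        year=$((year + 1))
    done
)

# Example: three years of slots for a deployment that starts in the past.
#   make_slots "$HOME/diaser" 2007 3
```

Because every slot exists from the start, a later check for empty
directories is enough to spot a missed copy, which is the point the text
makes about empty slots being easier to parse than missing ones.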

The system is optimised to store a combination of Full and differential
volumes: Fulls are created at the beginning of the month and Diffs during the
month. This does not preclude storage of constant volume sizes, e.g. CCTV
video footage, but calculations must reflect this kind of storage mode. The
recommended data vault operation makes use of certain directories in each
month: Full01 and Full02. Full01 stores a Full volume at the beginning of the
month, and d1 is skipped. Full02 is there for additional redundancy and to
cope with the scenario where the current month is the last (this is not
default behaviour).

There are two parts to the architecture: the storage structure described
above and the data transfer mechanism. Data transfers are initiated by an
internal structure called the hyper virtual auto-changer, a virtual concept
drawn from the mechanical tape changer. The well-used tool rsync is a key
component of this mechanism and its features are utilised fully. DIASER
installs onto three Linux nodes for optimal data storage resilience. No
parity is used, which means complete data can be stored and retrieved even if
a single node is isolated from the others.

DIASER can be managed from any Perl 5.8.8 and network enabled workstation or 
from a node if preferred.

This section can be skipped; it is here for the very technically minded. It
takes a deeper look at the architecture; also see section 6.5, Filling or
loading. Nodes A and B both contain d0 slots. This structure allows copy
phases to span different days simply and accurately. If data were copied
directly from node A d5, then whenever midnight passed a +/-1 day adjustment
would have to be factored in, depending on the point of reference (node$).
Filling of DIASER can also occur well in advance, which keeps the copy phases
operationally contained and therefore gives greater control over operation,
implementation and readability. The filling occurs outside the LMB
calculations and can be at a much slower rate, so the LMB calculations remain
applicable to both phases. d0 also acts as a buffer: the original copy still
exists if an internal copy fails, and simultaneous copies are possible, i.e.
A; d0->d5 and B; d0->d0. Otherwise the second copy would have to wait and
could begin safely only after the first had completed. d0 can be tested for a
successful fill before the phases begin.

The concept of a node role helps to optimise the architecture: operation
differs depending on the role a node holds. To allow roles to be changed in
practice, and to simplify fail-over implementation, the directory structure
is identical on each node, whether it is node A, B or C. The difference
between roles is subtle but important:

Role A: uses the d0 contained in each month and is designed to be closest to
the original backup volume source. Utilised in phase1 only.
Role B: Utilised in phase1 and phase2. Only accepts data during phases.
Role C: Utilised in phase2 only. Only accepts data during phases.


2.3 Integrated approach 

DIASER makes use of existing resources where possible. This results in
streamlined software tightly integrated with the POSIX, Linux computing
environment. Using Perl ensures GNU tools are used for tasks instead of
functionality being rewritten unnecessarily. DIASER uses the common Linux
home directory environment, cron, OpenSSH and rsync. Perl is installed by
default on most Linux operating systems and only the core is required on the
storage nodes. This allows for very simple installation and management. By
using user space the system is contained and a layer away from its host root
environment, which has many positive implications, not least better security
and deployment modularity. DIASER will store backup volumes generated by most
backup software products, at least all those that can write volumes to disk,
lessening operation, integration and installation overheads. A volume is
defined as resembling a single tape entity.

2.4 Limitations 

Storage space is limited by bandwidth. At my reference installation site I
spent half an hour with the IT manager deciding the relative importance of
the organisation's data. We managed to select about 30% of all data generated
on a regular basis and pipe this into DIASER. This practical approach,
coupled with compression (data de-duplication may also be available), means
that the organisation's critical data is stored using DIASER.

Node A is a single point of failure. This is the node that is, in network
terms, closest to the backup server; if it fails, data will cease to
transfer. However, plans exist to allow node A to be bypassed. Even if node A
did prevent data transfer, it is expected the systems administrator has the
skills and access to resolve any issues.

2.5 Why Linux?

Linux should not be underestimated for its appropriateness as a storage
platform, for many reasons. The cost of obtaining Linux is very low: it is
essentially free, both as in libre and free to obtain and use, and supported
versions can be very good value too. Linux is widely available and has
lightweight resource requirements. Licence issues are avoided. Organisations
that need flexibility of deployment with low initial purchase costs get both
when they deploy Linux. Linux is extremely robust under most circumstances;
e.g. the ext3 file system does not normally require regular de-fragmentation,
which makes it ideally suited to storage environments. Many of the tools
required to enable DIASER are included in standard distributions, even small
installations without a GUI or windowing system. This means DIASER is
streamlined and lightweight, and does not attempt to needlessly duplicate
existing code, e.g. rsync.

3 The package and contents

3.1 Downloading and unpacking 

DIASER is currently supplied by anonymous download from SourceForge as
diaser-1.0.$.tar.gz (which contains everything in the subversion repository),
or as an rpm, dist-tarball or deb installation. rpm dependencies will be
installed automatically with yum. The Makefile allows installation as root:
make, make install. The deb package still requires extra dependencies to be
installed. See INSTALL and section 8 of this manual.

3.2 Main source file 

diaser - this file unpacks more embedded scripts which are sent to the 
nodes upon installation, modification and upgrade.

3.3 Configuration files

diaser.conf - this is the main configuration file. See section 7 for
configuration guidance. A second configuration file can be created manually
for development or second deployments. Keep your configuration files in
separate directories or rename them. If no configuration file is present then
the default values set in diaser will be used; this will not lead to a
successful deployment.

Also see section 7.19 for use of more than one configuration file. 

3.4 Example backup software configuration 

helper_scripts/bacula-dir.conf.extract

To fill DIASER with backup volumes created by backup software you need to
name volumes in a certain way. This example configuration comes from the
open source backup software Bacula. If you use Bacula you can implement
volume creation in an identical fashion. If not, use this file as a guide.
The script generated by the installer and residing on node A is called
fill_diaser.pl. As the name suggests, it collects volumes generated by your
backup software, perhaps stored on a share mounted by node A or backed up
directly to node A, and fills DIASER with volumes named in the pre-defined
way.

3.5 Licence 

This software is licensed under GPL v3 - gpl.txt and fdl-1.2.txt. The website
is licensed under fdl-1.2.txt.
The manual, DiaserSystem.png and DiaserDocsv1.1.pdf are licensed under the
Creative Commons Attribution-Share Alike 2.0 UK: England & Wales Licence.

3.6 Documentation 

Documentation is located in the directory docs. This includes this technical
manual, docs/manual.txt, .html or .pdf, and a diagrammatic overview,
docs/overview.png.

Importantly, INSTALL contains a quick start guide.
More theoretical documentation is available from http://www.diap.org.uk, and
don't forget to check http://www.diaser.org.uk for up-to-date project news
and other information.

A man page is also installed.

4 Requirements

4.1 Hardware 

Workstation: 1GHz CPU or above, 500MB RAM and a network connection. You can
also use a node as the installation platform, but you need to ensure all the
Perl modules listed below for the workstation are available.

3 x Linux storage nodes (VMs can be used) with root access for initial setup.
Anything above 1GHz, 32bit or 64bit, with 500MB RAM and enough disk space.
I'll make all this much simpler to calculate when I have finished subroutine
calculate_lmb; see appendix A, tables and calculations.

LAN or WAN connection between each server and the workstation. The 3 machines
must be able to, at least notionally, ping one another. Nodes can be
connected across a Virtual Private Network if necessary.

4.2 Software 

Workstation: Perl v5.8.8 or above (Perl v5.10.0 is recommended for best
performance) with the Perl modules:

Net::SSH::Perl, Net::SFTP, Getopt::Long, AppConfig, Term::ReadKey and
Data::Password. Optional, for the --bandwidth tool: gnuplot v4.2.

Install modules as root, e.g.
]# yum -y install perl-Net-SSH-Perl
or
cpan> install Net::SSH::Perl

Automatic module installation occurs when installing using the rpm release.

Nodes: Perl core (v5.8.8 or above) with File::Find (installed by default with
most distributions) and an SSH server on each node, not necessarily on port
22.

Each node must run these services: sshd, cron, iptables with the SSH port
open, ntpd and rsync (non daemon).
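
Before installing, the requirements above can be checked with a couple of
small shell helpers. This is only a sketch: the function names missing_tools
and missing_perl_modules are hypothetical and are not part of DIASER. The
module check relies on perl's standard -M switch, which fails when a module
cannot be loaded.

```shell
# Hypothetical helper: print each named command that is not found on
# PATH, one per line. Prints nothing when everything is present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Hypothetical helper: print each named Perl module that the local perl
# cannot load, one per line.
missing_perl_modules() {
    for module in "$@"; do
        perl -M"$module" -e 1 >/dev/null 2>&1 || printf '%s\n' "$module"
    done
}

# Example, on a node:
#   missing_tools perl rsync ssh crontab
# Example, on the workstation:
#   missing_perl_modules Net::SSH::Perl Net::SFTP Getopt::Long \
#       AppConfig Term::ReadKey Data::Password
```

An empty result means nothing is missing; any name printed needs installing
with yum or cpan as described above.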

4.3 Skills 

It is recommended the administrator has at least these skills:
Bash command line - the ability to move around directories, create files and
directories, set permissions and add and remove user accounts. Knowledge of
SSH logins, a text editor and adding and removing software. Basic knowledge
of rsync and the ability to use scp effectively. Use of the commands less and
cat. The ability to install Perl modules and check versions.

Less important, but helpful, are some Perl scripting ability and basic bash
scripting skills.

5 Primary scripts

5.1 diaser 

The primary script containing most of the DIASER code. Code embedded within
diaser is unpacked and copied to nodes with variables set by the user.
For upgrades and configuration changes code is again unpacked and copied
over to nodes as required. 

5.2 tab_$.pl 

There is one for each node. It contains the crontab definitions which trigger
the internal diaser data copies managed by the scripts hvautoc_$.pl. The cron
job runs every hour, i.e. 0 * * * * ~/hvautoc_a.pl, and the script reads the
local system time, compares it with the user-set copy phase and, if there is
a match, initiates the data transfer. The script logs to the node, in log_$,
as does rsync.

5.3 hvautoc_$.pl 

Each node has a single hvautoc_$.pl script. This script is triggered every
hour and, depending on the times set by the user variables HOUR1 and HOUR2,
initiates the rsync data transfers. If the user modifies variables, the
updates can be copied to the nodes by replacing the hvautoc_$.pl scripts.
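
The hourly check can be pictured with the shell sketch below. It is an
illustration of the idea only, not the Perl in hvautoc_$.pl: the function
names phase_for_hour and current_hour are hypothetical, and the real script
also applies the time zone compensation described in section 6.14.

```shell
# Hypothetical helper: print which phase, if any, starts at hour NOW.
# usage: phase_for_hour NOW HOUR1 HOUR2   (hours as integers 0-23)
phase_for_hour() {
    if [ "$1" -eq "$2" ]; then
        echo phase1
    elif [ "$1" -eq "$3" ]; then
        echo phase2
    else
        echo none
    fi
}

# Hypothetical helper: print the current local hour as a plain integer.
# One leading zero is stripped so that 08 and 09 are never read as octal.
current_hour() {
    hour=$(date +%H)
    echo "${hour#0}"
}

# Example, as run once an hour from cron with HOUR1=2 and HOUR2=4:
#   phase_for_hour "$(current_hour)" 2 4
```

Running the check every hour and doing nothing on a mismatch keeps the
crontab entry identical on every node; only the HOUR1 and HOUR2 values need
to change when the phases are moved.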

5.4 fill_diaser.pl 

This script resides only on node A. It is responsible for filling the
correct slot with data fed into DIASER by the user. The script is called by a
cron job set when configuring or modifying DIASER. It copies the most
recently created Full, Differential or constant volume into the DIASER
directory, to either Full01 or d0. Aside from the cron job time there are a
number of variables that can be user configured, including the volume
directory, which is where your backup software stores volumes, and the volume
prefix, e.g. fullbackup... for Full volumes.
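
Selecting the most recently created volume for a given prefix can be sketched
in shell as below. This is an assumption about how such a selection might
work, not the logic of fill_diaser.pl itself; the function name latest_volume
is hypothetical and the comparison uses file modification times.

```shell
# Hypothetical helper: print the newest regular file in DIR whose name
# starts with PREFIX. Returns non-zero if no such file exists.
# Uses the -nt (newer than) test, which bash, dash and busybox sh support.
latest_volume() (
    dir=$1 prefix=$2 newest=
    for file in "$dir/$prefix"*; do
        [ -f "$file" ] || continue
        if [ -z "$newest" ] || [ "$file" -nt "$newest" ]; then
            newest=$file
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
)

# Example: find the latest Full volume written by the backup software.
#   latest_volume /path/to/volume/directory fullbackup
```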


Filling is designed to be as simple as possible. Volumes on your file store
are assumed to be read/write by user id: $your_diaser_uid. This flow chart
provides a detailed overview of the fill process; everything apart from the
node A->B copy check has been implemented:

<img src="fill_diaser_flow.png" alt="fill diaser flow chart" width="640"
height="740">


fill_diaser.pl automatically clears out the drop-off directory ad0 once its
contents would normally have been transferred to other slots as specified by
the architecture.

6 Explanation of features

6.1 Geographical distribution

Tapes can be moved from site to site and often are. Distributing data
emulates this ability and provides geographical redundancy. A simple mirror
of a NAS device is one way to achieve this, but spreading data over three
nodes can be difficult to manage. DIASER is a self-contained wrapper around
long-term archiving across three nodes. We believe the extra resilience
provided by storing in three geographical locations gives your archives the
protection needed for long-term planning and data retrieval. Ensuring your
archives are safe means a better chance of recovering data when you need it.
Being a disk-based solution helps render your data more accessible in many
scenarios. Planning your installation is important and, as the system may run
for years, time spent before deployment will pay off. DIASER is ready for
trial and evaluation. Your chosen storage nodes may also be equipped with
RAID; this is highly recommended.

6.2 Security

These security precautions have been implemented. The primary script, diaser,
does not store any passwords on file. Passwords are stored in memory
temporarily while the script runs. When a password is requested the entry is
hidden. New DIASER account passwords are quality checked and a warning is
given if they are not secure. Root passwords are only requested when the
system is installed and removed. DIASER exists and runs in user space. All
network communication is handled by OpenSSH. A unique RSA certificate is
generated so the nodes can use password-less logins to transfer data and
communicate during normal operation. Password-less login certificates can be
regenerated using the switch --upgrade. A kind of emergency account lock can
be initiated with the switch --lock.

The Perl modules Net::SSH::Perl and Net::SFTP are used for all SSH
communications and file transfers initiated by the system. Rsync uses SSH to
transfer data. It is possible to use a port other than the standard SSH port
22, set individually for each node.

A sha256sum checksum and date stamp file is created as every volume enters
DIASER, in a format similar to:

4865c5bdf3cf64709acd797688db5b337e7c8643 
2009/mth7/Full01/fullbackup7
Tue Jul 21 07:10:28 BST 2009
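
A stamp file in the three-line format shown above could be produced and later
checked as in the sketch below. The function names stamp_volume and
verify_volume and the ".stamp" suffix are assumptions made for illustration;
they are not DIASER's own naming. Note also that the digest in the sample
above is 40 hexadecimal digits long, whereas sha256sum prints 64, so the tool
behind that sample may differ; the sketch follows the text and uses
sha256sum.

```shell
# Hypothetical helper: write VOLUME.stamp holding the digest, the slot
# path and the current date, one per line.
# usage: stamp_volume VOLUME SLOT_PATH
stamp_volume() (
    volume=$1 slot_path=$2
    digest=$(sha256sum "$volume") || exit 1
    {
        printf '%s\n' "${digest%% *}"   # keep the digest, drop the file name
        printf '%s\n' "$slot_path"
        date
    } > "$volume.stamp"
)

# Hypothetical helper: succeed only if VOLUME still matches the digest
# recorded on the first line of VOLUME.stamp.
verify_volume() (
    volume=$1
    recorded=$(head -n 1 "$volume.stamp") || exit 1
    digest=$(sha256sum "$volume") || exit 1
    [ "${digest%% *}" = "$recorded" ]
)

# Example:
#   stamp_volume fullbackup7 2009/mth7/Full01/fullbackup7
#   verify_volume fullbackup7 && echo "volume unchanged"
```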

For extra security DIASER can run within a Virtual Private Network. It is
recommended that encrypted partitions are used for DIASER, e.g. when
deploying an external USB hard drive.

In the example below /dev/sdb is an externally attached USB2 hard disk drive;
replace it with the disk chosen on your system.


# Create a new partition on the disk

fdisk /dev/sdb

# Generate a mapping and LUKS partition

cryptsetup --verbose --verify-passphrase luksFormat /dev/sdb1

cryptsetup luksOpen /dev/sdb1 sdb1

# Format the partition

mkfs.ext3 -j /dev/mapper/sdb1 

# Mount the partition for the first time

mount /dev/mapper/sdb1 /mnt/crypt/

df -h

# Open and mount the device after reboot or disk removal

cryptsetup luksOpen /dev/sdb1 sdb1

mount /dev/mapper/sdb1 /mnt/crypt/

# Unmount and close

umount /mnt/crypt/
cryptsetup luksClose sdb1


6.3 SE Linux and AppArmor

No problems observed during either installation or operation.



6.4 Upgrade and modify 

The modify switch, see below, is currently still under review. For now the
upgrade switch sends both modifications and upgrades to the nodes. This does
not and will not modify the archive storage directory structure. Changes to
settings and development improvements can be sent using this option. If you
are moving to a newer version than the one previously installed, follow these
steps:

1) rename your current diaser_rel
2) unpack the download, see section 3.1 
3) copy your previous diaser.conf to the new diaser_rel
4) run ]$diaser --upgrade to update your DIASER installation
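
Steps 1 to 3 can be scripted along the lines below. This is a sketch under
two assumptions that should be checked against your download: that the
tarball unpacks into a directory called diaser_rel, and that a timestamped
name is acceptable for the renamed old release. The function name
prepare_upgrade is hypothetical.

```shell
# Hypothetical helper covering steps 1 to 3. Run it from the directory
# that contains diaser_rel. Prints the name given to the old release.
# usage: prepare_upgrade TARBALL
prepare_upgrade() (
    tarball=$1
    backup=diaser_rel.$(date +%Y%m%d%H%M%S)
    [ -f diaser_rel/diaser.conf ] || exit 1             # nothing to carry over
    mv diaser_rel "$backup" || exit 1                   # step 1: rename
    tar -xzf "$tarball" || exit 1                       # step 2: unpack
    cp -p "$backup/diaser.conf" diaser_rel/ || exit 1   # step 3: copy config
    printf '%s\n' "$backup"
)

# Step 4 is then run by hand from the new diaser_rel: diaser --upgrade
```

Keeping the renamed directory until the upgrade has been verified leaves a
simple way back to the previous release and its configuration.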

6.5 Filling or loading 

See section 5.4.

The initial entry point for data, d0 (node A, directory 0), resides in each
monthly segment; there is not a single d0 in the root directory. This lessens
the risk of deleting or overwriting archive data that may not, for whatever
reason, have been successfully transferred to the other nodes. If the
connection to node B fails there will still be at least two copies of the
file, in d0 and in d30 (or whatever the last day of the month happens to be),
until another Full is generated and the next month's d0 is cleared and
filled. This adds more resilience at little extra cost. Also, if copies are
set to occur only once a month, and a copy failed as before without being
noticed until after the next copy, last month's data will have been deleted
and only a single copy stored.

6.6 Non distinct binary volumes

The volumes which have been described are binary files, like those created by
Bacula. Other backup software generates directories, which need some
processing before they can be collected by DIASER.

There are a number of problems to avoid to ensure DIASER operates
non-destructively, so instead of manipulating the directories in your data
store I suggest you use a script to create tar volumes of the archives you
want to be collected. Here is a suggestion of how this might be achieved.


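    # Non distinct binary volume alternative collection.
    # Run as a cron job independently of DIASER.
    # Sketch only: the paths, prefixes, volume names and the .stamp suffix
    # below are placeholders to adapt, not values taken from DIASER.
    use strict;
    use warnings;

    my $store       = '/path/to/data/store';   # directories from backup software
    my $diaser      = "$ENV{HOME}/diaser";     # DIASER directory structure
    my $full_prefix = 'fullbackup';            # chosen Full volumes prefix
    my $diff_prefix = 'diffbackup';            # chosen Diff volumes prefix

    sub non_full_binary {
        # Look for directories; do nothing if there are none.
        my @dirs = grep { -d } glob("$store/*");
        return unless @dirs;

        my ($mday, $mon, $year) = (localtime)[3, 4, 5];
        my $month = sprintf '%s/%d/mth%d', $diaser, $year + 1900, $mon + 1;
        my $full  = "$month/Full01/$full_prefix" . ($mon + 1);
        my $diff  = "$month/d$mday/$diff_prefix$mday";

        # If there is no Full this month, tar all directories into the
        # Full01 slot. Otherwise tar only the files changed since the Full
        # (a simple Diff against it, using GNU tar) into the day slot.
        my ($volume, @newer) = (-e $full) ? ($diff, "--newer=$full") : ($full);
        system('tar', '-cf', $volume, @newer, @dirs) == 0
            or die "tar failed: $?\n";

        # Store the checksum and date alongside the new volume.
        system("(sha256sum '$volume'; date) > '$volume.stamp'") == 0
            or die "stamp failed: $?\n";
        return;
    }

    non_full_binary();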



6.7 Logging 

Log files are kept on all nodes and named log_$, where $ is the node: a, b or
c. The scripts hvautoc_$.pl, fill_diaser.pl and all rsync transfers log to
these files. The log files are created automatically as soon as the system
begins operation. All entries contain [diaser_hvautoc_$] or [diaser_fill],
where $ is the node: a, b or c.


6.8 Archive retrieval 

Either use the simple tool provided by the --retrieve option, which also has
additional command line options, or log in to the nodes directly and use scp.
The retrieval tool will walk you through a set of questions, then list files
for you to pick and transfer. The file will retain its name and be placed in
the diaser_rel directory.

If you use cp, scp, rsync or other native tools, the directory structure is
human readable and matching the required date to directories is easily
achieved, e.g. on node $ the archives stored on June 25th 2009 can be found
in ../diaser/2009/mth6/d26. Navigate to the directory and copy the contents
to the required recovery destination. It is assumed you have the tools,
provided by your backup software vendor, to extract your data. It is
recommended you also archive any backup catalogues or tools generated by or
provided with your usual backup software.
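
Manual retrieval with native tools can be sketched as below. The function
names slot_dir and retrieve_slot are hypothetical, and the slot name is
passed explicitly rather than derived from a calendar date, since the example
above pairs June 25th with d26. PORT, USER and NODE in the closing comment
are placeholders.

```shell
# Hypothetical helper: print the slot directory for YEAR, MONTH and SLOT.
# usage: slot_dir ROOT YEAR MONTH SLOT
slot_dir() {
    printf '%s/%s/mth%s/%s\n' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: copy every file in a slot to DEST, keeping
# timestamps. Fails if the slot does not exist or is empty.
# usage: retrieve_slot ROOT YEAR MONTH SLOT DEST
retrieve_slot() (
    src=$(slot_dir "$1" "$2" "$3" "$4")
    dest=$5
    [ -d "$src" ] || exit 1
    mkdir -p "$dest" || exit 1
    cp -p "$src"/* "$dest"/
)

# Example, when logged in to a node:
#   retrieve_slot ../diaser 2009 6 d26 /recovery/destination
# From a workstation the same path can be given to scp, for example:
#   scp -P PORT 'USER@NODE:diaser/2009/mth6/d26/*' /recovery/destination/
```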

6.9 Data and node migration 

Node migration can be achieved using the --migrate tool.

6.10 Reporting and monitoring 

Bandwidth throughput calculations can be made using the --bandwidth tool.
See section 9.2 for more details. This is an example screenshot of the
output:



6.11 Multiple instances

Share disk space with other organisations or groups by using a different
account name and staggering or alternating the transfer times (phases), or by
lowering the LMB (lowest maximum bandwidth) between nodes. See diaser.conf.
diaser allows the use of more than one configuration file. See section 7.19.

Also, if more than one pair of phases is required, e.g. a morning session and
a night session, then two instances on the same nodes will archive at
alternating phase times. If one instance contains Full volumes then the
second does not necessarily need to archive these as well, thus saving disk
space.

6.12 Extending operation

Operation can be extended. The minimum recommended is two years. You can set
DIASER to install for 10 or even 20 years, which means 10-20 years of archive
directory structure will be created. A deployment can also represent the past
if required, and then be manually filled with previously generated archive
data.

6.13 Pruning old volumes

Not yet implemented. This will allow the user to remove old archives from
DIASER freeing up disk space.

6.14 Time zone compensation and leap years

Time zone compensation allows all the nodes to work together across time
zones. The user is asked for the time zone of each of nodes A, B and C as a
UTC +/- integer value. If node A is on BST (UTC+1), use 0, as daylight saving
is usually automatic on most systems. For three servers in the same time zone
use the same offset integer value for each node.
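
The arithmetic behind such compensation can be illustrated with the sketch
below, which converts an hour on one node to the equivalent hour on another.
How DIASER applies the offsets internally is not described here, so this is
an assumption for illustration only; the function name shift_hour is
hypothetical.

```shell
# Hypothetical helper: convert HOUR on a node at UTC+FROM to the
# equivalent hour on a node at UTC+TO, wrapping around midnight.
# Pass plain integers (8, not 08); offsets may be negative.
# usage: shift_hour HOUR FROM TO
shift_hour() {
    echo $(( (($1 - $2 + $3) % 24 + 24) % 24 ))
}

# Example: 02:00 on a node at UTC+0 is 21:00 the previous day at UTC-5.
#   shift_hour 2 0 -5
```

The double modulo keeps the result in the range 0 to 23 even when the
intermediate value is negative. When the result wraps, the date on the other
node differs by a day, which has to be tracked separately.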


The scripts hvautoc_$.pl all contain an algorithm that will ensure proper
interpretation of leap year occurrences.

6.15 Digital volume check-sum or stamp

A unique check-sum (stamp) and a date stamp are generated as a volume enters
DIASER and are stored alongside the volume.

6.16 Complete removal

This will completely remove all DIASER components and all archive data stored 
within the system. Data recovery is not possible after this operation has been
performed.

7 Configuration

7.1 diaser.conf 

The supplied configuration file can be adjusted to suit your deployment
requirements. Each parameter name is in uppercase and must not be changed.
Change only the value to the right of each parameter name, keeping a single
space in between. The default values are there to guide your choice,
e.g. NODE_A 0.0.0.0 can be changed to NODE_A 192.168.2.1. Use the same
case and value type for your chosen values as the defaults.
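
The "PARAMETER value" layout described above can be read with a few lines of
code. The sketch below is in Python and is not the parser used by diaser. The
function name parse_conf is hypothetical, and skipping blank lines is an
assumption, as the manual does not say how they are treated.

```python
def parse_conf(text: str) -> dict[str, str]:
    """Parse 'NAME value' lines into a dictionary of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        # Split on the first space: uppercase name on the left, value on the right.
        name, _, value = line.partition(" ")
        if not name.isupper():
            raise ValueError(f"parameter names must be uppercase: {name!r}")
        settings[name] = value.strip()
    return settings


if __name__ == "__main__":
    sample = "NUM_YEARS 3\nSTART_YEAR 2011\nNODE_A 192.168.2.1\n"
    print(parse_conf(sample))
```

All values are kept as strings here; converting NUM_YEARS or the port numbers
to integers would be a natural next step.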

7.2 Number of years of expected operation 
NUM_YEARS

The minimum recommended value is 2; the default is 3.

7.3 First year of operation 
START_YEAR

This is the year when DIASER begins operation. This would usually be the
current year.

7.4 Start time of phases 
HOUR1
HOUR2

DIASER operates in two phases: phase one starts at HOUR1 and phase two starts
at HOUR2. The phases can be set to any time over a 24 hour period. It is
assumed that the start time is based on your local time zone, e.g. BST or
UTC+1. It is recommended to set the phases to early in the morning to avoid
using day time bandwidth resources. Once set, the operation runs at the same
time every day; it can be reset by sending a new configuration from diaser.
Using two phases optimises the use of resources when transferring internally
on a node and between nodes, prevents simultaneous transfers from
interfering with each other, and simplifies the management and tracking
of transfers.

7.5 Node IP addresses
NODE_A
NODE_B
NODE_C

The IP address of each node, in the format 0.0.0.0.

7.6 OpenSSH ports 
PORT_A
PORT_B
PORT_C

Change from the default port 22.

7.7 Dry run mode 
DRY_RUN

Copies are initiated but no archive data is transferred. This can be used
for testing, debugging and trials. It can be toggled at any time and the
new setting sent to the nodes in the same way as all other settings in this
section.

7.8 Lowest maximum bandwidth (LMB) 
LOW_MAX_BW

Bandwidth control. Enter the maximum speed, in KBytes per second, of your
slowest network connection between A->B, B->C or C->B. I recommend you run
some test transfers between nodes using scp. Also, don't assume the bandwidth
will remain constant throughout the cycle; you may need to run some long
term viability tests. This feature will be implemented automatically with
the subroutine calculate_lmb(). Adjust the value if you install more than one
diaser instance on a single disk or machine. The default is 12500 KBytes per
second (100 Mbits per second).
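
The default value follows from a simple unit conversion, sketched here in
Python. The function name is hypothetical and decimal units are assumed
(1 Mbit = 1,000,000 bits, 1 KByte = 1,000 bytes), which is what makes
100 Mbits per second equal to 12500 KBytes per second.

```python
def mbit_to_kbytes_per_second(mbit_per_second: float) -> float:
    """Convert a link speed in Mbit/s to KBytes/s using decimal units."""
    bits_per_second = mbit_per_second * 1_000_000
    bytes_per_second = bits_per_second / 8
    return bytes_per_second / 1_000


if __name__ == "__main__":
    for speed in (10, 100, 1000):
        print(speed, "Mbit/s =", mbit_to_kbytes_per_second(speed), "KBytes/s")
```

A measured throughput from a test transfer is a better input than the nominal
link speed, since real connections rarely sustain their rated bandwidth.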

7.9 Time zone compensation
TZONE_A
TZONE_B
TZONE_C

For deployments that span different time zones. Enter a UTC +/- integer value
for each of nodes A, B and C, e.g. if node A is on BST (UTC+1), use 1.

7.10 Working diaser account name 
USER_ACC

Choose a name for your DIASER user accounts. The same name will be used on
all three nodes. Limit this to between 5 and 10 lower case characters for
simplicity. I use diasertest, for example.

7.11 Time out
TOUT

The copy timeout used by rsync for transfers, in seconds. Set this lower than
the phase periods.
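
One way to check a chosen timeout is sketched below in Python. It assumes
that "phase period" means the gap between the start of one phase and the
start of the next, which is an interpretation and not something the manual
states. The function names are hypothetical.

```python
def phase_gaps_seconds(hour1: int, hour2: int) -> tuple[int, int]:
    """Seconds from HOUR1 to HOUR2, and from HOUR2 round to the next HOUR1."""
    gap = (hour2 - hour1) % 24
    return gap * 3600, (24 - gap) * 3600


def timeout_fits(tout: int, hour1: int, hour2: int) -> bool:
    """True if TOUT is shorter than the shortest gap between phase starts."""
    shortest = min(g for g in phase_gaps_seconds(hour1, hour2) if g > 0)
    return tout < shortest


if __name__ == "__main__":
    # Phases at 01:00 and 04:00 leave a 3 hour gap, so a 1 hour timeout fits.
    print(timeout_fits(3600, 1, 4))
```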


7.12 Home directories
DIR_A
DIR_B
DIR_C

The home directory of the diaser account on each node. You may need to adjust
this if a large partition is not in the usual home directory location,
e.g. /mnt/big/ will evaluate as /mnt/big/diaser.
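
The evaluation in the example above amounts to joining the configured
directory and the account name. A small Python sketch follows; it assumes the
appended component is the USER_ACC value, and the function name is
hypothetical.

```python
import posixpath


def home_directory(base_dir: str, account: str) -> str:
    """Join a DIR_x value and the account name into one home directory path."""
    # posixpath.join copes with or without a trailing slash on base_dir.
    return posixpath.join(base_dir, account)


if __name__ == "__main__":
    print(home_directory("/mnt/big/", "diaser"))
```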

7.13 FILL_START_TIME
The time at which the daily filling script is initiated (an hour between 0 and
23). This should be set in advance of the DIASER archive transfer phases to
ensure DIASER is filled before the phases begin.

7.14 VOLUME_DIR
The location of the volume storage directory, i.e. where the backup volumes
created by your backup software are stored.

7.15 DIFF_CONST_PREFIX
Differential or constant volume name prefix.

7.16 COLLECT_FULL
Choose whether full volumes are collected or not. Disable this if you simply
want to collect constant sized volumes, such as CCTV footage.

7.17 COLLECT_FULL_DAY
Day of month when full volumes are collected.

7.18 FULL_PREFIX
Full volume name prefix.

7.19 More than one configuration file 

It is possible to force diaser to read a particular configuration file by
executing:

]$diaser diaser.conf --opts

The configuration file can be named as the user chooses, e.g.

]$diaser my.config --opts

Currently, changes will always be written to diaser.conf in the directory
diaser was executed from. The user is free to change the name of the
configuration file and read it into diaser as described above. This feature is
particularly useful when there is more than one installation being managed
from a single user account.

8 Installation

]$./diaser --install

Run this as a normal user after you have configured diaser.conf. You will be
informed as each task is completed. At the end of installation you will need
to log in, one time only, from the diaser account on each node to every other
node to accept the certificates (SSH host keys) between nodes, as you would the
first time you SSH into a box: A->B, A->C, B->A, B->C, C->A and C->B.
Afterwards logins between nodes are password-less; this step allows DIASER to
begin work. This step may be automated depending on user feedback.
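
The six first-time logins are simply every ordered pair of the three nodes.
The Python sketch below lists them and builds an ssh command for each. The
function names are hypothetical, and the addresses, ports and account name in
the usage section are sample values borrowed from the test appliance in
Appendix C, not values diaser will necessarily use.

```python
from itertools import permutations


def login_pairs(nodes=("A", "B", "C")):
    """Every ordered (source, target) pair of distinct nodes."""
    return list(permutations(nodes, 2))


def first_login_commands(addresses, ports, user):
    """Return (source node, ssh command to run there) for each pair."""
    commands = []
    for source, target in login_pairs(tuple(addresses)):
        command = f"ssh -p {ports[target]} {user}@{addresses[target]}"
        commands.append((source, command))
    return commands


if __name__ == "__main__":
    addresses = {"A": "10.20.0.1", "B": "10.20.0.2", "C": "10.20.0.3"}
    ports = {"A": 22, "B": 22, "C": 22}
    for source, command in first_login_commands(addresses, ports, "diasertest"):
        print(f"on node {source}: {command}")
```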

9 Command Line Options

Please note, not all of these operations have been implemented. Please view
the most current development road-map:
http://diaser.svn.sourceforge.net/viewvc/diaser/ROADMAP_DEV. As such some
of these items may change or be removed altogether or others added. Later in
the development cycle I plan to extend command line options so configuration
changes can be set using the diaser command.

Run all commands from a prompt as a normal user, i.e.

]$diaser --install

9.1 --help 

Display menu and command line options.

 DIASER Usage: diaser_setup.pl 

    --help                 help|-?

    --bandwidth            calculate real bandwidth throughput between nodeX-Y
    --configure            question driven configuration tool
    --extend               extend maximum storage structure date 
    --install              install
    --list                 list all volumes in storage
    --lock                 lock all DIASER node accounts
    --logs                 condensed log readings from nodes
    --migrate              migrate node 
    --modify   [opts]      send modified configuration to nodes either
                           from conf file or command options or both
    --pause                pause operation 
    --recreate             recreate a single node from scratch
    --remove               remove from nodes, all data will be lost 
    --resume               resume operation 
    --retrieve [opts]      retrieve archive data 
    --stats                generate statistics 
    --stop                 stop operation 
    --upgrade              apply upgrades    
    --version              show version  
                                              
 For more information please use man diaser or the more detailed
 online manual: http://diaser.org.uk/manual.html

 Please send any FEEDBACK to dbrasher@interlinux.co.uk.
 I'm especially interested in how DIASER may be of use to you now or in the future.
 Thank you.


9.2 --bandwidth

This option allows you to view the real, not theoretical, data
throughput between two of your chosen storage nodes. You will need to have
the open source tool gnuplot installed on the system from which you are
running this application.
    
This tool will attempt to download and compile the binary NPtcp from the
NetPIPE utility suite: http://bitspjoule.org/netpipe/. The tool operates 
over port 5002 and stats will be collected from the sender.

9.3 --configure

Question driven configure tool for new and existing diaser
deployments with input validation. 

9.4 --extend 

Extend maximum storage structure beyond the currently installed year. 

9.5 --install 

Install DIASER. See the section 8 Installation above.

9.6 --list 

This option lists all volumes stored in DIASER.

9.7 --lock 

Lock all DIASER node accounts. The systems administrator will need to reset
the passwords for each diaser user account manually. 

9.8 --logs                 

Condensed log readings from nodes.

9.9 --migrate 

Migrate node to a different server. 

9.10 --modify 

Apply modified settings to the running DIASER on your designated
nodes. Any changed settings will also be written to diaser.conf.

9.11 --pause 

Pause any currently running data transfers on all nodes. Sends kill -STOP. 

9.12 --recreate

In case you need to rebuild a node. You should only need to rebuild a node in
the event of a disk failure or other non-recoverable node loss. In all other
cases please consider using the --migrate (node) option. 


--numyear           years of operation required
--startyear         year to begin storing archives, this can be in the past
--phase1            hour between 0 and 23
--phase2            hour between 0 and 23
--nodea             ip address in format 0.0.0.0
--nodeb             ip address in format 0.0.0.0
--nodec             ip address in format 0.0.0.0
--dryrun            boolean 1(y) or 0(n)
--lmb               lowest maximum bandwidth, KBytes per second
--tzone             [not yet implemented]
--tout              copy time out in seconds
--fillstarttime     time to run DIASER fill operation, hour between 0 and 23
--volumedir         the directory where your backup volumes reside
--diffconstprefix   prefix given to your Differential or constant volumes
--collectfull       are Full volumes to be collected or not, boolean 1(y) or 0(n)
--fullprefix        prefix given to your Full volumes

9.13 --remove 

Completely remove DIASER from your previously designated nodes. Please use
with caution as all archive data stored in DIASER will be permanently deleted.


9.14 --resume 

Resume paused data transfers. Sends kill -CONT. 

9.15 --retrieve 

Fetch archived data volumes.
A simple tool is provided, which also has additional command line options. The
retrieval tool will walk you through a set of questions, then list files for
you to pick and transfer. The file will retain its name and be located in the
diaser_rel directory.

--r_year    which year
--r_month   which month
--r_day     which day
--r_full    if not a day name a full directory - leave as default
--nodea     ip address in format 0.0.0.0
--nodeb     ip address in format 0.0.0.0
--nodec     ip address in format 0.0.0.0
--porta     int
--portb     int
--portc     int
--user_acc  user account name, usually default set previously


9.16 --stats 

Displays, for each node in GiB: disk space, total daily volumes, total full
volumes, total data stored and average differential volume size.


9.17 --stop 

Discontinue data transfers. Sends kill -9. 
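
The --pause, --resume and --stop options are described above as sending
kill -STOP, kill -CONT and kill -9 respectively. The Python sketch below shows
that mapping using standard POSIX signals. How diaser locates the transfer
processes on each node is not described in this manual, so the pid argument
and the function name are assumptions for illustration.

```python
import os
import signal

# Signals named in sections 9.11, 9.14 and 9.17 of this manual.
SIGNAL_FOR_OPTION = {
    "--pause": signal.SIGSTOP,   # kill -STOP: suspend the process
    "--resume": signal.SIGCONT,  # kill -CONT: continue a suspended process
    "--stop": signal.SIGKILL,    # kill -9: terminate immediately
}


def send_option_signal(option: str, pid: int) -> None:
    """Send the signal associated with a diaser option to one process."""
    os.kill(pid, SIGNAL_FOR_OPTION[option])
```

A paused process keeps its open files and memory, which is why a transfer can
continue after --resume, whereas a process ended by --stop cannot be resumed.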

9.18 --upgrade

Apply product upgrades to existing nodes with a DIASER installation.
Your DIASER account password will be requested.

9.19 --version 

Show current DIASER and currently installed Perl version. 

10 Operation

10.1 Stop 

This option will stop DIASER copies currently in operation, until the next set 
of transfer operations are initiated. This will kill any rsync processes.


10.2 Pause 
    
This option will pause DIASER copies currently in operation, until the resume
option is used.



10.3 Resume 
    
This option will resume DIASER copies that have been paused.



10.4 Hard Lock

Lock all DIASER node accounts. This is a security feature that enables an
operator with root access to lock all DIASER node accounts immediately.
Access from node to node, and hence operation, will resume only after logging
in to the nodes as root and re-enabling the DIASER account password.


10.5 Migrate node

Migrate will assist you in moving an existing node from the current machine, 
server or workstation, to a new one. This may be located anywhere as long as 
it satisfies the requirements for DIASER inter-node-visibility.
            
The procedure may take anywhere from minutes to hours depending on the
amount of data stored on the existing node and network bandwidth available.


11 The Code

11.1 Why Perl? 

The language is very well suited to Linux and POSIX environments. It is well
supported and has good network programming capabilities. Perl is very flexible
and allows a simple yet robust coding environment. Its cross platform
properties are extremely valuable and ensure the code base is portable. Perl's
inherent text parsing abilities are also valuable and set the language apart
from many other contenders.

11.2 Style 

Style is based as much as possible on the excellent O'Reilly book Perl Best
Practices by Damian Conway. A modular approach is used to code DIASER. All
subroutines take parameters derived from the configuration mechanisms. Only
three global variables are used; the rest are passed directly to subroutines
and return values read back.

11.3 Modules 

Popular modules are used where possible, and only modules that are shipped
with popular Linux distributions. The installer uses a number of modules; the
code deployed on nodes only uses File::Find (shipped as default with most
distributions) and the core Perl shipped as default by most Linux
distributions.

11.4 Error handling 

Under review.

11.5 Contribute 

Please see http://www.diaser.org.uk/contribute.html. All contributions are
received under MIT/X licence terms.

12 Online resources

12.1 Website 

http://www.diaser.org.uk

12.2 SourceForge 

http://sourceforge.net/projects/diaser

12.3 Mailing list 

https://lists.sourceforge.net/lists/listinfo/diaser-devel

12.4 DIAP/LTASP and early project memory 

http://www.diap.org.uk

APPENDIX

A Tables and calculations
    Bandwidth and capacity lookup table
    ===================================
    Capacity in GB (decimal) transferred in the given number of hours
    BW      Hours
    Mbit/s  1    2   3    4    5     6
    1       0.45 0.9 1.35 1.8  2.25  2.7
    10      4.5  9   13.5 18   22.5  27
    100     45   90  135  180  225   270
    1000    450  900 1350 1800 2250  2700
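
    The figures in the bandwidth and capacity table can be reproduced with
    the short Python sketch below. It assumes decimal units (1 Mbit =
    1,000,000 bits, 1 GB = 1,000,000,000 bytes) and a link that is fully
    used for the whole period. The function name is hypothetical.

```python
def capacity_gb(mbit_per_second: float, hours: float) -> float:
    """Decimal GB transferred at a constant rate for a number of hours."""
    bytes_per_second = mbit_per_second * 1_000_000 / 8
    return bytes_per_second * hours * 3600 / 1_000_000_000


if __name__ == "__main__":
    # Print one row per bandwidth, one column per hour count from 1 to 6.
    for bandwidth in (1, 10, 100, 1000):
        row = [round(capacity_gb(bandwidth, h), 2) for h in range(1, 7)]
        print(bandwidth, row)
```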

    Disk space lookup table
    =======================
    BW      Month   1xYr    2xYr
    Mbit/s
    1       20GiB   240GiB  480GiB
    10      67GiB   804GiB  9.6TiB
    100     542GiB  6.5TiB  78TiB
    1000    5.2TiB  62.4TiB 748.8TiB


For more calculation information please use the --bandwidth tool.

Include more calculation examples.

B Glossary of terms

Under review

C Appliances

DIASER-appliance-3node-OVF-test-pak

Getting started:
----------------
Welcome to this 3 node pre-configured DIASER appliance test pack.


Unzip and import the three appliances into your virtual machine hypervisor. 
The network is internal only. Images were created using the freely available, 
cross-platform, VirtualBox. You can also test DIASER whilst using Windows.


Things to try:
--------------
Test data is read from /mnt/backup on nodeA and generated by a cron job, then
distributed. You can view logs and other activity by running
#diaser diaser.conf --logs from nodeA (logged in as diaser-user with password
diaser-user). Use diasertest when the node password is requested.
Also run $man diaser for more options. Explore the working accounts too.


Leave the system running for a few days and watch the test data inside DIASER 
using --list. 


Pack contents:
--------------
3 x OVF images; based on Ubuntu 32bit 10.04.1 LTS
    diaser-appliance-nodeA
    diaser-appliance-nodeB
    diaser-appliance-nodeC
diaser.conf - node construction is based on this config file
appliance_instructions.txt
manual.pdf
--list screenshot


General node specs:
-------------------
256MiB Ram (PAE CPU mode) 
Up to 2TB dynamically expanding disk
Internal network intnet
Hostname - diaser
DIASER working account/pass - diasertest/diasertest


Node specific:
--------------
A) IP 10.20.0.1
DIASER user account/pass, diaser-user/diaser-user


B) IP 10.20.0.2


C) IP 10.20.0.3


Security precautions:
---------------------
This is a test pack. If you do decide to put the appliance into a
production environment, you must change all user account passwords.


NB: The nodeA Perl build has not been performance tuned.




