
I have a Windows VM running under Xen, and want to take snapshots of all the LVs it is using as its physical disks, e.g.:

vg0/lv0 - OS
vg0/lv1 - FILESTREAM data
vg1/lv2 - Databases
vg2/lv3 - Database Logs

VGs 0, 1, and 2 are on different sets of spindles to give higher IOPS. The problem is that the four snapshots need to be created atomically, or else my database and transaction logs will end up out of sync, as there could be a few ms between the snapshot of lv2 and the snapshot of lv3.

I think this isn't possible based on the documentation I've read, but maybe someone here knows different?

Being Windows, I could use VSS to fudge it, but making an automatic procedure to do that in conjunction with LVM is more complicated than I want to think about right now!

Thanks

James

On 22/05/12 14:41, James Harper wrote:
I have a windows VM running under Xen, and want to take snapshots of all the LV's it is using as it's physical disks, eg:
vg0/lv0 - OS
vg0/lv1 - FILESTREAM data
vg1/lv2 - Databases
vg2/lv3 - Database Logs
vg's 0, 1, and 2 are on different sets of spindles to give higher IOPS. The problem is that the 4 snapshots need to be created atomically or else I'm going to get my database and transaction logs out of sync as there could be a few ms between the snapshot of lv2 and lv3.
I think this isn't possible based on the documentation I've read, but maybe someone here knows different?
I'm not sure which database you're using, but if it's PostgreSQL, then there are some specific functions you can use to help in these situations (pg_start_backup() and pg_stop_backup(), for example).

Toby

On 22/05/12 14:41, James Harper wrote:
I have a windows VM running under Xen, and want to take snapshots of all the LV's it is using as it's physical disks, eg:
vg0/lv0 - OS
vg0/lv1 - FILESTREAM data
vg1/lv2 - Databases
vg2/lv3 - Database Logs
vg's 0, 1, and 2 are on different sets of spindles to give higher IOPS. The problem is that the 4 snapshots need to be created atomically or else I'm going to get my database and transaction logs out of sync as there could be a few ms between the snapshot of lv2 and lv3.
I think this isn't possible based on the documentation I've read, but maybe someone here knows different?
I'm not sure which database you're using, but if it's PostgreSQL, then there are some specific functions you can use to help in these situations.
It's MSSQL, and I can use VSS to give me a consistent copy of all the volumes, but I was hoping I could do it all transparently to Windows.

James

On Tue, 22 May 2012, James Harper <james.harper@bendigoit.com.au> wrote:
vg's 0, 1, and 2 are on different sets of spindles to give higher IOPS. The problem is that the 4 snapshots need to be created atomically or else I'm going to get my database and transaction logs out of sync as there could be a few ms between the snapshot of lv2 and lv3.
If a database is designed correctly, shouldn't it be possible to be consistent even if the snapshots are taken at different times?

The purpose of the log is that logged transactions can be replayed if they didn't get to the main data store. So if you snapshot the database filesystem and then snapshot the log a bit later, the DB server should just replay the newer transactions and everything should be fine - according to my knowledge of databases.

Am I wrong in my understanding of databases?

FWIW I have my MySQL VMs set up with a /mysql filesystem that contains database files under /mysql/db and logs under /mysql/log.

It's generally claimed that the benefits of having a separate filesystem for the database log at the low end involve different RAID levels. For high-end databases I've seen an Oracle document about how to tune different ZFS filesystems for different parts of an Oracle database. But both of those cases probably aren't great options for a VM.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Tue, 22 May 2012, James Harper <james.harper@bendigoit.com.au> wrote:
vg's 0, 1, and 2 are on different sets of spindles to give higher IOPS. The problem is that the 4 snapshots need to be created atomically or else I'm going to get my database and transaction logs out of sync as there could be a few ms between the snapshot of lv2 and lv3.
If a database is designed correctly shouldn't it be possible to be consistent even if the snapshots are at different times?
The purpose of the log is that logged transactions can be replayed if they didn't get to the main data store. So if you snapshot the database filesystem and then snapshot the log a bit later then the DB server should just replay the newer transactions and everything should be fine - according to my knowledge of databases.
Am I wrong in my understanding of databases?
You might be wrong if you hadn't qualified your statement with "if designed correctly" :) But my example was illustrative... it would also be possible for filegroups from the same DB to be scattered across 10 different logical volumes, and any inconsistency between them would be disastrous.

In addition, MSSQL is a bit funny with respect to transaction logs. You can replay backed-up logs against a restored backup of a database, provided you have an unbroken sequence of logfile backups, but you can't replay the current logfile against a previous live copy of the database.

MS Exchange is much better in this regard and works as you describe, with zero-loss recovery provided either the database or the logfiles survive the disaster, but I think it still relies on the act of restoring the database to be able to replay the logs - you can't just pick up where you left off if one of the volumes suddenly does a time warp.

James

On 22/05/12 17:17, Russell Coker wrote:
On Tue, 22 May 2012, James Harper<james.harper@bendigoit.com.au> wrote:
vg's 0, 1, and 2 are on different sets of spindles to give higher IOPS. The problem is that the 4 snapshots need to be created atomically or else I'm going to get my database and transaction logs out of sync as there could be a few ms between the snapshot of lv2 and lv3.
If a database is designed correctly shouldn't it be possible to be consistent even if the snapshots are at different times?
The purpose of the log is that logged transactions can be replayed if they didn't get to the main data store. So if you snapshot the database filesystem and then snapshot the log a bit later then the DB server should just replay the newer transactions and everything should be fine - according to my knowledge of databases.
Am I wrong in my understanding of databases?
I think you're missing a bit of the understanding of how they work, but the end result is the same -- if you capture the main datastore and enough surrounding logs, then you're usually OK.

In case you're interested in how PostgreSQL (and, to a lesser degree, SQLite) work with logs: keep in mind that your database needs to maintain quite a few sets of the *same* info, e.g. when one user has started a transaction and is working on some data, but another user has come along and changed that data in a separate transaction.

Firstly, you need to keep uncommitted changes to data around without committing them to the primary data store. Secondly, once a change is committed, you still can't write over the old data until every other transaction looking at it has finished too. Thirdly, databases were written with spinning disks in mind, which are terrible at random write I/O, so they tend to write everything out to log files sequentially, then write it to the main datastore in the background as time allows.

So the "log" files are actually quite important -- they contain data that may *not be allowed (yet)* to be written to the main store.
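The write-ahead pattern described above can be sketched as a toy model in a few lines of Python (purely illustrative; no real engine is this simple, and the names here are made up for the sketch):

```python
# Toy write-ahead log: every change is appended to the log first;
# the main store is only updated later, at checkpoint time.
class ToyWAL:
    def __init__(self):
        self.log = []      # sequential log of committed changes
        self.store = {}    # main data store (random-access writes)

    def commit(self, key, value):
        # The change is durable once it is in the log, even though
        # the main store has not been touched yet.
        self.log.append((key, value))

    def checkpoint(self):
        # Apply logged changes to the main store, then trim the log.
        for key, value in self.log:
            self.store[key] = value
        self.log.clear()

    def recover(self):
        # After a crash, replay any logged changes that never
        # reached the main store.
        for key, value in self.log:
            self.store[key] = value

db = ToyWAL()
db.commit("a", 1)
db.checkpoint()       # "a" now lives in the main store
db.commit("b", 2)     # "b" exists only in the log...
db.recover()          # ...until recovery replays it
print(db.store)       # {'a': 1, 'b': 2}
```

This is why Russell's intuition holds only if the log and store are captured as one consistent state: recovery assumes the log it replays matches the store it replays into.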
FWIW I have my MySQL VMs setup with a /mysql filesystem that contains database files under /mysql/db and logs under /mysql/log.
It's generally claimed that the benefits of having a separate filesystem for the database log at the low end involve different RAID levels. For high end databases I've seen an Oracle document about how to tune different ZFS filesystems for different parts of an Oracle database. But both those cases probably aren't great options for a VM.
Indeed -- as mentioned above, the logs are all sequential writes, whereas the data store is random writes. (Both then feature random and sequential reads, but the logs are usually still in your file system cache so it doesn't matter for them)

On Tue, 22 May 2012 17:17:44 +1000, Russell Coker <russell@coker.com.au> wrote:
On Tue, 22 May 2012, James Harper <james.harper@bendigoit.com.au> wrote:
vg's 0, 1, and 2 are on different sets of spindles to give higher IOPS. The problem is that the 4 snapshots need to be created atomically or else I'm going to get my database and transaction logs out of sync as there could be a few ms between the snapshot of lv2 and lv3.
If a database is designed correctly shouldn't it be possible to be consistent even if the snapshots are at different times?
No. Typically, the log tail is trimmed when a checkpoint is performed (writing things back from the log to the data file). If you have copied half the data file, a checkpoint then modifies the first half of the data file, and the log wraps... you're pretty hosed no matter what.
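The failure mode described here can be shown with a toy model (illustrative only): snapshot the data file, let a checkpoint trim the log in between, then snapshot the log, and a committed change is in neither copy.

```python
# Toy model: data file and log are snapshotted at different times,
# with a checkpoint (which trims the log) happening in between.
store, log = {}, []

def commit(key, value):
    log.append((key, value))        # change is durable in the log

def checkpoint():
    for key, value in log:          # write log contents back...
        store[key] = value
    log.clear()                     # ...then trim the log tail

commit("x", 1)

snap_store = dict(store)            # snapshot the data file: "x" is only in the log
checkpoint()                        # "x" moves to the store, log is trimmed
snap_log = list(log)                # snapshot the log a moment later: now empty

# Recovery from the two snapshots: replay the log copy over the
# data file copy. The committed "x" is in neither snapshot.
recovered = dict(snap_store)
for key, value in snap_log:
    recovered[key] = value
print(recovered)                    # {} -- committed data lost
```

With an atomic snapshot of both volumes, either the store copy or the log copy would have contained "x"; the skew between the two copies is what loses it.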
The purpose of the log is that logged transactions can be replayed if they didn't get to the main data store. So if you snapshot the database filesystem and then snapshot the log a bit later then the DB server should just replay the newer transactions and everything should be fine - according to my knowledge of databases.
This would be incredibly DBMS specific, and there would be caveats and possibilities where you end up with either missing, corrupted or completely useless data.
FWIW I have my MySQL VMs setup with a /mysql filesystem that contains database files under /mysql/db and logs under /mysql/log.
It depends which logs. If it's the binary logs used for replication, then those have a looser connection than the logs for InnoDB. Luckily (spruiking the open source product my employer develops), XtraBackup is there to take consistent, non-blocking backups of InnoDB.
It's generally claimed that the benefits of having a separate filesystem for the database log at the low end involve different RAID levels. For high end databases I've seen an Oracle document about how to tune different ZFS filesystems for different parts of an Oracle database. But both those cases probably aren't great options for a VM.
I've never seen any benchmark that says ZFS should even be considered for databases (nor btrfs, for that matter). Basically, it's still the case that step 1 for database performance is: use XFS.

--
Stewart Smith

On Wed, 23 May 2012, Stewart Smith <stewart@flamingspork.com> wrote:
Typically, the log tail is trimmed as things are written back as a checkpoint is performed (writing back things from log to data file). If you have copied 1/2 the data file and checkpoint gets done that is in the first half of the data file and the log wraps... you're pretty hosed no matter what.
Thanks for the explanation.
I've never seen any benchmark that says ZFS should even be considered for databases (nor btrfs for that matter).
Basically, it's still; step 1 for database performance: use XFS.
Does XFS have features equivalent to L2ARC and ZIL? How can a filesystem without such features compare?

Also, for a database like MySQL that doesn't have anything in the way of data checksums, surely the only thing to do is to use a filesystem like BTRFS or ZFS.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Wed, 23 May 2012 16:16:42 +1000, Russell Coker <russell@coker.com.au> wrote:
Basically, it's still; step 1 for database performance: use XFS.
Does XFS have features equivalent to L2ARC and ZIL? How can a filesystem without such features compare?
Some people use FlashCache with MySQL/Percona Server; this has the benefit of providing an L2 cache to a block device.

The big performance boost comes from the file system getting out of the way of the database and having very little impact on the I/O paths. E.g. XFS is one of the few (only?) filesystems that allows multiple simultaneous O_DIRECT writers on the same inode.

The O_DIRECT write-in-place workload of database servers, with many fsync() calls for whatever reason (well, there are several), doesn't mix well with ZFS/btrfs-style filesystems. Well, it does for correctness, just not for performance.
Also for a database like MySQL that doesn't have anything in the way of data checksums then surely the only thing to do is to use a filesystem like BTRFS or ZFS.
InnoDB has a checksum for each page. MyISAM doesn't, but it's not crash-safe anyway. The replication log in 5.6 does have a checksum now (IIRC).

--
Stewart Smith

If a database is designed correctly shouldn't it be possible to be consistent even if the snapshots are at different times?
No.
Typically, the log tail is trimmed as things are written back as a checkpoint is performed (writing back things from log to data file). If you have copied 1/2 the data file and checkpoint gets done that is in the first half of the data file and the log wraps... you're pretty hosed no matter what.
That's one mode of logging, but it isn't the only one. MSSQL can run in 'FULL' or 'SIMPLE' logging mode. 'SIMPLE' works much the way you describe, where the log is thrown away once the transaction is committed, but in 'FULL' logging mode the logs are kept until backed up or truncated manually. This enables point-in-time restore if the backups are done correctly, e.g. you can restore to right before you ran a DELETE but forgot the WHERE clause. Exchange has similar options.

Neither will allow seamless recovery from disk snapshots taken at different points in time, though.

James

James Harper wrote:
I have a windows VM running under Xen, and want to take snapshots of all the LV's it is using as it's physical disks, eg:
vg0/lv0 - OS
vg0/lv1 - FILESTREAM data
vg1/lv2 - Databases
vg2/lv3 - Database Logs
vg's 0, 1, and 2 are on different sets of spindles to give higher IOPS. The problem is that the 4 snapshots need to be created atomically or else I'm going to get my database and transaction logs out of sync as there could be a few ms between the snapshot of lv2 and lv3.
AFAIK you can't do that.

In Scalix, IIRC, there was an option to say "get your shit in order and be quiescent until further notice, or fifteen seconds, whichever elapses first." Then you'd make your snapshots, then (since it was probably <<15s) tell Scalix "OK, you can buffer writes again".

I have no idea how well that actually worked, since I (thankfully) never had to restore that box from backups.

Plan B, of course, would be to dump the database, then back that up.

vg's 0, 1, and 2 are on different sets of spindles to give higher IOPS. The problem is that the 4 snapshots need to be created atomically or else I'm going to get my database and transaction logs out of sync as there could be a few ms between the snapshot of lv2 and lv3.
AFAIK you can't do that.
Thanks for the confirmation... that's pretty much what google told me but I thought maybe someone else might know something more.
In IIRC Scalix there was an option to say "get your shit in order and be quiescent until further notice, or fifteen seconds, whichever elapses first." Then you'd make your snapshots, then (since it was probably <<15s) say to scalix "OK, you can buffer writes again".
I have no idea how well that actually worked, since I (thankfully) never had to restore that box from backups.
This is where Windows is _really_ good. Pretty much all the backup infrastructure these days is built around VSS which allows you to take, at the very least, a crash-consistent copy of all volumes in the system. On top of that, application-specific VSS Writers ensure that their databases are shutdown-consistent before the snapshot is taken. From the snapshot you can back up files or blocks as you see fit.
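For reference, the useful property of VSS here is that one shadow-copy set covers several volumes in a single atomic snapshot. A minimal DiskShadow script (available on Windows Server 2008 and later, run with `diskshadow /s script.txt`; the drive letters and aliases below are placeholders, not James's actual layout) might look like:

```text
# Create one shadow copy set covering all four volumes at once,
# invoking any registered VSS writers (e.g. SQL Server's) first.
SET CONTEXT PERSISTENT
BEGIN BACKUP
ADD VOLUME C: ALIAS OS
ADD VOLUME D: ALIAS FileStream
ADD VOLUME E: ALIAS Databases
ADD VOLUME F: ALIAS Logs
CREATE
END BACKUP
```

The catch, as noted above, is wiring this Windows-side step into an LVM snapshot procedure on the Xen host.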
Plan B, of course, would be to dump the database, then backup that.
I want to take a consistent snapshot of the whole machine and then bring it up in another VM for testing. A procedure more complex than 'take snapshot, start VM' is not the desired outcome.

Thanks again

James
participants (5)
- James Harper
- Russell Coker
- Stewart Smith
- Toby Corkindale
- Trent W. Buck