how to backup/restore catalog and content without reindexing
Ed Devlin
2004-09-10 09:33:03 UTC
We are building an application which involves large amounts of indexed HTML
content - say 100,000 HTML files with associated graphics. Our
development/test system uses a content generator to create large amounts of
sample content, which is then indexed for searching. Indexing this content
can take a long time. Inevitably we have to redeploy the application every
so often, which could involve rebuilding the whole system.

We would like to be able to back up the content folders and their indexing
catalogs so that Indexing Service does not have to re-index the content every
time. But my guess is that if we restore the content folders, Indexing
Service will see that they have been touched, and want to re-index them.

Is there a way to back up and restore content with its catalog so that
Indexing Service does not feel a need to re-index them from scratch?

Any ideas much appreciated!

Ed
Hilary Cotter
2004-09-10 18:22:00 UTC
No, but you can burn a catalog onto an index if your drive is a FAT drive.
--
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Ed Devlin
2004-09-13 09:13:21 UTC
Thanks for your reply Hilary. I should have given more information.

The reason we chose Indexing Service as our search engine is because it
respects the underlying NTFS permissions on the content when searching (we
use impersonation to search within the security context of the authenticated
web user). This works fine.

But this means we have to use an NTFS drive - FAT is not an option. If we
were to copy the catalog and content onto a separate FAT drive as an archive,
we would lose the NTFS permissions. And if we restored the catalog and
content from the FAT drive, we would have to reapply all the NTFS permissions
- thus touching every file and causing Indexing Service to re-index all the
content.
Post by Hilary Cotter
burn a catalog onto an index if your drive is a fat drive.
[I thought that the catalog was the store of indexes]
Do you have any instructions on this?

But I suspect it wouldn't help for the reasons above.

Does this make sense, and do you have any other ideas?

Your help is very much appreciated.

Ed
Hilary Cotter
2004-09-13 12:40:55 UTC
What I was suggesting is that if you need to back up and recover an index,
the only way to do this without having the catalog rebuilt is to put your
content and catalog on a FAT drive. Build the catalog. When it is built, stop
the catalog and copy the contents of catalog.wci to the CD. Then copy the
files you have indexed to the CD, preserving the directory structure. Burn
the CD.

When you place the CD in a new machine it will come up as removabledrive_Z
in ciadv.msc.

You can then query it through the MMC or a web page.

This will not be an option for you as the security is not preserved.
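
The copy steps above could be staged on disk before burning with something
like the following Python sketch. stage_for_cd and the folder names in the
demo are made up for illustration; it assumes the catalog has already been
stopped and that catalog.wci belongs at the root of the disc.

```python
import shutil
import tempfile
from pathlib import Path


def stage_for_cd(catalog_dir: Path, content_root: Path, staging_root: Path) -> None:
    """Copy a stopped catalog and its indexed content into a staging folder."""
    # catalog.wci sits directly under the staging root (the future disc root).
    shutil.copytree(catalog_dir, staging_root / "catalog.wci")
    # The content tree is copied whole, so its directory structure is preserved.
    shutil.copytree(content_root, staging_root / content_root.name)


if __name__ == "__main__":
    # Throwaway demo data; real paths would point at the catalog and content.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src"
        (src / "catalog.wci").mkdir(parents=True)
        (src / "catalog.wci" / "CiSP0000.000").write_bytes(b"")
        (src / "contentA" / "docs").mkdir(parents=True)
        (src / "contentA" / "docs" / "page.html").write_text("<p>sample</p>")

        staging = Path(tmp) / "staging"
        stage_for_cd(src / "catalog.wci", src / "contentA", staging)
        for path in sorted(staging.rglob("*")):
            print(path.relative_to(staging))
```

Nothing here carries NTFS permissions across, so the limitation mentioned
above still applies.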
Hilary Cotter
2004-09-13 13:27:07 UTC
Not sure about this one.

I believe the catalog.wci folder must be in the root, i.e. d:\catalog.wci,
where d is your CD drive letter.

The paths stored are relative to the CD. I am not sure if this is the case
for all paths indexed by IS.

The Site Search installation CD shipped with such a catalog burned on it.
Post by Ed Devlin
Thanks for the reply Hilary. You're right that because we need to preserve
NTFS permissions, the burn to CD option is not going to help.
But just out of interest...
If we burned the catalog.wci folder and the content folder tree (preserving
the folder structure) to a CD, would Indexing Service be able to find the
original content? I would have thought that the indexes would contain
absolute filepaths to each file, and therefore even if the catalog folder and
content folder were in the same relative positions, they wouldn't match.
Or does the catalog contain relative file paths? In which case, surely this
would break if one ever relocated the catalog folder? Does your technique
rely on the catalog folder being in the same place as the content? e.g.
catalog at d:\mycontenttree\catalog.wci
content under d:\mycontenttree\contentA
content under d:\mycontenttree\contentB
content under d:\mycontenttree\contentC
Cheers
Ed
Stu
2004-09-14 09:51:47 UTC
Post by Ed Devlin
But this means we have to use an NTFS drive - FAT is not an option. If we
were to copy the catalog and content onto a separate FAT drive as an archive,
we would lose the NTFS permissions. And if we restored the catalog and
content from the FAT drive, we would have to reapply all the NTFS permissions
- thus touching every file and causing Indexing Service to re-index all the
content.
We back up documents using WinRAR. There is an option to back up security
data. Then you can burn to a CD/DVD (CDFS) and restore to an NTFS drive.

Cheers
Stu
Stu
2004-09-14 14:45:53 UTC
Post by Ed Devlin
I guess I was hoping for an Indexing Server option or hack that says: "you
have already indexed this stuff, honestly, so don't re-index it, even though
it appears to have changed, and only index new content which is subsequently
added or updated after date x" - a kind of "freeze the catalog up to this
point" thing.
Ahh, now I understand what you want!

Don't know how good your VB or C++ is but what about this for an idea? Hook
the Windows message that IS uses (e.g. FindFirstChangeNotification) and
redirect it or just stop it. Just need to find out which Windows message IS
uses to signify that a file has changed.

Do-able? Anybody?

Cheers
Stu
Ton Plooy
2004-09-14 16:59:40 UTC
Post by Ed Devlin
I guess I was hoping for an Indexing Server option or hack that says: "you
have already indexed this stuff, honestly, so don't re-index it, even though
it appears to have changed, and only index new content which is subsequently
added or updated after date x" - a kind of "freeze the catalog up to this
point" thing.
The assumption seems to be that restoring the content folders will mark them
as changed and hence the index service will start to re-index. I'm not sure
if this is valid anyway, but if it is, you could perhaps make sure that all
attributes are reset after a restore (and before the index service is
started). You need to find out on which attributes the IS reacts (e.g. not
the archive bit I guess), or maybe someone can tell. But, maybe there's more
to it, see the next part below.
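
A minimal Python sketch of the attribute-reset idea, assuming that file
timestamps are the attributes in question (not confirmed anywhere in this
thread). snapshot_times and reapply_times are hypothetical helper names; they
record access and modification times before a backup and put them back after
a restore. Note that setting timestamps is itself a change to the file, so
this may not be enough on its own.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, Tuple

# Maps a path relative to the content root to (atime_ns, mtime_ns).
TimeMap = Dict[str, Tuple[int, int]]


def snapshot_times(root: Path) -> TimeMap:
    """Record access and modification times for every file under root."""
    times: TimeMap = {}
    for path in root.rglob("*"):
        if path.is_file():
            st = path.stat()
            times[path.relative_to(root).as_posix()] = (st.st_atime_ns, st.st_mtime_ns)
    return times


def reapply_times(root: Path, times: TimeMap) -> int:
    """Put recorded times back on files that still exist; return the count."""
    count = 0
    for rel, (atime_ns, mtime_ns) in times.items():
        target = root / rel
        if target.is_file():
            os.utime(target, ns=(atime_ns, mtime_ns))
            count += 1
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "1067.txt").write_text("sample")
        before = snapshot_times(root)             # taken before the backup
        (root / "1067.txt").write_text("sample")  # stands in for a restore
        print("files reset:", reapply_times(root, before))
```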
Post by Stu
Ahh, now I understand what you want!
Don't know how good your VB or C++ is but what about this for an idea? Hook
the Windows message that IS uses (e.g. FindFirstChangeNotification) and
redirect it or just stop it. Just need to find out which Windows message IS
uses to signify that a file has changed.
Do-able? Anybody?
I don't think that IS uses FindFirstChangeNotification, this wouldn't work
when a file is touched when IS is stopped. I believe I read somewhere that
IS uses file journalling, e.g. see FSCTL_QUERY_USN_JOURNAL.
I don't have any experience with this API, but you may be able to query and
clear the journal on your files yourself, before IS is restarted. That way,
any changes (from the restore or otherwise) won't have an effect on IS.

Ton
Ton Plooy
2004-09-14 18:37:23 UTC
Post by Ton Plooy
....
I don't think that IS uses FindFirstChangeNotification, this wouldn't work
when a file is touched when IS is stopped. I believe I read somewhere that
IS uses file journalling, e.g. see FSCTL_QUERY_USN_JOURNAL.
I don't have any experience with this API, but you may be able to query and
clear the journal on your files yourself, before IS is restarted. That way,
any changes (from the restore or otherwise) won't have an effect on IS.
I looked into the NTFS Change Journal a bit and it seems that deleting
existing entries is not an option (and resetting attributes is not an
option either). The only thing that can be deleted is the complete change
log, but that seems a bit drastic. Here's some journal output from a simple
test:

D:\..\IndSvc\Index\KB\da3\Content\Shared\Email\1067.txt
SourceInfo = 0
BasicInfoChange
BasicInfoChange, *Close*
DataTruncation
DataTruncation, *Close*
DataTruncation
DataExtend, DataTruncation
DataOverwrite, DataExtend, DataTruncation
DataOverwrite, DataExtend, DataTruncation, BasicInfoChange
DataOverwrite, DataExtend, DataTruncation, BasicInfoChange, *Close*
DataOverwrite

I first changed the file's date and time (one hour earlier) for this file,
which resulted in the BasicInfoChange lines. After that I copied a backup
over this file, resulting in the additional journal entries. A little while
later I started seeing journal entries like this:

D:\..\IndSvc\catalog.wci\CiSP0000.000
SourceInfo = 0
DataOverwrite
...

This is the catalog for my Index directory above; the IS kicks in due to the
changes.
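
For anyone reading raw journal records instead of pre-decoded output: the
reason names in the listing above correspond to bit flags in each record's
Reason field. A minimal Python sketch of that decoding follows. The bit values
are the USN_REASON_* constants from winioctl.h (only a subset is listed); the
display names and the describe_reason helper are illustrative choices made
here to mirror the listing, not part of any API.

```python
# Subset of the USN_REASON_* bit flags from the Windows SDK header winioctl.h.
# The display names are chosen to resemble the journal listing above.
USN_REASONS = [
    (0x00000001, "DataOverwrite"),
    (0x00000002, "DataExtend"),
    (0x00000004, "DataTruncation"),
    (0x00000100, "FileCreate"),
    (0x00000200, "FileDelete"),
    (0x00000800, "SecurityChange"),
    (0x00001000, "RenameOldName"),
    (0x00002000, "RenameNewName"),
    (0x00008000, "BasicInfoChange"),
    (0x80000000, "*Close*"),
]


def describe_reason(mask: int) -> str:
    """Turn a Reason bit mask into a comma-separated list of flag names."""
    names = [name for bit, name in USN_REASONS if mask & bit]
    known = 0
    for bit, _ in USN_REASONS:
        known |= bit
    # Bits not in the table above are reported as a hex remainder.
    rest = mask & ~known & 0xFFFFFFFF
    if rest:
        names.append(f"0x{rest:08X}")
    return ", ".join(names)


if __name__ == "__main__":
    print(describe_reason(0x00008000))  # BasicInfoChange
    print(describe_reason(0x80008007))
    # DataOverwrite, DataExtend, DataTruncation, BasicInfoChange, *Close*
```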

Ton
Hilary Cotter
2004-09-23 11:19:01 UTC
I'm not sure if this will be helpful, but if you want to freeze the catalog,
i.e. stop it from indexing but still have it respond to queries, all you do is
pause the cisvc service
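
As a rough illustration, pausing and resuming could be scripted from Python by
calling the Windows sc command. The service name cisvc comes from the post;
service_command and run_service_command are hypothetical helper names, and the
sketch assumes a Windows host and an account allowed to control services.

```python
import subprocess
from typing import List

SERVICE_NAME = "cisvc"  # Indexing Service, as named in the post above


def service_command(action: str, service: str = SERVICE_NAME) -> List[str]:
    """Build the sc command line for pausing or resuming a service."""
    if action not in ("pause", "continue"):
        raise ValueError("action must be 'pause' or 'continue'")
    return ["sc", action, service]


def run_service_command(action: str) -> int:
    """Run the command and return its exit code (Windows only)."""
    return subprocess.run(service_command(action), check=False).returncode


if __name__ == "__main__":
    # Printed rather than executed, so the sketch is safe to run anywhere.
    print(" ".join(service_command("pause")))
    print(" ".join(service_command("continue")))
```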
chris
2007-06-19 22:10:49 UTC
I spoke to Microsoft.
Indexing Service uses the USN journal to keep track of what it needs to index.
If you pause the service, as soon as it starts up again it will rescan the
whole journal and start indexing where it last left off. I am having a hard
time with my backup software. They back up my files and then rewrite the last
accessed date so it appears as if they never accessed the file. The problem is
that any change to a file gets recorded in the USN journal. I thought
about deleting the USN journal right after my backup and then recreating it.
However, if you do that, Indexing Service will rescan all the documents
because it thinks it needs to scan for the first time, since the USN journal
is new. I found out that it is a lot faster to just empty the catalog and
have it rebuild, instead of having it reindex all the files again. Does
anyone know of any backup software that is Indexing Service friendly?