Indexing Office docs with mixed languages
Charlotte
2007-03-10 16:46:50 UTC
Hi,

We have a file server holding mainly Office documents and PDF files that
sometimes contain mixed languages, e.g. text in English plus a few names in
both transcribed and original form (Russian, Arabic, Chinese, etc.). When I
search in English the hits are fine, but a query typed in any of the other
character sets returns nothing.
I suspect this is due to limited support for these languages. Is that true,
and if so, would it be possible to write our own word breaker and noise word
files?

Many thanks for your support!
Gang_Warily
2007-03-14 12:31:18 UTC
Hi Charlotte

Searching for "Language Resources" might help.
I found
Extending Language Resources for Indexing Service
http://msdn2.microsoft.com/en-gb/library/ms693185.aspx
which looks like a good starting point!

There is also mention of third-party word breakers, but I don't know of any
that exist.

Can you copy text in the other languages out of the PDFs?

Good luck !

Eric
Hilary Cotter
2007-03-14 12:55:26 UTC
You will need to set the culture tags appropriately in your web.config file
for this to work. In ixsso and ADO you will also have to set the locale ID
(localeid) appropriately.
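
As a rough illustration of choosing a locale ID per query, here is a minimal Python sketch that guesses one from the script of the query text. The function name and the script-to-LCID table are assumptions made up for this example, not part of Indexing Service, ixsso or ADO; the numbers are the standard Windows LCIDs for en-US, ru-RU, ar-SA and zh-CN. Mapping a script to a single language is only a heuristic (Cyrillic is not always Russian, and CJK ideographs are not always Chinese).

```python
import unicodedata

# Windows locale identifiers (LCIDs) for a few languages.
# The choice of one language per script is an assumption for illustration.
LCID_BY_SCRIPT = {
    "CYRILLIC": 1049,  # ru-RU
    "ARABIC": 1025,    # ar-SA
    "CJK": 2052,       # zh-CN (Simplified Chinese)
}
DEFAULT_LCID = 1033    # en-US


def guess_query_lcid(query: str) -> int:
    """Return an LCID based on the first letter whose Unicode name
    starts with a known script prefix; fall back to en-US."""
    for ch in query:
        if not ch.isalpha():
            continue  # skip digits, spaces and punctuation
        name = unicodedata.name(ch, "")
        for prefix, lcid in LCID_BY_SCRIPT.items():
            if name.startswith(prefix):
                return lcid
    return DEFAULT_LCID


if __name__ == "__main__":
    for q in ["Moscow", "Москва", "القاهرة", "北京"]:
        print(q, guess_query_lcid(q))
```

The value returned would then be the one you hand to whatever locale setting your query layer exposes before running the search.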
--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com