Discussion:
Why are NN1, NN2, NN3 ... treated as Noise Words ?
Gang_Warily
2006-11-08 16:15:02 UTC
Hi

I run an intranet on Windows 2003 for a local Government organisation based
in Northamptonshire, England.
All of our PostCodes (equivalent to Zip Codes) are like 'NN1 2XY'.

Using Indexing Service to search seems to discriminate against us most
unfairly,
since any search for a postcode fragment 'NN1' returns

Microsoft OLE DB Provider for Indexing Service error '80041605'
The query contained only ignored words.

NOISE.ENG does not contain any words containing 'NN'.
'NN' followed by any single digit is treated as 'noise'.
'NN10' and above are OK.
All other letter+letter+digit combinations I've tried are OK.
(I haven't tried them all ...)
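
A search form could check for the affected pattern before sending the query, so that users see a friendlier message than error '80041605'. The following is only a sketch based on the observations listed above ('NN' plus exactly one digit is ignored, 'NN10' and above are not); the function names are hypothetical and the pattern is not taken from any Indexing Service documentation.

```python
import re

# Assumption: only 'NN' + exactly one digit is ignored, as observed above.
_REPORTED_IGNORED = re.compile(r"NN[0-9]", re.IGNORECASE)


def is_reported_ignored(term: str) -> bool:
    """Return True if the term is 'NN' followed by exactly one digit."""
    return _REPORTED_IGNORED.fullmatch(term.strip()) is not None


def all_terms_ignored(query: str) -> bool:
    """Return True if every whitespace-separated term matches the pattern.

    Such a query would be expected to fail with
    'The query contained only ignored words.'
    """
    terms = query.split()
    return bool(terms) and all(is_reported_ignored(t) for t in terms)
```

A page could call all_terms_ignored() on the user's input and skip the query when it returns True.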

Does anyone know why this happens ?
Has anyone else noticed anything similar ?
Would changing language help ?
Can I modify the noise-word 'noise.eng' file in any way to help ?
Is there a list of 'signal words' - the opposite of 'noise words' ?

Thanks for your attention .

Eric
Hilary Cotter
2006-11-11 00:18:13 UTC
Using lrtest and the UK English word breaker, I see that these words are
indexed and queried as a unit, i.e. NN1.

I am not sure why you are getting the results you are - perhaps there is
something else in your query?
--
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.

This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
Gang_Warily
2006-11-13 11:03:02 UTC
Hi

I've reduced the query down to
SELECT vpath FROM WWWroot..SCOPE('DEEP TRAVERSAL OF "/" ') WHERE
CONTAINS(contents,'nn4')

Interestingly, googling for LRtest finds
http://support.microsoft.com/default.aspx/kb/890613

...

To view information about how the word breaker processes the "1.1.4322.910"
string during indexing, type the following command line, and then press ENTER:
lrtest /b /c:{188D6CC5-CB03-4C01-912E-47D21295D77E} /m:langwrbk.dll "1.1.4322.910"

...

IWordSink::PutAltWord: cwcSrcLen 1, cwcSrcPos 0, cwc 1, '1'
IWordSink::PutWord: cwcSrcLen 1, cwcSrcPos 0, cwc 3, 'NN1'
...
IWordSink::PutAltWord: cwcSrcLen 3, cwcSrcPos 9, cwc 3, '910'
IWordSink::PutWord: cwcSrcLen 3, cwcSrcPos 9, cwc 5, 'NN910'

This looks as though 'NN' happens to be used internally as a prefix to
indicate a numeric value !
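
To make that reading of the trace concrete, here is a toy model in Python. It is not the real word breaker: it only assumes, from the lrtest output above, that each all-digit token is emitted once as itself (PutAltWord) and once with an 'NN' prefix (PutWord). The function name toy_break and the tokenising rule are hypothetical.

```python
import re


def toy_break(text: str) -> list[tuple[str, str]]:
    """Toy model of the trace: return (method, word) pairs for each token.

    Assumption: all-digit tokens get an alternate 'NN'-prefixed form;
    every other token is emitted unchanged.
    """
    events = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        if token.isdigit():
            events.append(("PutAltWord", token))
            events.append(("PutWord", "NN" + token))
        else:
            events.append(("PutWord", token))
    return events


# In this model the literal postcode fragment and the number 1
# both produce the same indexed word, 'NN1'.
literal = toy_break("NN1")
numeric = toy_break("1")
```

Under this model, a literal 'NN1' in a document cannot be told apart from the number 1, which would be one way for a postcode fragment and a plain digit to get confused.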

Another thread
"Different wildcard/hyphen behaviour in Windows 2003 Server?"
http://tinyurl.com/yjchuy
implies it's a new 'feature' with Windows 2003.

This might help to explain why 'NN1' to 'NN9' are not good search words,
but not why 'NN10' upwards seem to work !

However, searching for 'NN10' also seems to return documents that contain
'10' but do not contain 'NN10' !
(often '10' as part of a date '07/10/06', '11-DEC-06' or time '10:55',
'10hrs' ?)
We can handle that by detecting NNxx postcodes and adding an exclusion to
the query, i.e.
"CONTAINS(contents,'nn14') AND NOT CONTAINS(contents,'14')"

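The workaround above could be generated automatically. This is a minimal sketch, assuming the search terms arrive one at a time and that only 'NN' followed by two or more digits needs the extra exclusion; contains_predicate is a hypothetical helper and 'contents' is the column used in the query earlier in this post.

```python
import re

# 'NN' followed by two or more digits, e.g. 'nn14'; group 1 is the digits.
_NN_MULTI_DIGIT = re.compile(r"NN([0-9]{2,})", re.IGNORECASE)


def contains_predicate(term: str, column: str = "contents") -> str:
    """Build a CONTAINS predicate, excluding the bare number for NNxx terms."""
    if not term.isalnum():
        # Keep the sketch safe: refuse quotes, spaces and other punctuation.
        raise ValueError("only plain alphanumeric terms are supported")
    predicate = f"CONTAINS({column},'{term}')"
    match = _NN_MULTI_DIGIT.fullmatch(term)
    if match:
        predicate += f" AND NOT CONTAINS({column},'{match.group(1)}')"
    return predicate
```

One limitation follows directly from the logic: a document that contains both 'NN14' and a standalone '14' is excluded as well, so some genuine matches may be lost.
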
Interesting !
Does it help us to resolve the problem where 'NN1' is an important 'word' in
our specific context ?
I am hoping that there is a registry entry somewhere 'NumericPrefix = NN'
that we could change to 'NumericPrefix = ZZ', but I would be very surprised !

Or can we hack something to let NN1-NN9 work again ?

Those of us that have NN? postcodes ( in Northamptonshire, UK ) should not
feel paranoid about this - it may be related to a certain web-browser
instead, which is often abbreviated to NN4, NN8 etc ?
I feel another conspiracy theory coming on ...
Gang_Warily
2006-11-20 14:51:01 UTC
Hi

'NN1' is also regarded as a noise word in the MMC search page,
which pretty much rules out my own code as the cause.

http://www.google.com/search?q=inurl:advquery.asp
finds many Indexing Service pages on the Web.
I've tested about 20 of them - all return 'nn1 is an ignored word'.

Does anyone out there have a current Indexing Service implementation that
will search for 'nn1' ? (you don't have to generate content with 'nn1' in it
- the error will even come from an empty catalog)

Can anyone confirm or deny whether the same problem will occur with the
other Microsoft search products - SQL Server 2005 Full-Text Search,
SharePoint, MSCMS etc ? I believe the same search engine is behind them all
...
Gang_Warily
2006-11-22 16:29:02 UTC
Hi again

Windows SharePoint Services v3 seems to bring good news and bad news:
www.wssdemo.com/Shared%20Documents/Search.aspx?k=nn1

'NN1' is not treated as a noise word,
but the search returns all documents containing the single digit '1' !

'NN1 -1' should find all documents with 'NN1' but not '1' ?
www.wssdemo.com/Shared%20Documents/Search.aspx?k=nn1%20-1
"No results matching your search were found"

But presumably if there were a document containing 'NN1' then it would be
found !

There is hope !