strange output from pdf extracts

Hilary Cotter

2006-10-19 17:58:15 UTC

Is this a problem with all PDFs or just a few of them? It could be that the
problem PDFs are largely binary. Use filtdump to determine if they are.
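
If you want a quick way to compare a problem PDF against one that extracts cleanly, a rough byte count can help. The sketch below is not what filtdump does (filtdump runs the filter and dumps what it emits); it only estimates how much of a file is non-text bytes. The function names, the 0.5 threshold and the 1 MB sample size are my own assumptions for illustration, and since PDFs with compressed streams will score high anyway, treat the number as a relative comparison between files rather than a verdict.

```python
import string

# Bytes treated as "text": printable ASCII plus common whitespace.
TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def binary_ratio(data: bytes) -> float:
    """Return the fraction of bytes in `data` that are not printable ASCII."""
    if not data:
        return 0.0
    non_text = sum(1 for b in data if b not in TEXT_BYTES)
    return non_text / len(data)


def looks_largely_binary(path: str, threshold: float = 0.5,
                         sample_size: int = 1 << 20) -> bool:
    """Read up to `sample_size` bytes from `path` and compare the
    non-text fraction against `threshold` (an arbitrary cut-off)."""
    with open(path, "rb") as fh:
        return binary_ratio(fh.read(sample_size)) > threshold
```

Run it over a handful of good and bad documents and see whether the bad ones stand out; if they do not, the content is probably not the cause and the filter registration is the next thing to check.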
--
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.

This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

Post by s_m_b
Whilst our ifilter appears to be working ok - it's pulling documents out -
the extract (characterization) appears to be mangled, or perhaps just not
PDF-1.3 ???? 44 0 obj Linearized 1 O 46 H [ 1354 398 ] L 113568 E 82643 N 4
T 112570 endobj xref 44 46 0000000016 00000 n 0000001267 00000 n 0000001752
00000 n 0000001974 00000 n 0000002225 00000 n 0000002643 00000 n 0000003226
00000 n 0000003450 00000 n 0000003850 00000 n 0000004066 00000 n 00000
is a sample from a pdf doc.
Office docs seem unaffected by this, and until a while ago, the PDFs were
ok too.
I've installed ifilter 6 recently - could that be having any effect?
The system is W2K/IIS5, with content in Acrobat v3 to v6 plus Word, Excel,
etc., using ixsso.query/ixsso.util for the engine.