IPhraseSink question

Discussion:

WALDO

2007-05-30 19:07:00 UTC

I am developing a custom library for WordBreaking and Stemming using
Microsoft's IWordBreaker, IWordSink, IWordFormSink, IStemmer, and
IPhraseSink interfaces. I am reasonably certain my code works, but I can't
verify the IPhraseSink implementation.

I feed my library bits of text that just roll off the top of my head, which
parse very well (breaking and generating alternate words), like:

"I am the very model of a modern major general."
"You're either with me or against me."

but the word breaker never steps into the IPhraseSink. I can't distinguish
whether the text I'm feeding it is just not phraseable [I realize that's not
a word :)], or if I've implemented the IPhraseSink incorrectly. There isn't
much out there on the web about IPhraseSink specifically.

Does anyone out there have a test bit of text that they know will be broken
into phrase(s) using the default en-us locale (1033)?

Or should I just trust my code?
Or should I just keep feeding it more text?

Any help is appreciated.
Thanks in advance.

WALDO

Mike C#

2007-05-30 20:11:53 UTC

Not familiar with building out Wordbreakers and Stemmers (but starting to
play around with them). I was just wondering, what happens if you put
quotes around your phrase, since a "phrase search" on the SQL side is
quoted?

"I am the model", "You're either", etc. but with the quotes?

WALDO

2007-05-30 21:18:19 UTC

Tried that. No dice.

I am the very model of a modern "major general"

Mike C#

2007-05-30 21:21:32 UTC

Post by WALDO
Tried that. No dice.
I am the very model of a modern "major general"

Sorry then. Maybe Hilary will pick this one up. I know he has experience
with the interfaces and programming wordbreakers/stemmers.

WALDO

2007-05-31 18:47:05 UTC

OK, I got one so far.

Heart-Shaped Box

WALDO

2007-05-31 21:50:53 UTC

So it seems phrases get triggered by hyphenated strings. I'm going to see
what else I can get to trigger them.

WALDO

2007-06-01 18:48:32 UTC

Actually, I have yet to trigger the IPhraseSink methods, but this is what
I've discovered about IWordSink.StartAltPhrase, IWordSink.EndAltPhrase, and
IWordSink.PutAltWord:

Phrasing:
When a phrase is encountered, StartAltPhrase is called. It may be called
several more times before EndAltPhrase is called. On the first
IWordSink.PutWord call after StartAltPhrase, the buffer passed in holds the
entire phrase, and the word being put is the first alternate word of that
phrase. If you are associating a phrase with the words that come from it,
start tracking the phrase at this point. Each subsequent PutWord call
delivers another chunk of the phrase, until EndAltPhrase is called.
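
A minimal sketch of the bookkeeping described above, written in Python rather
than against the real COM interfaces so it can be run anywhere. The class and
method names (PhraseTrackingSink, start_alt_phrase, put_word, end_alt_phrase)
are hypothetical stand-ins for an IWordSink implementation, the buffer and
position arguments are left out, and the call sequence at the bottom is
invented for illustration. It was not captured from the actual en-US word
breaker.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AltPhraseSession:
    """One StartAltPhrase ... EndAltPhrase span, split into alternatives."""
    alternatives: List[List[str]] = field(default_factory=list)


class PhraseTrackingSink:
    """Hypothetical stand-in for an IWordSink implementation (not COM)."""

    def __init__(self) -> None:
        self.sessions: List[AltPhraseSession] = []
        self.plain_words: List[str] = []
        self._current: Optional[AltPhraseSession] = None

    def start_alt_phrase(self) -> None:
        # The first call opens a session; each later call made before
        # end_alt_phrase starts another alternative inside that session.
        if self._current is None:
            self._current = AltPhraseSession()
        self._current.alternatives.append([])

    def put_word(self, word: str) -> None:
        # Words seen inside a session belong to the newest alternative;
        # words seen outside any session are ordinary words.
        if self._current is None:
            self.plain_words.append(word)
        else:
            self._current.alternatives[-1].append(word)

    def end_alt_phrase(self) -> None:
        if self._current is None:
            raise RuntimeError("end_alt_phrase without start_alt_phrase")
        self.sessions.append(self._current)
        self._current = None


# Invented call sequence for "Heart-Shaped Box" (an assumption, not a trace).
sink = PhraseTrackingSink()
sink.start_alt_phrase()
sink.put_word("heart")
sink.put_word("shaped")
sink.start_alt_phrase()
sink.put_word("heart-shaped")
sink.end_alt_phrase()
sink.put_word("box")

print(sink.sessions[0].alternatives)
print(sink.plain_words)
```

Logging the real callbacks in this shape makes it easy to compare what the
word breaker actually sends against what the sink recorded.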

Alternate words (from IWordSink, not IWordFormSink):
Alternate words from the IWordSink operate very similarly. The first
PutAltWord call signals the start of an alternate word session. The entire
buffer is the base word, and the chunk of the buffer being put is an
alternate word. PutAltWord is called for every subsequent alternate form of
the base word except the last. For the last alternate word, PutWord is
called instead, which signals the end of the alternate word session.
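
The alternate word session can be tracked the same way. This is only a sketch
in Python under the behavior described above: AltWordGroupingSink and its
methods are hypothetical names, not the COM interface, and the words fed to it
are made up rather than taken from real word breaker output.

```python
from typing import List


class AltWordGroupingSink:
    """Hypothetical sink that groups alternate forms of one base word."""

    def __init__(self) -> None:
        # One inner list per word position; a position with alternates
        # holds every alternate form, in the order it was received.
        self.positions: List[List[str]] = []
        self._pending: List[str] = []

    def put_alt_word(self, word: str) -> None:
        # Opens (or continues) an alternate word session.
        self._pending.append(word)

    def put_word(self, word: str) -> None:
        # Delivers the last alternate and closes the session. With no
        # pending alternates this is just an ordinary single word.
        self.positions.append(self._pending + [word])
        self._pending = []


# Invented call sequence (an assumption, not a trace).
alt_sink = AltWordGroupingSink()
alt_sink.put_alt_word("you're")
alt_sink.put_word("you")
alt_sink.put_word("either")

print(alt_sink.positions)
```

Treating PutWord as the session terminator means the sink needs no separate
"end" callback for alternate words, which matches the observation that none
is sent.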

This is what I've discovered from throwing approximately 30,000 entries at
my word breaker. Does this sound accurate? I still haven't gotten my
IPhraseSink to fire.
