Lexing and parsing text queries

Input:

hello world OR alice -bob

Output:

OR( AND( TERM(hello), TERM(world) ), AND( TERM(alice), NOT( TERM(BOB) ) ) )

1. Lexing

How should we turn hello world OR alice -bob into tokens?

One thought would be to split on blank spaces. But what would that mean for -bob? There’s technically two concepts in -bob: NOT and bob.

We could defer handling of that until the next stage. But I think it would be easier for the parser to take in a stream of single, independent, individual concepts.

Are there any other concepts we need to handle, other than -?

What about "?

Handling " as a separate token would let us do things like search for the actual string -bob by surrounding it with quotes, "-bob", rather than having it turn into NOT bob.

But using " as a special character means we need to do something to allow the " to be included in the search string.

That leads me to want to introduce \ as an escape character. So the sequence of characters "-\"bob\"" would search for NOT "bob" (where the " is part of the character sequence included in the string we are avoiding in our search).

Hmmm…

But now that I’m talking about using \ as an escape character, I’m thinking my initial idea of using " to search for the character sequence -bob is redundant. Those quotes that surrounded -bob to turn off the negation were basically “escaping” the negation. If I have an escape character, then that’s redundant.

Ok. New plan. Every character is what it is and does what it does. If it’s a special character, like -, then you can turn off its “specialness” by escaping it; \-bob If you want to combine hello world into a single term so that it is treated as TERM(hello world) rather than AND(TERM(hello) TERM(world)), then you can use \ to escape the space; hello\ world. That simplifies our handling of special cases. And if you want to search for a \ then you can just escape it; \\.

Even OR and AND behavior can be disabled by simple escaping one of the characters; \OR or AN\D or whatever.

This lexing strategy is probably parseable with a regular expression. But I think I’d spend more time debugging the regular expression than if I wrote something from scratch that parsed tokens character-by-character.

2. The tokens

The tokens and the grammar seem to go hand in hand. I had a hard time thinking about the tokens until after I put some thought into the grammar.

For example, hello world should parse to AND(TERM(hello), TERM(world)). The space character there parses as an AND.

But hello OR world should parse to OR(TERM(hello), TERM(world)). The spaces there don’t parse to anything. Or maybe you could consider them part of the OR token such that a complete OR token is, in regex, #\Space+OR#\Space+.

I’m not sure if overloading spaces by having #\Space+ tokenize as AND and #\Space+OR#\Space+ tokenize as OR is going to come back to bite me.

The actual tokens

AND   = `#\Space+`
OR    = `#\Space+OR#\Space+`
NOT   = `#\Space+-`
CHARS = `.+`

I’m using . to match everything on CHARS but to prevent it from matching #\Space or - I’m only including those in the CHARS match if they are escaped with \.

3. The tokenizer

I think this it’s going to be easiest to parse this one character at a time so I’m going to use a struct to keep track of the state of the world; what character I’m on, whether or not “escape” has been activated, etc…

lexer defstruct
(defstruct lexer
  in     ; Input stream. Read one character at a time.
  out    ; Output list of tokens.
  cur    ; Token currently being read (as list of chars).
  state  ; Was last character an escape character.
  )

The tokenize function will take an instance of that struct and a stream. It will read and handle one character at a time based on the rules described in the above token table.

4. Take 2

The rules for tokenizing were too inelegant. Spaces sometimes being an AND other times being part of an OR… I didn’t like it.

New plan. Spaces are going to be a token.

Tokens, take 2

SPACER = `#\Space+`
OR    = `OR`
NOT   = `-`
OPAREN = '('
CPAREN = ')'
CHARS = `.+`

This will make scanning a lot easier. I don’t think it will have much of a change in the grammar.

The AND will appear in the grammar as just a space surrounded on both sides by anything other than an OR.

Ok. Continuing.

5. The real tokenizer

tokenize
(defun tokenize (lxr strm)
  (do ((c (getc strm) (getc strm)))
      ((null c)
       (progn
         (when (lexer-cur lxr)
           (setf (lexer-out lxr)
                 (cons (coerce (reverse (lexer-cur lxr)) 'string)
                       (lexer-out lxr))))
         (setf (lexer-cur lxr) nil)
         (setf (lexer-out lxr)
               (reverse (lexer-out lxr)))))
    (if (lexer-state lxr)
        (escaped lxr c)
        (unescaped lxr c))))
getc
(defun getc (strm)
  (read-char strm nil))

6. The parser

I found it easiest to think about the grammar from the bottom up in order of highest to lowest precedence.

The highest precedence most fundamental “thing” is the search term; a sequence of characters.

grammar
searchterm      →     <a sequence of characters>

Parenthesis form the next highest precedence thing (in my mind at least). They might be at the same level. The sequence of characters is irreducible (or reducible to itself if you choose to word it like that). The parenthetical is always reducible to whatever the things inside evaluate to. Both will always reduce the same way regardless of what surrounds them to the left or right.

grammar +=
searchgroup     →     ( ... )

What’s the name of the thing that should go inside the parenthesis there? We’ve only defined searchterm so far. But that’s too restrictive. Basically, anything can go inside the parenthesis there and it will be correct. So, although we haven’t given it a name yet, the thing that will go inside the parenthesis of this rule will probably be the highest-level, most generic thing in our grammar. Maybe something like expression.

The next thing we’ll handle is negation.

negation grammar?
negation        →     - < ... >

Hmm… What grammar thing is going to go on the right side of this rule? (By the way, Unlike the ( and ) from the rule above, I’m treating < and > as things that are outside of the rules of grammar. They are just to communicate in this document a placeholder of something.)

Negation is the first interesting one because it’s the first one where precedence comes into play.

For example, although we haven’t considered OR yet, imagine we had and consider the following:

- foo OR bar

Do we intend (NOT foo) OR par or do we intend NOT (foo OR bar)?

I’m choosing negation to bind more tightly than disjuction. So whatever grammar thing makes up the foo OR bar structure can’t go to the right of the negation rule.

Either way… that doesn’t come into play yet because the only thing we have defined so far is searchgroup and searchterm, so let’s use those.

grammar +=
negation        →     - <searchgroup | searchterm>

We want AND to bind more tightly than OR, so I think it should come next.

grammar +=
conjunction     →     <negation | searchgroup | searchterm> AND <negation | searchgroup | searchterm>

I’m beginning to see why we might want to add grammar rules that do nothing but group a bunch of choices into a single grammar term.

If we had something like:

simpler grammar?
unary       →     negation | searchgroup | searchterm

Then the AND conjuction could be the much more consise:

simpler grammar...
conjuction      →     unary AND unary

Ok. Quick recap. We have searchterm, searchgroup, negation, conjuction.

Now let’s add disjunction.

grammar +=
disjunction     →     conjuction OR conjuction

And lastly, going back to that thing inside our parenthetical that we never gave a name too.

This is just going to be one or more of any of the previously defined grammar rules.

grammar +=
expression      →     <disjuncion | conjuction | negation | searchgroup | searchterm>+

That makes the final grammar:

final grammar
expression      →     <disjuncion | conjuction | negation | searchgroup | searchterm>+
disjunction     →     conjuction OR conjuction
conjunction     →     <negation | searchgroup | searchterm> AND <negation | searchgroup | searchterm>
negation        →     - <searchgroup | searchterm>
searchgroup     →     ( expression )
searchterm      →     <a sequence of characters>

7. Why the above is broken.

broken grammar
expression      →     <disjuncion | conjuction | negation | searchgroup | searchterm>+

That just doesn’t make sense.

The most obvious thing is that a single - should not be a valid expression. The bare minimum thing you need to have a valid expression is at least a search term somewhere in there.

I think I ended up at that grammar by just blindly trusting that a parent rule could be a conjunction of some descendant rules.

8. Trying again... this time from the top down.

Thinking about it from the bottom up might be reasonable but it led me to some make some careless assumptions about what could/should work, so I’m going to try again from the top down and see what happens.

The bare minimum thing we need in order to have a valid expression is at least a search term somewhere in the expression. (I’m going to take it as a given that we’ll call the top-level all-encompassing grammar term an expression.)

grammar take 2
expression      →     <...> [ ___ <...> ]*

I’ll fill in the ... and ___ later. I basically want to say that an expression is “some thing, possibly followed by some type of ‘joiner’ followed by the same type of thing”.

The ‘joiner’ needs to be the lowest precedence thing. Otherwise, although I can’t prove it, I think we’ll hit ambiguity.

I think the next weakest binding is AND. So the <...> in the block above will be replaced by whatever we call the following rule. Let’s call it a conjuction.

grammar take 2 +=
conjuction      →     <...> [ AND <...> ]*

Looking at that rule, I feel like conjuction is a bit of a misnomer because the AND <...> is optional. That optionality is necesarry though, unless I wanted to take the grammar a different route. It’s necessary because if this conjuction is taking the place of the <...> in the expression then I don’t want to require AND in an expression.

I’ll accept it as a misnomer because I think it will simplify things.

Side-note: it’s pretty obvious now that the ___ in the [ ___ <...> ] from the expression rule is going to be OR.

Anyways, moving on… The next weakest binding might be shared between - (negation) and (...) (grouping).

Since this next thing is going to replace the <...> in the conjuction, it might be nice to be able to refer to it by a single name rather than a structure like <negation | parenthetical>. But I’m not going to give it a single name. I care more about reducing the number of grammar rules.

grammar take 2 +=
negation        →     - <...>

And what does negation operate on? What’s in that <...>? It’s at least a parenthetical. But it’s also something further down that we haven’t gotten to yet; a simple “search term”.

grammar take 2 +=
searchgroup     →     ( <...> )

And lastly, our most atomic, the “search term”.

grammar take 2 +=
searchterm      →     <sequence of characters>

Grammar, take 2, finalized

What does that grammar look like when we put it all together?

grammar take 2 finalized
expression      →     conjunction [ OR conjunction ]*
conjuction      →     <negation | searchgroup | searchterm> [ AND <negation | searchgroup | searchterm ]*
negation        →     - <searchgroup | searchterm>
searchgroup     →     ( <searchterm> )
searchterm      →     <sequence of characters> | expression

9. Nope! That's still wrong!

I’m having a hard time clearly explaining this.

But think about ___ OR ___ AND ___ OR ___.

We want the ... ___ AND ___ ... in the middle to get evaluated first.

But given the grammar above, I don’t think that wouldn’t happen.

When I think about parsing this grammar, I think about it being “left recursive”.

Bah. I’m stuck again.

I think an issue is that I keep trying to construct grammars with a “tree” structure in mind. But there’s something different about this. Like… the left nodes need to have slightly different rules than the right nodes. And it’s not a tree because certain children reference ancestors. And there’s certain rules about when/where to put those child->ancestor references in order to not run into issues.

10. Oh I think I figured it out.

grammar take 3
conjuction      →     ___ AND ___

The leftmost ___ there can’t have something that binds weaker than AND.

That’s all there is to it.

When you have an operator with a left and a right side, the left side can’t have any descendants that have operators that bind weaker than the operator.

working grammar
expression      →     conjunction [ OR expression ]*
conjuction      →     <negation | searchgroup | searchterm> [ AND conjunction ]*
negation        →     - <searchgroup | searchterm>
searchgroup     →     ( expression )
searchterm      →     <sequence of characters>