visit
LIKE
operator in this or that form
.
, ,
, ?
, !
, -
etc.Orange
by the keyword orange
.
Manticore does it all automatically. For example, text “What do I have? The list is: a cat, a dog and a parrot.
” gets tokenized into:
mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('What do I have? The list is: a cat, a dog and a parrot.', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | what | what |
| 2 | do | do |
| 3 | i | i |
| 4 | have | have |
| 5 | the | the |
| 6 | list | list |
| 7 | is | is |
| 8 | a | a |
| 9 | cat | cat |
| 10 | a | a |
| 11 | dog | dog |
| 12 | and | and |
| 13 | a | a |
| 14 | parrot | parrot |
+------+-----------+------------+
14 rows in set (0.00 sec)
Here comes the first problem: in some cases separators are considered regular word characters, for example in “Is c++ the most powerful language?
” it’s obvious that c++
is a separate word. It’s easy to understand for people, but not for full-text search algorithms, since it sees the plus sign, doesn’t find it in it’s list of word characters and removes it from the token, so you end up with:
mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('Is c++ the most powerful language?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | is | is |
| 2 | c | c |
| 3 | the | the |
| 4 | most | most |
| 5 | powerful | powerful |
| 6 | language | language |
+------+-----------+------------+
6 rows in set (0.00 sec)
OK, but what’s the problem?
The problem is that after this tokenization if you search for c#
, for example, you will find the above sentence:
mysql> drop table if exists t;
mysql> create table t(f text);
mysql> insert into t values(0,'Is c++ the most powerful language?');
mysql> select highlight() from t where match('c#');
+-------------------------------------------+
| highlight() |
+-------------------------------------------+
| Is <b>c</b>++ the most powerful language? |
+-------------------------------------------+
1 row in set (0.01 sec)
It happens because c#
is also tokenized to just c
and then the c
from the search query matches the c
from the document and you get it.
OK, why don’t I put + and # to the word characters list?
mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('Is c++ the most powerful language?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | is | is |
| 2 | c++ | c++ |
| 3 | the | the |
| 4 | most | most |
| 5 | powerful | powerful |
| 6 | language | language |
+------+-----------+------------+
6 rows in set (0.00 sec)
It works, but +
in the list immediately starts affecting other words and searches, for example:
mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('What is 2+2?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | what | what |
| 2 | is | is |
| 3 | 2+2 | 2+2 |
+------+-----------+------------+
3 rows in set (0.00 sec)
You wanted c++
to be a separate word, but not 2+2
, didn’t you?
Right, so what can we do?
To treat c++
special way you can make it an exception.
So, exceptions
(also known as synonyms) allow to map one or more terms (including terms with characters that would normally be excluded) to a single keyword.
Let’s make c++
an exception by putting it into an exceptions file:
➜ ~ cat /tmp/exceptions
c++ => c++
mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is c++ the most powerful language? What is 2+2?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | is | is |
| 2 | c++ | c++ |
| 3 | the | the |
| 4 | most | most |
| 5 | powerful | powerful |
| 6 | language | language |
| 7 | what | what |
| 8 | is | is |
| 9 | 2 | 2 |
| 10 | 2 | 2 |
+------+-----------+------------+
10 rows in set (0.01 sec)
Hooray, c++
is now a separate word and the plus signs are not lost, and all is ok with 2+2
too.
For example, people write c++
both in lower and upper case. Let’s try the above exception with the upper case?
mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | is | is |
| 2 | c | c |
| 3 | the | the |
| 4 | most | most |
| 5 | powerful | powerful |
| 6 | language | language |
| 7 | what | what |
| 8 | is | is |
| 9 | 2 | 2 |
| 10 | 2 | 2 |
+------+-----------+------------+
10 rows in set (0.00 sec)
Oops, C++
was tokenized as just c
, because the exception is c++
(lower case), not C++
(upper case).
But did you notice the exception constitutes a pair of items, not a single one: c++ => c++
. The left part is what triggers the exceptions algorithm in the text, the right part is a resulting token. Let’s try to add mapping of C++
to c++
?
➜ ~ cat /tmp/exceptions
c++ => c++
C++ => c++
mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | is | is |
| 2 | c++ | c++ |
| 3 | the | the |
| 4 | most | most |
| 5 | powerful | powerful |
| 6 | language | language |
| 7 | how | how |
| 8 | about | about |
| 9 | c++ | c++ |
+------+-----------+------------+
9 rows in set (0.00 sec)
Alright, now it’s fine again, since both C++
and c++
are tokenized into token c++
. So satisfying.
AT&T => AT&T
and at&t => AT&T
.M&M's => M&M's
and m&m's => M&M's
and M&m's => M&M's
U.S.A. => USA
and US => USA
us => USA
, because we don’t want each us
become USA
.
Manticore Search users also often call exceptions
synonyms, because another use case for them is not to just retain the special character and letter case, but to map terms written absolutely differently to the same token, for example:
MS Windows => ms windows
Microsoft Windows => ms windows
Why is this important? Because it enables to easily find documents with Microsoft Windows
by MS Windows
and vice versa.
mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems');
mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 68976139 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)
So at first glance, it works fine, but thinking further about it and recalling the exceptions are case and byte sensitive you can ask yourself: “Can’t people write MicroSoft windows
, MS WINDOWS
, microsoft Windows
and so on?”
It looks no good at all. What can we do about it?
Another tool which is similar to the exceptions is wordforms
. Unlike the exceptions, the word forms are applied after tokenizing incoming text. So they are:
charset_table
enables case sensitivity)
➜ ~ cat /tmp/wordforms
walks => walk
walked => walk
walking => walk
mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('walks _WaLkeD! walking', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | walks | walk |
| 2 | walked | walk |
| 3 | walking | walk |
+------+-----------+------------+
3 rows in set (0.00 sec)
As you can see all the 3 words were converted to just walk
and, note, the 2nd word _WaLkeD!
even being very deformed was also normalized fine. Do you see where I’m going with this? Yes, the MS Windows
example. Let’s test if the wordforms can be useful to solve that issue.
➜ ~ cat /tmp/wordforms
ms windows => ms windows
microsoft windows => ms windows
mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems'), (0, 'porch windows'),(0, 'Windows are rolled down');
mysql> select * from t;
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 68976166 | Microsoft Windows is one of the first operating systems |
| 68976167 | porch windows |
| 68976168 | Windows are rolled down |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)
mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 68976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)
✅ MS Windows
finds Microsoft Windows
fine.
mysql> select * from t where match('ms WINDOWS');
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 68976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.01 sec)
✅ ms WINDOWS
works fine too.
mysql> select * from t where match('mIcRoSoFt WiNdOwS');
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 68976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)
✅ And even mIcRoSoFt WiNdOwS
finds the same document.
mysql> select * from t where match('windows');
+---------------------+---------------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------------+
| 68976166 | Microsoft Windows is one of the first operating systems |
| 68976167 | porch windows |
| 68976168 | Windows are rolled down |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)
✅ Just basic windows
finds all the documents.
So indeed, wordforms
helps to solve the issue.
Let’s take a look at another example: we want to improve search for the brand name Floor & Decor
. We can assume people can write this name in the following forms:
Floor & Decor
Floor & decor
floor & decor
Floor and Decor
floor and decor
Floor & Decor Holdings
Floor & Decor Holdings, inc.
Now that we know how exceptions
and wordforms
work what do we do to cover this brand name?
First of all we can easily notice that the canonical brand name is Floor & Decor
, i.e. it includes a special character which is normally considered a word separator, so should we use exceptions
? But the name is long and can be written in many ways. If we use exceptions
we can end up with a huge list of all the combinations. Moreover there are extended forms Floor & Decor Holdings
and Floor & Decor Holdings, inc.
which can make the list even longer.
The most optimal solution in this case is to just use wordforms
like this:
➜ ~ cat /tmp/wordforms
floor & decor => fnd
floor and decor => fnd
floor & decor holdings => fnd
floor and decor holdings => fnd
floor & decor holdings inc => fnd
floor and decor holdings inc => fnd
Why does it include &
? Actually you can skip it:
floor decor => fnd
floor and decor => fnd
floor decor holdings => fnd
floor and decor holdings => fnd
floor decor holdings inc => fnd
floor and decor holdings inc => fnd
because wordforms
anyway ignores non-word characters, but just for the sake of ease of reading it was left.
As a result you’ll get each combination tokenized as fnd
which will be our shortkey for this brand name.
mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('Floor & Decor', 't')
+------+-------------+------------+
| qpos | tokenized | normalized |
+------+-------------+------------+
| 1 | floor decor | fnd |
+------+-------------+------------+
1 row in set (0.00 sec)
mysql> call keywords('floor and Decor', 't')
+------+-----------------+------------+
| qpos | tokenized | normalized |
+------+-----------------+------------+
| 1 | floor and decor | fnd |
+------+-----------------+------------+
1 row in set (0.00 sec)
mysql> call keywords('Floor & Decor holdings', 't')
+------+----------------------+------------+
| qpos | tokenized | normalized |
+------+----------------------+------------+
| 1 | floor decor holdings | fnd |
+------+----------------------+------------+
1 row in set (0.00 sec)
mysql> call keywords('Floor & Decor HOLDINGS INC.', 't')
+------+--------------------------+------------+
| qpos | tokenized | normalized |
+------+--------------------------+------------+
| 1 | floor decor holdings inc | fnd |
+------+--------------------------+------------+
1 row in set (0.00 sec)
mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0,'It\'s located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');
+---------------------+---------------------------------------------------+
| id | f |
+---------------------+---------------------------------------------------+
| 68976231 | It's located on the 2nd floor. Decor is also nice |
+---------------------+---------------------------------------------------+
1 row in set (0.00 sec)
We can see here that Floor & Decor Holdings
finds the document which has floor
in the end of the first sentence and the following one starts with Decor
. This happens because floor. Decor
also gets tokenized to fnd
since we use just wordforms
that are insensitive to letter case and special characters:
mysql> call keywords('floor. Decor', 't');
+------+-------------+------------+
| qpos | tokenized | normalized |
+------+-------------+------------+
| 1 | floor decor | fnd |
+------+-------------+------------+
1 row in set (0.00 sec)
mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms' index_sp='1';
mysql> insert into t values(0,'It\'s located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');
Empty set (0.00 sec)
Floor & Decor
, as we remember is converted into fnd
by wordforms
index_sp='1'
splits text into sentencesfloor.
and Decor
end up in different sentencesfnd
and therefore all the original forms of it anymore
exceptions
- wordforms
-