Table of Contents
1. Multi-Field Feature
2. Exact Values vs. Full Text
3. Exact Values Need No Analysis
4. Custom Analyzers
  4.1 Character Filters
  4.2 Tokenizer
  4.3 Token Filters
5. Tokenizer
  5.1 Tokenizer with char_filter
    5.1.1 Stripping HTML with keyword and html_strip
    5.1.2 Replacing characters with standard and mapping
    5.1.3 Replacing emoticons with standard and mapping
    5.1.4 standard with the pattern_replace character filter
  5.2 Tokenizer with text
  5.3 Tokenizer with filter (token filters)
6. Configuring a Custom Analyzer
  6.1 The standard format for a custom analyzer (official docs)
  6.2 Defining your own analyzer
1. Multi-Field Feature
1. Exact matching on a manufacturer name: add a keyword sub-field.
2. Using different analyzers: for different languages, for searching a pinyin sub-field, and even for specifying different analyzers for indexing and for search. A sketch of such a mapping follows.
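A minimal sketch of a multi-field mapping (the index name "products" and the field names are made up for illustration; the pinyin analyzer assumes the analysis-pinyin plugin is installed):
PUT products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          },
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin"
          }
        }
      }
    }
  }
}
Exact matches can then run against company.keyword, while pinyin searches go to company.pinyin.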

2. Exact Values vs. Full Text
Exact values: numbers, dates, or a specific string (e.g. "Apple Store"); modeled as the keyword type in Elasticsearch.
Full text: unstructured text data; modeled as the text type in Elasticsearch.
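The difference is easy to see with the _analyze API (a usage sketch, not part of the original article): the keyword analyzer keeps the whole input as a single term, while the standard analyzer splits and lowercases it.
POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}
// one token: "Apple Store"

POST _analyze
{
  "analyzer": "standard",
  "text": "Apple Store"
}
// two tokens: "apple", "store"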

3. Exact Values Need No Analysis
Elasticsearch builds an inverted index for every field.
Exact values need no special analysis at index time: they are indexed verbatim as a single term.
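As a sketch of the consequence (hypothetical index and field names): a term query against a keyword field matches only the value exactly as it was indexed.
PUT demo
{
  "mappings": {
    "properties": {
      "brand": { "type": "keyword" }
    }
  }
}

PUT demo/_doc/1
{ "brand": "Apple Store" }

// matches: the exact value was indexed verbatim
GET demo/_search
{ "query": { "term": { "brand": "Apple Store" } } }

// no match: keyword values are not analyzed, so case must match exactly
GET demo/_search
{ "query": { "term": { "brand": "apple store" } } }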

4. Custom Analyzers
When the analyzers that ship with Elasticsearch cannot meet your needs, you can define your own by combining three kinds of components (a complete example is given in section 6.2):
Character Filter | Tokenizer | Token Filter
4.1 Character Filters
Character filters process the text before the Tokenizer, e.g. adding, removing, or replacing characters. Multiple Character Filters can be configured, and they affect the position and offset information seen by the Tokenizer.
Some built-in Character Filters:
HTML strip - removes HTML tags | Mapping - string replacement | Pattern replace - regex replacement
4.2 Tokenizer
The Tokenizer splits the original text into terms (tokens) according to certain rules. Built-in Tokenizers include: whitespace | standard | uax_url_email | pattern | keyword | path_hierarchy. You can also implement your own Tokenizer as a plugin written in Java. A quick sketch of uax_url_email follows; the others appear in later sections.
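uax_url_email is the only one of these not demonstrated later: it behaves like standard but keeps URLs and email addresses intact as single tokens (the sample text below is made up).
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Contact admin@example.com or visit https://www.elastic.co"
}
// tokens: "Contact", "admin@example.com", "or", "visit", "https://www.elastic.co"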
4.3 Token Filters
Token filters add, modify, or delete the tokens produced by the Tokenizer. Built-in Token Filters include: lowercase | stop | synonym (adds synonyms). A synonym sketch follows; lowercase and stop are demonstrated in section 5.3.
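Since synonym does not appear in the later examples, here is a minimal sketch (the synonym pair "tv, television" is made up); _analyze accepts custom filters defined inline.
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "synonym",
      "synonyms": [ "tv, television" ]
    }
  ],
  "text": "smart tv"
}
// tokens: "smart", "tv", "television" ("tv" and "television" share the same position)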
5. Tokenizer
5.1 Tokenizer with char_filter
5.1.1 Stripping HTML with keyword and html_strip
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<b>hello world</b>"
}
// Response: the tags are stripped; the offsets still refer to the original text, including <b> and </b>
{
  "tokens" : [
    {
      "token" : "hello world",
      "start_offset" : 3,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
5.1.2 Replacing characters with standard and mapping
The standard tokenizer splits text on word boundaries. Note that it does not lowercase; that is the job of a token filter, which is why the "I" in the output below stays uppercase.
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type" : "mapping",
      "mappings" : [ "- => _" ]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}
// Response
{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test_990",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "650_555_1234",
      "start_offset" : 26,
      "end_offset" : 38,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
}
5.1.3 Replacing emoticons with standard and mapping
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type" : "mapping",
      "mappings" : [ ":) => happy", ":( => sad" ]
    }
  ],
  "text": ["I am feeling :)", "Feeling :( today"]
}
// Response
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "feeling",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Feeling",   // positions jump between array elements by the position_increment_gap (default 100)
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 104
    },
    {
      "token" : "sad",
      "start_offset" : 24,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 105
    },
    {
      "token" : "today",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 106
    }
  ]
}
5.1.4 standard with the pattern_replace character filter
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type" : "pattern_replace",
      "pattern" : "http://(.*)",
      "replacement" : "$1"
    }
  ],
  "text" : "http://www.elastic.co"
}
// Response: "http://" is stripped by the char filter; the offsets still span the original URL
{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
5.2 Tokenizer with text
The path_hierarchy tokenizer splits a value along path separators, emitting every ancestor path:
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/ymruan/a"
}
// Response: every ancestor path is emitted as a token
{
  "tokens" : [
    {
      "token" : "/user",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "word",
      "position" : 0
    }
  ]
}
5.3 Tokenizer with filter (token filters)
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "stop" ],   // removes stopwords such as "on", "the", "a"
  "text": [ "The girls in China are playing this game!" ]
}
// Response: stopwords are removed, but plural and inflected forms are kept
{
  "tokens" : [
    {
      "token" : "The",   // not removed: the stop filter is case-sensitive
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "girls",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "China",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "playing",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game!",   // the whitespace tokenizer does not strip punctuation
      "start_offset" : 36,
      "end_offset" : 41,
      "type" : "word",
      "position" : 7
    }
  ]
}
Adding the snowball stemmer reduces tokens to their stems, e.g. plurals to singular:
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "stop", "snowball" ],
  "text": [ "The girls in China are playing this game!" ]
}
{
  "tokens" : [
    {
      "token" : "The",   // the uppercase "The" is still not filtered out
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "girl",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "China",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "play",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game!",
      "start_offset" : 36,
      "end_offset" : 41,
      "type" : "word",
      "position" : 7
    }
  ]
}
After adding the lowercase filter, "The" is lowercased first and then removed as a stopword:
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "lowercase", "stop", "snowball" ],
  "text": [ "The girls in China are playing this game!" ]
}
{
  "tokens" : [
    {
      "token" : "girl",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "china",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "play",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game!",
      "start_offset" : 36,
      "end_offset" : 41,
      "type" : "word",
      "position" : 7
    }
  ]
}
6. Configuring a Custom Analyzer
6.1 The standard format for a custom analyzer (official docs)
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": { ... custom character filters ... },
      "tokenizer":   { ... custom tokenizers ... },
      "filter":      { ... custom token filters ... },
      "analyzer":    { ... custom analyzers ... }
    }
  }
}
6.2 Defining your own analyzer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "emoticons" ],
          "tokenizer": "punctuation",
          "filter": [ "lowercase", "english_stop" ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [ ":) => happy", ":( => sad" ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
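You can verify the analyzer with _analyze (a usage sketch; the sample sentence is made up):
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
// tokens: "i'm", "happy", "person", "you"
// ":)" becomes "happy" via the emoticons char_filter; "a" and "and" are removed by english_stop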
References:
Geek Time course: Elasticsearch Core Technology and Practice (极客时间《Elasticsearch核心技术与实战》)