By asking the same question of both Gemini and OpenAI's ChatGPT, I can not only deepen my understanding of the topic but also compare the two AIs. Here is the question I posed to both:
“Help me systematically learn Elasticsearch. I already have a basic understanding but want a comprehensive view, particularly of how query/search is conducted, pagination, and max docs. At work I am applying Elasticsearch to document search; the org has already built up its document database on Elasticsearch, and I want to avoid the pitfalls and mistakes that come from ignorance of ES.”
First, OpenAI ChatGPT's answer:
- Core concepts: cluster, node, index, shard, replica, document and mapping, analyzers, tokenizers, filters
- Deep dive into querying and search: Search APIs and the Query DSL (Domain Specific Language); see the sketch after this list
- Pagination, limits, and performance: from/size pagination is suitable for small offsets but inefficient for deep pagination, i.e., from > 10,000 is costly. The alternative is search_after, which relies on the sort values of the last document; it's efficient but requires stable sorting, e.g., sort by timestamp + ID
- Handling large documents and index management:
- Error avoidance and production best practices: update mappings carefully, use dynamic mapping cautiously, use Kibana's Dev Tools to monitor
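To make the Query DSL concrete, here is a minimal search request in Kibana Dev Tools console syntax; the index name my-docs and the field title are hypothetical placeholders:

```
# A match query analyzes the input text and ranks hits by relevance score.
GET /my-docs/_search
{
  "query": {
    "match": { "title": "elasticsearch pagination" }
  },
  "size": 10
}
```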

Then, Gemini’s version:
- Solidify core concepts: cluster, node, index, type (deprecated in ES 7+), document, fields, mapping, shards (each a Lucene index), analyzers; basic operations (CRUD): indexing, getting, deleting, updating, and the bulk API
- Mastering Search and querying:
- query DSL: structure; query context vs. filter context; leaf query clauses: match, multi_match, term, range, exists, prefix, wildcard, regexp; compound query clauses: bool, dis_max, constant_score (see the sketch after this list)
- Mapping and analysis in depth:
- Sorting
- Highlighting: highlight parameter
- Practicalities and Advanced topics:
- pagination (crucial when retrieving large numbers of documents, i.e., “max docs”)
- max docs
- aggregations
- performance tuning and best practices
- working with an existing system
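As a sketch of how leaf and compound clauses compose, here is a bool query in console syntax (index and field names are made up for illustration):

```
# A bool compound query combining leaf clauses: match (full text, scored)
# in query context; term (exact) and range in filter context (unscored, cacheable).
GET /my-docs/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "body": "document search" } }
      ],
      "filter": [
        { "term": { "author": "alice" } },
        { "range": { "created_at": { "gte": "2024-01-01" } } }
      ]
    }
  }
}
```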
Pitfalls to Avoid (Based on Your Concerns):
- Ignoring Mappings / Relying on Dynamic Mapping:
- Mistake: Letting ES guess field types can lead to text where keyword is needed (e.g., for exact IDs, status codes) or vice-versa, impacting search and aggregations.
- Solution: Define explicit mappings. For text fields you want to sort/aggregate on, use multi-fields (e.g., a .keyword sub-field); see the mapping sketch after this list.
- Misunderstanding Query vs. Filter Context:
- Mistake: Using query context (e.g., must with a term query) when a filter context (filter with a term query) would be faster and more appropriate for exact matches.
- Solution: Use filter for yes/no criteria. Use query context when relevance scoring is needed; see the query-vs-filter sketch after this list.
- Deep Pagination with from/size:
- Mistake: Requesting page 1000 with from: 9990, size: 10. This will be slow and hit max_result_window.
- Solution: Implement search_after for deep pagination needs.
- Overly Broad or Inefficient Queries:
- Mistake: Using leading wildcards (*term), complex regexps, or script queries unnecessarily.
- Solution: Design data and queries for efficiency. Use prefix query if possible, analyze text appropriately so match queries work well.
- Not Understanding Analyzers:
- Mistake: Searching for “Quick Brown Fox” and not getting results because the analyzer tokenized it as “quick”, “brown”, “fox”, and your term query (if misused) is looking for the exact, unanalyzed string.
- Solution: Use match queries for analyzed text. Understand which analyzer is configured for your text fields, and test it with the _analyze API (e.g., from Kibana's Dev Tools); see the analyzer sketch after this list.
- Ignoring Cluster Health and Monitoring:
- Mistake: Letting the cluster go red/yellow, running out of disk space, or having consistently high CPU/heap.
- Solution: Regularly monitor cluster health, shard allocation, disk usage, JVM heap.
- Treating Elasticsearch like a Relational Database:
- Mistake: Trying to design complex joins or expecting ACID transactions in the same way.
- Solution: Denormalize data where appropriate for search. Understand ES’s consistency model. Use nested or join field types for limited relational needs, but understand their tradeoffs.
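For the mapping pitfall above, a minimal sketch of an explicit mapping with a .keyword multi-field (index and field names are hypothetical):

```
# "title" is analyzed for full-text search; "title.keyword" keeps the
# exact string for sorting and aggregations. "status" holds exact codes.
PUT /my-docs
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "status": { "type": "keyword" }
    }
  }
}
```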
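For the query-vs-filter pitfall, the same exact-match criterion expressed both ways; the filter version skips scoring and can be cached:

```
# Slower: the term query sits in query (scoring) context.
GET /my-docs/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "body": "elasticsearch" } },
        { "term": { "status": "published" } }
      ]
    }
  }
}

# Faster: the exact-match criterion moved into filter context.
GET /my-docs/_search
{
  "query": {
    "bool": {
      "must": [ { "match": { "body": "elasticsearch" } } ],
      "filter": [ { "term": { "status": "published" } } ]
    }
  }
}
```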
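And for the analyzer pitfall, the _analyze API shows exactly which tokens a string is broken into:

```
# The standard analyzer lowercases and splits on word boundaries,
# so this returns the tokens "quick", "brown", "fox".
GET /_analyze
{
  "analyzer": "standard",
  "text": "Quick Brown Fox"
}
```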
- Pagination (Crucial for “max docs”):
- from and size: The basic method.
- size: Number of hits to return (default 10).
- from: Starting offset (default 0).
- Pitfall: Deep pagination with from and size is inefficient. Each shard must produce from + size results, which the coordinating node then merges and sorts. For large from values this becomes very expensive. The default index.max_result_window is 10,000 (a cap on from + size). You can increase it, but that's generally a bad idea due to the performance implications.
- search_after: The recommended way for deep pagination.
- Uses the sort values of the last document from the previous page to fetch the next page.
- Stateless and more efficient for deep scrolling.
- Requires a consistent sort order, usually including _id or another unique field as a tie-breaker.
- Scroll API (_scroll):
- Used for retrieving very large numbers of documents (e.g., for re-indexing or data export).
- Not for real-time user requests. It keeps a search context alive on the server.
- Remember to clear the scroll context when done.
- Point in Time (PIT) API:
- Used with search_after to ensure consistent views of the data across multiple search requests, even if documents are being indexed or updated concurrently. Useful for deep, consistent pagination.
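A sketch of these options in console syntax (index and field names are illustrative; the PIT id and search_after values come from previous responses):

```
# Shallow paging with from/size: fine for the first few pages.
GET /my-docs/_search
{
  "from": 20,
  "size": 10,
  "query": { "match_all": {} },
  "sort": [ { "created_at": "desc" } ]
}

# Deep paging: open a point in time, then page with search_after.
POST /my-docs/_pit?keep_alive=1m

# _shard_doc is a PIT-only tiebreaker; omit "search_after" on the first page,
# then pass the "sort" values of the previous page's last hit.
GET /_search
{
  "size": 10,
  "pit": { "id": "<pit-id-from-previous-response>", "keep_alive": "1m" },
  "query": { "match_all": {} },
  "sort": [ { "created_at": "desc" }, { "_shard_doc": "asc" } ],
  "search_after": [ "<created_at of last hit>", "<_shard_doc of last hit>" ]
}

# Bulk export with scroll (not for user-facing paging); remember to
# DELETE /_search/scroll when finished.
GET /my-docs/_search?scroll=2m
{
  "size": 1000,
  "query": { "match_all": {} }
}
```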
- Max Docs (Limits and Considerations):
- index.max_result_window: Default is 10,000. This is the maximum value of from + size. If you try to go beyond this with from/size, you’ll get an error. You can increase it, but it’s strongly discouraged for performance reasons. Use search_after or scroll for more than 10,000 documents.
- index.max_terms_count: Default 65,536. Limits the number of terms in a terms query.
- search.max_buckets: Limits the number of buckets an aggregation can return. Default 10,000 in older versions (65,536 in recent ones); it can be raised via a cluster setting, at a performance cost.
- Shard Limits: Lucene caps a single shard at roughly 2.1 billion documents, but performance degrades long before that if shards grow too large (tens of GBs; > 50GB is often a sign to reconsider). The number of shards per node is also a consideration (heap usage).
- Understanding what “max docs” means in your context:
- Max docs returned in a single search request? (Governed by size and index.max_result_window).
- Max docs you can efficiently paginate through? (Leads to search_after).
- Max docs an index can hold? (No hard limit, but practical limits based on shard size, node resources).
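To inspect or (reluctantly) raise the window on an existing index, a sketch with a hypothetical index name; prefer search_after over raising the limit:

```
# Inspect the effective settings; look for index.max_result_window
# (it falls back to the default of 10,000 if not set explicitly).
GET /my-docs/_settings?include_defaults=true

# Raising it trades heap and latency for deeper from/size paging.
PUT /my-docs/_settings
{
  "index": { "max_result_window": 20000 }
}
```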
- Aggregations:
- Powerful for analytics and faceted search (e.g., “show me counts of documents by author”).
- Bucket aggregations (grouping documents) and Metric aggregations (calculating metrics on those groups).
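A small faceted-search sketch combining a bucket and a metric aggregation (field names are hypothetical; author must be a keyword field):

```
# Count documents per author (terms bucket aggregation) and compute the
# average word_count per bucket (metric sub-aggregation); size: 0 skips hits.
GET /my-docs/_search
{
  "size": 0,
  "aggs": {
    "by_author": {
      "terms": { "field": "author", "size": 20 },
      "aggs": {
        "avg_length": { "avg": { "field": "word_count" } }
      }
    }
  }
}
```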
- Performance Tuning & Best Practices:
- Use filter context where possible.
- Avoid leading wildcards in wildcard or regexp queries.
- Optimize mappings (e.g., use keyword for IDs, status fields; disable _source or use source filtering if payload is huge and not always needed).
- Shard sizing and strategy.
- Monitor your cluster (Heap usage, CPU, disk I/O, query latencies). Use the _cat APIs and Kibana monitoring.
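A few of the _cat APIs mentioned above, runnable as-is in Dev Tools:

```
# Cluster health at a glance (green / yellow / red).
GET /_cat/health?v

# Per-node heap, CPU, and disk usage.
GET /_cat/nodes?v&h=name,heap.percent,cpu,disk.used_percent

# Shard sizes, largest first, to spot oversized shards.
GET /_cat/shards?v&s=store:desc
```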
- Working with an Existing System:
- Get Mappings: GET /<your_index_name>/_mapping. Understand how fields are defined.
- Get Settings: GET /<your_index_name>/_settings. Check for custom settings.
- Inspect Queries: If possible, see what queries the application is currently generating.
- Understand Indexing Process: How are documents getting into Elasticsearch? Is there an ETL pipeline? This affects data freshness and structure.
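Putting those inspection steps together (replace my-docs with the real index name):

```
# How are fields typed? Look for text vs. keyword and custom analyzers.
GET /my-docs/_mapping

# Any non-default settings (shard counts, analyzers, max_result_window)?
GET /my-docs/_settings

# Quick shape check: health, doc count, and on-disk size of the index.
GET /_cat/indices/my-docs?v
```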