Friday, July 9, 2010

Search Options

Search is a critical element for OpenGeoPortal. Results must be properly ranked, complete, and returned quickly. There are two approaches we can take. The traditional solution uses SQL: the layer metadata is put into a relational database, SQL queries run against the table, and the results are displayed. Often the SQL query ranks the layers with "ORDER BY", which determines the order in which they are displayed. Another potential approach is to use more modern search technology. There are open source solutions (Lucene and Solr) we might integrate.
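To make the SQL approach concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `layers` table, its `title` and `abstract` columns, the sample rows, and the `search_layers` helper are all made up for illustration; they are not the actual OpenGeoPortal schema. The ranking rule (title matches before abstract-only matches) is also just an assumption.

```python
import sqlite3

def search_layers(conn, term):
    """Return titles of matching layers, with title matches ranked first."""
    pattern = f"%{term}%"
    rows = conn.execute(
        """
        SELECT title
        FROM layers
        WHERE title LIKE ? OR abstract LIKE ?
        ORDER BY CASE WHEN title LIKE ? THEN 0 ELSE 1 END, title
        """,
        (pattern, pattern, pattern),
    ).fetchall()
    return [row[0] for row in rows]

# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE layers (title TEXT, abstract TEXT)")
conn.executemany(
    "INSERT INTO layers VALUES (?, ?)",
    [
        ("Boston Roads", "Street centerlines for Boston"),
        ("Hydrography", "Rivers and streams near Boston roads"),
        ("Census Tracts", "2000 census tract boundaries"),
    ],
)

results = search_layers(conn, "roads")
print(results)
```

Note that the ranking logic lives inside the query text. Every new ranking rule means another CASE branch or another sort key, which is where this kind of query tends to get hard to maintain.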

What are the advantages of using Solr/Lucene? It has built-in support for GIS data, including geodetic coordinates, geohashes, bounding boxes and spatial hierarchies. Distances can be calculated in several ways, including Euclidean, great circle and Manhattan. It has built-in support for advanced search features. Synonyms and misspellings are added by putting them into a configuration file. Likewise, words to ignore can be added by editing another configuration file. Ranking supports different weights on both specific layers and individual metadata fields. Weights are modified via configuration files, not by changing code. Results can be both ranked and grouped. Results are available in multiple formats including XML and JSON.
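As a rough illustration of field weighting, the sketch below builds a Solr query URL using the dismax query parser, where the `qf` parameter lists fields with a `field^boost` weight. The host, the handler path, the field names (`title`, `keywords`, `abstract`), the weights, and the `build_solr_query` helper are all assumptions for the example, not our actual schema. The code only builds the URL; it does not contact a server.

```python
from urllib.parse import urlencode, parse_qs

def build_solr_query(base_url, text, field_weights, rows=10):
    """Build a dismax query URL with per-field weights (hypothetical helper)."""
    # "field^boost" is the dismax syntax for weighting a field.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_weights.items())
    params = {
        "q": text,
        "defType": "dismax",  # use the dismax query parser
        "qf": qf,             # fields to search, with weights
        "rows": rows,         # number of results to return
        "wt": "json",         # response format
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical host, path, fields and weights.
url = build_solr_query(
    "http://localhost:8983/solr/select",
    "boston roads",
    {"title": 3.0, "keywords": 2.0, "abstract": 1.0},
)
print(url)
```

The same `qf` value can also be set as a request handler default in solrconfig.xml, so the weights can be tuned without touching application code.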

The biggest disadvantage is that somebody has to learn a lot about Solr and write some ingest code. I think I'm up for that.

Grant Ingersoll wrote a nice paper discussing using Solr with GIS data.

A few parting comments. First, people building high-end search solutions today don't look to SQL like they used to. Search solutions often include data repositories optimized for search, rather than relying on legacy data stores designed for transaction-based read/write operations. That makes me wonder whether we should build our search solution on a SQL database. Second, I think the days are numbered for people creating their own search solution. Maybe we're not quite there yet, but over time, programmers will increasingly rely on integrating existing search solutions. It happened for hashtables and data repositories, and it is now happening for search.

What do you think? Should we consider it? Is the technology mature enough? Does its Java/Tomcat infrastructure make it easy enough for us to deal with and integrate? Do we envision a search that relies on ESRI SDE that could not be replicated in Solr?

2 comments:

  1. Is the spatial stuff actually built in or is it a plugin?

    This makes sense to me. I think it'd be insane not to use an existing search API if it seems like it can meet the requirements. Doing this opens up the possibility that users may be able to search your data in ways we can't currently anticipate, by forming complex queries that would be difficult to support if you just built the whole service yourself.

  2. I think the Lucene/Solr libraries are mature enough to be used or integrated in an application. I vote for them even if they don't provide all the functionality, as long as they provide all the critical functionality.

    We had a terrible experience maintaining an SQL-based query for search. It got difficult to understand or modify the query after a point. Assuming that this use case will demand several metadata elements and boolean queries, the SQL approach doesn't seem like the way to go.
