Learning from Authoritative Security Experiment Results
Remote Identification of Bloom Filter False Positives
James Jones, SAIC
Jeremiah Shafer, SAIC
Scott Belden, SAIC
Background. Bloom filters are widely used to perform fast tests for set membership. One common example is to use bloom filters to speed up database queries. However, Bloom Filters are subject to false positives, occurring when a Bloom Filter indicates that the subject of the query is in the data store when in fact it is not. Such false positives result in an exhaustive and costly sequential search of the underlying data store. A Bloom Filter true negative is a query which the Bloom Filter correctly indicates is not in the data store. Both false positives and true negatives will return a negative result in response to a query, but the false positive query requires an exhaustive search of the data store to establish this. Such false positives are expected to be on the order of 1-10 per 1,000,000 queries.
Aim. Use timing analysis to distinguish queries which are Bloom Filter false positives from Bloom Filter true negatives without detailed knowledge of or access to the underlying Bloom Filter implementation.
Method. Configure an Ubuntu server with postgres and the Internet Movie Database (IMDb), and build a tsvector index on the movie_info text field, triggering use of the postgres Bloom Filter implementation. Disable all extraneous services to minimize timing noise. Using a local client, submit a large number of random text queries (at least 100,000) against the movie_info index of the database and collect timing data for each query. To identify the Bloom Filter false positives, flush the postgres query cache tables and repeat any queries with round trip times at least three standard deviations above the mean round trip time for the query set. Queries with consistently high round trip times should represent Bloom Filter false positives.
Results. The distribution of round trip times for negative queries was skewed right with a long but relatively smooth tail. Approximately 0.4% of the negative queries (about 400 out of 100,000) had round trip times in excess of three standard deviations above the mean. However, the query times were not consistent when repeated, even after clearing and priming the postgres database with preliminary queries to build on-demand tables and ensure that executable code is loaded into memory.
Conclusions. There may be too much noise in the timing data of queries to distinguish Bloom Filter false positives from true negatives. Noise may be machine generated, or may be due to optimizations in the postgres database. A simple Bloom Filter implementation may be used to validate the core hypothesis before continuing explorations of production applications.