Thursday, July 11, 2013

Semantic query optimization gone bad

Here's a small test case to illustrate an issue one of my customers ran into when upgrading their main database from 10G to 11.2.0.3. The issue is about a particular query that returns correct results in 10G, but returns no rows when run in 11G. The test case sets up two tables, a parent table and a child table. The parent table has a concatenated primary key that is referenced by the child table. This issue is caused by the different CHECK constraints that exist on the second column of the key in both the parent and the child table.
drop table child;
drop table parent;

create table parent
(c1 number not null
,c2 number not null
,constraint par_pk primary key(c1,c2)
,constraint chk01 check(c2 = 2))
/

create table child
(c0 number not null
,c1 number not null
,c2 number not null
,constraint chi_pk primary key(c0)
,constraint chi_par_fk foreign key (c1,c2) references parent(c1,c2)
,constraint chk02 check(c2 between 0 and 999))
/

create index chi_fk on child(c1,c2)
/

insert into parent values(1,2);
insert into child values(0,1,2);
commit;

The query in question is this one, rather straightforward join between above two tables:
select c.*
from parent p
    ,child c
where p.c1 = c.c1
  and p.c2 = c.c2
/

Note that this query should just return the single child row, given the data that we inserted in both tables, but in 11.2.0.3 it returns no rows.

This issue seems to be caused by what's called transitive constraint injection into the query's where clause. Since our query states "p.c2 = c.c2" the optimizer in its wisdom decides to apply the text of constraint CHK01 on c.c2 and inject the resulting text into the query-text. And vice-versa it decides to apply the text of constraint CHK02 on p.c2 and inject the resulting text into the query-text. This results in the following plan for above query:
---------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |     1 |    65 |     1   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                |        |     1 |    65 |     1   (0)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| CHILD  |     1 |    39 |     1   (0)| 00:00:01 |
|*  3 |    INDEX SKIP SCAN           | CHI_FK |     1 |       |     1   (0)| 00:00:01 |
|*  4 |   INDEX UNIQUE SCAN          | PAR_PK |     1 |    26 |     0   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("C"."C2"=2)
       filter("C"."C2"=2)
   4 - access("P"."C1"="C"."C1" AND "P"."C2">=0 AND "P"."C2"<=999)

Note that at line 4, we seem to have lost the access predicate "p.c2 = c.c2". In itself that should not be an issue. If you were to look at this plan, and reverse-engineer its workings, the plan should just produce the single child row. But it doesn't.

Workarounds for this issue are:
- disable constraint CHK01, (exclusive) or disable constraint CHK02.
or,
- in the case of this plan, force a full scan on p, by adding a hint such as full(p).

There seems to go something bad internally in the exec-plan with regards to the Index-Unique-Scan on line 4 and its access predicates. If you run the query with gather_plan_statistics and then peek at a-rows (actual rows), you'll see that the Index-Unique-Scan fails to produce the single parent row.

I played around a little bit by varying the CHECK constraint on C2 of the child table. For instance if you change CHK02 to be CHECK(C2 > 0), line 4 of above execution plan ends up with this access predicate:

access("P"."C1"="C"."C1" AND "P"."C2">0 AND "P"."C2"="C"."C2")

Not something I've ever seen, a combination of a unbounded range and an equality predicate used as access predicate on the same indexed column. This one by the way has the same issue: the Index-Unique-Scan also produces zero rows.

Seems the code that is supposed to access the index is fed with some bad input, causing it to fail to execute correctly.

To be continued.

13/7/16 Update: This is a know bug in code that determines the access predicate to be used for an index scan. Bug number 12827166.