Thursday, January 14, 2010

Oracle Waits - latch : cache buffer chains

(article best formated at link:http://tinyurl.com/ybyjazq)
Waits on the cache buffer chains latch, ie the wait event "latch: cache buffers chains" happen when there is extremely high and conncurrent access to the same block in a database. Access to a block is normally a fast operation but if concurrent users access a block fast enough, repeatedly then simple access to the block can become an bottleneck. The most common occurance of cbc (cache buffer chains) latch contention happens when multiple users are running nest loop joins on a table and accessing the table driven into via an index. Since the NL join is basically a

For all rows in i
look up a value in j where j.field1 = i.val
end loop

then table j's index on field1 will get hit for every row returned from i. Now if the lookup on i returns a lot of rows and if multiple users are running this same query then the index root block is going to get hammered on the index j(field1).

In order to solve a CBC latch bottleneck we need to know what SQL is causing the bottleneck and what table or index that the SQL statement is using is causing the bottleneck.

From ASH data this is fairly easy:

select
count(*),
sql_id,
nvl(o.object_name,ash.current_obj#) objn,
substr(o.object_type,0,10) otype,
CURRENT_FILE# fn,
CURRENT_BLOCK# blockn
from v$active_session_history ash
, all_objects o
where event like 'latch: cache buffers chains'
and o.object_id (+)= ash.CURRENT_OBJ#
group by sql_id, current_obj#, current_file#,
current_block#, o.object_name,o.object_type
order by count(*)
/

From the out put it looks like we have both the SQL (at least the id, we can get the text with the id) and the block:
CNT SQL_ID OBJN OTYPE FN BLOCKN
---- ------------- -------- ------ --- ------
84 a09r4dwjpv01q MYDUAL TABLE 1 93170

But the block actually is probably left over from a recent IO and not actually the CBC hot block though it might be.
We can investigate further to get more information by looking at P1, P2 and P3 for the CBC latch wait. How can we find out what P1, P2 and P3 mean? by looking them up in V$EVENT_NAME:


select * from v$event_name
where name = 'latch: cache buffers chains'

EVENT# NAME PARAMETER1 PARAMETER2 PARAMETER3
---------- ---------------------------- ---------- ---------- ----------
58 latch: cache buffers chains address number tries

So P1 is the address of the latch for the cbc latch wait.
Now we can group the CBC latch waits by the address and find out what address had the most waits:

select
count(*),
lpad(replace(to_char(p1,'XXXXXXXXX'),' ','0'),16,0) laddr
from v$active_session_history
where event='latch: cache buffers chains'
group by p1
order by count(*);


COUNT(*) LADDR
---------- ----------------
4933 00000004D8108330

In this case, there is only one address that we had waits for, so now we can look up what blocks (headers actually) where at that address

select o.name, bh.dbarfil, bh.dbablk, bh.tch
from x$bh bh, obj$ o
where tch > 100
and hladdr='00000004D8108330'
and o.obj#=bh.obj
order by tch

NAME DBARFIL DBABLK TCH
----------- ------- ------ ----
EMP_CLUSTER 4 394 120

We look for the block with the highest "TCH" or "touch count". Touch count is a count of the times the block has been accesses. The count has some restrictions. The count is only incremented once every 3 seconds, so even if I access the block 1 million times a second, the count will only go up once every 3 seconds. Also, and unfortunately, the count gets zeroed out if the block cycles through the buffer cache, but probably the most unfortunate is that this analysis only works when the problem is currently happening. Once the problem is over then the blocks will usually get pushed out of the buffer cache.

In the case where the CBC latch contention is happening right now we can run all of this analysis in one query

select
name, file#, dbablk, obj, tch, hladdr
from x$bh bh
, obj$ o
where
o.obj#(+)=bh.obj and
hladdr in
(
select ltrim(to_char(p1,'XXXXXXXXXX') )
from v$active_session_history
where event like 'latch: cache buffers chains'
group by p1
having count(*) > 5
)
and tch > 5
order by tch

example output

NAME FILE# DBABLK OBJ TCH HLADDR
------------- ----- ------ ------ --- --------
BBW_INDEX 1 110997 66051 17 6BD91180
IDL_UB1$ 1 54837 73 18 6BDB8A80
VIEW$ 1 6885 63 20 6BD91180
VIEW$ 1 6886 63 24 6BDB8A80
DUAL 1 2082 258 32 6BDB8A80
DUAL 1 2081 258 32 6BD91180
MGMT_EMD_PING 3 26479 50312 272 6BDB8A80
This can be misleading, as TCH gets set to 0 ever rap around the LRU and it only gets updated once every 3 seconds, so in this case DUAL was my problem table not MGMT_EMD_PING


Deeper Analysis from Tanel Poder


Using Tanel's ideas here's a script to get the objects that we have the most cbc latch waits on

col object_name for a35
col cnt for 99999

SELECT
cnt, object_name, object_type,file#, dbablk, obj, tch, hladdr
FROM (
select count(*) cnt, rfile, block from (
SELECT /*+ ORDERED USE_NL(l.x$ksuprlat) */
--l.laddr, u.laddr, u.laddrx, u.laddrr,
dbms_utility.data_block_address_file(to_number(object,'XXXXXXXX')) rfile,
dbms_utility.data_block_address_block(to_number(object,'XXXXXXXX')) block
FROM
(SELECT /*+ NO_MERGE */ 1 FROM DUAL CONNECT BY LEVEL <= 100000) s,
(SELECT ksuprlnm LNAME, ksuprsid sid, ksuprlat laddr,
TO_CHAR(ksulawhy,'XXXXXXXXXXXXXXXX') object
FROM x$ksuprlat) l,
(select indx, kslednam from x$ksled ) e,
(SELECT
indx
, ksusesqh sqlhash
, ksuseopc
, ksusep1r laddr
FROM x$ksuse) u
WHERE LOWER(l.Lname) LIKE LOWER('%cache buffers chains%')
AND u.laddr=l.laddr
AND u.ksuseopc=e.indx
AND e.kslednam like '%cache buffers chains%'
)
group by rfile, block
) objs,
x$bh bh,
dba_objects o
WHERE
bh.file#=objs.rfile
and bh.dbablk=objs.block
and o.object_id=bh.obj
order by cnt
;

CNT OBJECT_NAME TYPE FILE# DBABLK OBJ TCH HLADDR
---- ----------------- ----- ----- ------- ------ ----- --------
1 WB_RETROPAY_EARNS TABLE 4 18427 52701 1129 335F7C00
1 WB_RETROPAY_EARNS TABLE 4 18194 52701 1130 335F7C00
3 PS_RETROPAY_RQST TABLE 4 13253 52689 1143 33656D00
3 PS_RETROPAY_RQST INDEX 4 13486 52692 997 33656D00
3 WB_JOB TABLE 4 14443 52698 338 335B9080
5 PS_RETROPAY_RQST TABLE 4 13020 52689 997 33656D00
5 WB_JOB TABLE 4 14676 52698 338 335B9080
5 WB_JOB TABLE 4 13856 52698 338 335F7C00
6 WB_JOB TABLE 4 13623 52698 338 335F7C00
7 WB_JOB TABLE 4 14909 52698 338 335B9080
141 WB_JOB TABLE 4 15142 52698 338 335B9080
2513 WB_JOB INDEX 4 13719 52699 997 33656D00


Why do we get cache buffers chains latch contention?

In order to understand why we get CBC latch contention we have to understand what the CBC latch protects. The CBC latch protects information controlling the buffer cache. Here is a schematic of computer memory and the Oracle processes, SGA and the main components of the SGA:


The buffer cache hold in memory versions of datablocks for faster access. Can you imagine though how we find a block we want in the buffer cache? The buffer cache doesn't have a index of blocks it contains and we certainly don't scan the whole cache looking for the block we want (though I have heard that as a concern when people increase the size of there buffer cache). The way we find a block in the buffer cache is by taking the blocks address, ie it's file and block number and hashing it. What's hashing? A simple example of hashing is the "Modulo" function
1 mod 4 = 1
2 mod 4 = 2
3 mod 4 = 3
4 mod 4 = 0
5 mod 4 = 1
6 mod 4 = 2
7 mod 4 = 3
8 mod 4 = 0
Using "mod 4" as a hash funtion creates 4 possible results. These results are used by Oracle as "buckets" or identifiers of locations to store things. The things in this case will be block headers.


Block headers are meta data about data block including pointers to the actual datablock as well as pointers to the other headers in the same bucket.


The block headers in the hash buckets are connected via a doubly linked list. One link points forward the other points backwards

The resulting layout looks like


the steps to find a block in the cache are


If there are a lot of sessions concurrently accessing the same buffer header (or buffer headers in the same bucket) then the latch that protects that bucket will get hot and users will have to wait getting "latch: cache buffers chains" wait.

Two possible ways this can happen






Things that can make CBC latch contention worse:





How many copies of a block are in the cache?

select
count(*)
, name
, file#
, dbablk
, hladdr
from x$bh bh
, obj$ o
where
o.obj#(+)=bh.obj and
hladdr in
(
select ltrim(to_char(p1,'XXXXXXXXXX') )
from v$active_session_history
where event like 'latch: cache%'
group by p1
)
group by name,file#, dbablk, hladdr
having count(*) > 1
order by count(*);


CNT NAME FILE# DBABLK HLADDR
--- ---------- ------ ------- --------
14 MYDUAL 1 93170 2C9F4B20

Notice that the number of copies, 14, is higher the the max number of copies allowed set by "_db_block_max_cr_dba = 6" in 10g. The reason is this value is just a directive not a restriction. Oracle tries to limit the number of copies.


Solutions
Find SQL ( Why is application hitting the block so hard? )
Possibly change application logic
Eliminate hot spots
Nested loops, possibly
Hash Partition
Uses Hash Join
Hash clusters
Look up tables (“select language from lang_table where ...”)
Change application
Use plsql function
Spread data out to reduce contention
Select from dual
Possibly use x$dual
Updates, inserts , select for update on blocks while reading those blocks
Cause multiple copies and make things worse


What would OEM do?



12 comments:

  1. Very nice explanation of a hard to understand topic. You definitely have a taste of explaining something with pictures. They are all very well presented. Thank you so much for sharing the information.

    ReplyDelete
  2. Strange happenings with your blog in Thunderbird. When I open your posting in Thunderbird from the Embarcadero blogging feeds,
    it opens Internet Explorer with the URL that starts with http://2ge1ffhdgsaom3padg8i2jd5pke1ajbt.friendconnect.gmodules.com/gadgets/......

    ReplyDelete
  3. interesting - we had some issues with the twitter feed as it cut off the first url in the post halfway. I have since replaced it with a tinyurl address
    Thanks for letting me know

    ReplyDelete
  4. Kyle,

    Very good post and diagrams make it much easier to understand.
    Thing I don't like about this wait event is that if you are seeing lot of contention for this wait event(say 50-100 sessions), there is no immediate solution like "kill a session" :(

    Cheers
    Amit

    ReplyDelete
  5. Hi Kyle,

    Very well written and diagrams are making things more understandble.Just wanted to clarify one thing..The number of copies of a block in the cache increase as the number of sessions requiring the data/row from that same particluar block increase leading to too many buffers in the bucket???

    Regards,
    Anand

    ReplyDelete
  6. Hi Anand,

    If someone changes a block and then doesn't commit, and then I come in and query something from that block, I'll make a read consistent version of the block, meaning I'll clone int and rollback the uncommitted changes. Now if there are a lot of readers and writes on the same block (not the same row or they'd lock each other) then we will get several copies of the block. Oracle tries to limit the # of copies of the block but it's a target limit not a hard limit. All of these copies of the block will all be in the same hash bucket and protected by the same cache buffers chains latch, so that can increase contention on this latch especially if there is already contention on that latch. The more copies of the block, the more blocks in the bucket and the longer it takes for sessions to search the bucket and find the copy of the block they want, thus they hold the latch longer increasing concurrency contention.

    ReplyDelete
  7. Hi Kyle,

    Thanks for such a awesome reply...have to now read lotz of blogs from here.Great.

    Regrads,
    Anand

    ReplyDelete
  8. Hi,

    Thanks for well prepared document.

    ReplyDelete
  9. Hi Kyle ,

    Thanks a lot for such awesome explanation. It really helped me to understand the stuff inside out. Expecting lot more such articles from you.

    Warm regards,
    Chandan.

    ReplyDelete
  10. Thanks for the feedback Chandan.
    I've put much of my wait event documentation as well as links to other peoples research on
    http://sites.google.com/site/embtdbo/wait-event-documentation
    The formating is easier and cleaner than this blog.

    ReplyDelete
  11. nice doc,
    I am facing the CB issue on one of my OLTP DB and the object is in question is global index on partition table, which is almost 30G.
    I can see hot blocks is always changing for the index.
    what do you think on this ?

    ReplyDelete
  12. GREAT article..... Explained things in one doc with diagrams, which will take a bible to explain without them.

    Keep writing & posting.......

    ReplyDelete