Monday, February 1, 2010

Tuning the Application, Database and Hardware

Tuning Example in 3 parts
part 1 - "The Database is hanging!" AKA "the application has problems, good grief"
Maybe you can imagine, or have already lived through, the experience of the application team calling with anger and panic in their voices: "the database is sooo slow, you've got to speed it up."
What's your first reaction? What tools do you use? How long does it take to figure out what's going on?
Let's take a look at how it works with DB Optimizer. When I get a call like this, the first thing I do is look at the database in DB Optimizer:

I can clearly see that the database is not bottlenecked, so the problem must be in the application.
Why do I think it's the application and not the database? The database is showing plenty of free CPU in the load chart, the largest chart at the top of the image above. In the load chart there is a horizontal red line. The red line represents the number of CPUs on the system, which in this case is 2. The bars, which represent the load on the database measured in average active sessions, rarely cross that line. The session activity is averaged over 5 one-second samples, so each bar is 5 seconds wide. The bars above hover around 1 average active session and are rarely green. Green represents CPU load; any other color indicates sessions waiting. The main wait in this case is orange, which is "log file sync", i.e. waits on commits. So why is the database more or less idle, and why are most of the waits we do see for commits? I look at the code coming into the database and see something like this:
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;
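
(As a side note, even without a profiler you can often spot this kind of chatty, row-at-a-time pattern straight from the shared pool. The query below is only a sketch; the execution-count threshold is an arbitrary number picked for illustration.)

select sql_id, executions, rows_processed, sql_text
from   v$sql
where  executions > 10000
and    rows_processed <= executions   -- roughly one row (or less) per execution
order  by executions desc;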

Doing single-row inserts and committing after each one is very inefficient. A lot of time is wasted on network round trips, which is why the database is mostly idle: while the application thinks it's running full speed ahead, it is actually waiting mainly on network communication and commits. If we commit less often and batch the work we send to the database, reducing the network communication, we will run much more efficiently. Changing the code to

begin
  for i in 1..1000 loop
    insert into foo values ('a');
    -- commit;
  end loop;
end;
/
commit;

removes most of the communication delay, and now we get a fully loaded database - but we run into database configuration issues.
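
(If you want to double-check the commit waits outside of DB Optimizer, the system-level wait statistics tell the same story. Something like the following, run before and after the change, is a rough sketch of that check.)

select event, total_waits, time_waited_micro/1000000 as seconds_waited
from   v$system_event
where  event in ('log file sync', 'log file parallel write')
order  by time_waited_micro desc;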
Part 2 - It *is* the database (i.e. the DBA gets to work)

In the above DB Optimizer screen, the same workload was run 4 times. We can see that the elapsed time (the width of each load) shrank and the percentage of activity spent on CPU increased.
Runs:
1. "log file sync", the orange color, is the biggest area, which means users are waiting on commits, even though we are now committing less often in the code. In this case we moved the log files to a faster device. (You can see the checkpoint activity just after run 1, where we moved the log files.)
2. "buffer busy wait", the burnt red, is the biggest area. We drilled down on the buffer busy wait event in the Top Event section and the details told us to move the table from a normal tablespace to an Automatic Segment Space Managed (ASSM) tablespace.
3. "log file switch (checkpoint incomplete)", the dark brown, is the largest area, so we increased the size of the log files. (You can see the I/O time spent creating the new redo logs just after run 3.) Example DDL for the run 2 and run 3 fixes is sketched just after this list.
4. The run time is the shortest and all the time is spent on CPU, which was our goal: take advantage of all the processors and run the batch job as quickly as possible.
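
(The fixes for runs 2 and 3 boil down to a couple of pieces of DDL. The statements below are only a sketch - the tablespace name, file paths and sizes are made up for illustration and would need to fit your environment.)

-- run 2: create an ASSM tablespace and move the hot table into it
create tablespace assm_data
  datafile '/u02/oradata/orcl/assm_data01.dbf' size 1g
  extent management local
  segment space management auto;

alter table foo move tablespace assm_data;
-- remember to rebuild any indexes on foo after the move

-- run 3: add larger redo log groups, then drop the old small ones once they are inactive
alter database add logfile group 4 ('/u03/oradata/orcl/redo04.log') size 512m;
alter database add logfile group 5 ('/u03/oradata/orcl/redo05.log') size 512m;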

Part 3 - It's the machine (rock, paper, scissors)
Now that the application is tuned and the database is tuned let's run a bigger load:

We can see that the CPU load is constantly over the max CPU line. How can the CPU load be bigger than the number of CPUs actually on the machine? It means that the demand for CPU is higher than the CPU available. In the image above there are 2 CPUs on the machine but an average of 3 users who think they are on the CPU, which means that on average 1 user is not actually running but is waiting for the CPU.
At this point we have two options. In this case we are only running one kind of load, i.e. the insert, so for inserts we can actually go even further with tuning and use Oracle's bulk operations (FORALL):

declare
  TYPE IDX IS TABLE OF Integer INDEX BY BINARY_INTEGER;
  MY_IDX IDX;
BEGIN
  -- fill a PL/SQL array in memory
  for i in 1..8000 loop
    MY_IDX(i) := 1;
  end loop;
  -- FORALL binds the whole array and sends it to the SQL engine in one shot
  FORALL indx IN MY_IDX.FIRST .. MY_IDX.LAST
    INSERT INTO foo ( dummy )
    VALUES ( MY_IDX(indx) );
  COMMIT;
end;
/

But if this were an application with many different SQL statements and the load well distributed across them, then we'd have a case for adding more hardware to the system. Deciding to add hardware can be difficult because the information needed to make that decision is usually unknown, unclear or just plain confusing, but DB Optimizer makes it easy and clear, which can save weeks or months of wasteful meetings and debates. For example:
If we look at the bottom left, there is no SQL that takes up a significant amount of the load, i.e. there is no outlier SQL that we could tune to win back a lot of wasted CPU. We'd have to tune many, many SQL statements and improve most of them to win back enough CPU to get our load below the max CPU line. In this case, adding CPUs to the machine is probably the easiest and most cost-effective solution.
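
(One way to sanity check the "no outlier SQL" conclusion, assuming you are licensed to query ASH, is to count CPU samples per statement over the window in question. The query below is a rough sketch; the one-hour window is arbitrary.)

select *
from ( select sql_id, count(*) as cpu_samples
       from   v$active_session_history
       where  session_state = 'ON CPU'
       and    sample_time > sysdate - 1/24
       group  by sql_id
       order  by count(*) desc )
where rownum <= 10;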
Conclusion:
With the load chart we can quickly and easily identify the bottlenecks in the database, take corrective actions, and see the results. In part 1 we had an application problem, in part 2 we had three database configuration issues, and in part 3 we had a hardware sizing issue. In all three parts DB Optimizer presents the data and the issues clearly, making the solutions clear as well.
