The great grandson of Husnu Sensoy

March 26, 2007

Spirit of Expert on Part 3…

Filed under: Oracle — kocakahin @ 10:06 am

Introduction

Esteemed friend not only provides the graphs but also post me a very detailed comment on my last post. Rather than publishing it as a comment, I will also post it as a separate write-up. I believe it will take more hit than my original post…

Although I am not good at queuing theory…

I would like to make couple comments. First of all, I would like to mention that this entry summarizes most of the “essentials” about one server systems. However, before delving into the details, it might be useful to perform some back of the envelope calculations. For one server models, Kingman’s approximation is a very simple yet extremely useful starting point. By using this approximation, you can pretty much estimate how your system will behave under different conditions in terms of arrival and service rates. Let me briefly summarize what this approximation tells us. First, couple definitions: (sorry I cannot enter formulae so it will be messy)

Let

c_a = coefficient of variation of inter-arrival time which is equal to sigma_a / m_a where sigma_a and m_a are the standard deviation and mean of inter-arrival time respectively. (Of course m_a = 1/lambda) (c_a = 1 for exponential distribution)

c_s = coefficient of variation of the service time which is equal to sigma_s / m_s where sigma_s and m_s are the standard deviation and mean of service time respectively. (Of course m_s = 1/mu) (c_s = 1 for exponential distribution)

u = utilization rate (u = lambda/mu)

Wq = expected waiting time in queue

Here comes the beauty:

Wq = (1/2)*(c_a^2+c_s^2)*(u/(1-u))*m_s

and here is the interpretation:

A very nice property of this approximation is that you do not have to know the distribution of arrivals or service rates. All you need is the mean and the standard deviation values so that you can compute c_a, c_s and u. Therefore, you do not have to go through the hassle of finding appropriate distributions for inter-arrival and service rates. I am not saying that data modeling is useless. What I am saying is you can get the most of what you want without performing any detailed analyses.

On the other hand, “1-u” is the probability that you will find the queue empty and the workstation idle if you observe it at a random time. Therefore, u/ (1-u) is a unit-less measure of the ability of your server to work off queues. Now, go ahead and start your favorite spreadsheet program to see the effect of increasing utilization. Start with low utilization and keep incrementing it until you reach 1. You can increment it by .05 until you reach .9 then by 0.01 until you reach .99. Now draw the graph u/(1-u). See what happens? Basically u/(1-u) ratio explodes after .95.

What does this tell you? This basically tells one thing: KEEP YOUR UTILIZATION UNDER CONTROL. Let’s go back to our approximation. Suppose c_a^2, c_s^2 and m_s are fixed. Then, you get the exact same result for Wq by changing utilization (i.e. Wq explodes after .95). In other words, if the ratios between your interarrival rate (lambda) and your service rate (mu) gets larger and larger, after one point, your buffer explodes. You can also observe it by looking at the Arena graphs shown above. Compare the mean response times for different loads.

Of course there is no problem when you have low utilization. When there are no jobs waiting in the queue the system refreshes itself. As mentioned in the text, once you are at time t, it does not matter what happened before so that you can treat t as zero since there is nothing in the queue. However, if you have “extreme load” it might take a really long amount of time until your buffer gets empty. Yay… Explosion… Almost all people working with queues know this “explosion” fact. But the problem is, most of them do not know how to interpret it and avoid it if possible.

First of all, if you believe high utilization is always good; change your job and do financial accounting instead of engineering. That’s what most of the financial accounting people think by the way, since they believe that high utilization means effective use of your companies’ resources. LOSERS… However, if you are happy as an engineer consider watching your system more carefully. And do not just watch also act. How can you act against increasing utilization? First of all, your operations team must set a maximum utilization level goal. This level might differ from operation to operation but a rule of thumb I can recommend is that you should not let your system go above 95% utilization in order to keep waiting times under control. How can you do that? You can set a maximum buffer size say, S, and reject customers once you hit that level (i.e. G/G/1/S). In other words, you reject arrivals if you do not have any space in your buffer. Another approach is related to renewal theory. You do not set a buffer size but you regularly refresh your system. You set up certain time intervals to empty your system. For instance, if we go back to the grocery store example, you stop accepting customers at the end of each hour until you have no people in your buffer. If you want something hardcore, you can combine these two policies. In other words, you set two levels for your buffer say S (large S) and s (small s). When you have S jobs waiting, you stop accepting new jobs until you hit s. Once you have s jobs in your buffer, you can start accepting again until you hit S and this cycle continues. Since you goal to minimize the waiting time deciding what S and s becomes an optimization problem. I can make another long comment about how to solve it but I think it is out of scope this time.

BOTTOMLINE:

  • You do not need detailed statistical analysis in order to comprehend how your system will behave.
  • Most of the systems can be managed by using back of the envelope calculations and careful control of utilization.
  • Once again, that does not mean you should ignore queuing theory. It is absolutely essential if you want to find a global optimal solution.

If you want some more fun try to observe what happens to Wq when c_a and c_s increases? You will see that VARIANCE IS THE ENEMY. This will also help you understand how important it is to achieve deterministic service rates. There is not much you can do about arrivals but once you have deterministic service rates, you can at least get rid of c_s^2 from that equation.

Have fun,

Y. Alan

Last word…

My dear friend I post your comment without touching a word. Thank you for great enlightenment on topic. Thanks…

Loading Oracle Part 3: Exponential Inter-arrivals & a bit of queuing theory

Filed under: Oracle — kocakahin @ 1:50 am

Introduction

In part two of the blog post series we have seen how to create a transaction load within the database itself. One of the missing things we have kept till this part is the concept of inter-arrival time. Remember that in previous part, as an example, we’ve scheduled 20 sessions and let them to insert 5000 records at the same time. But can you see the problem about this method?

Having a single 5000 insertion time statistics for each session usually causes a great variation in computations. This is all about what we measure. Remember that we are trying to approximate the response time of Oracle in a concurrent environment. In previous post we have only an average and a variance. Nothing more… But at the end of this model we obtain some mathematical models rather than just numbers. So what was wrong with the precious one? Let’s try to build a correlation with a super-market (more or less both fit into queuing theory facts).

Imagine a market with 20 cash desks (an Oracle instance whose job_queue_processses parameter is set to 20). Market is closed initially and at 09:00 doors are opened. Then 20 customers arrive simultaneously and buy 5000 goods and they pay in different cash desks. We measure the time between arrival and departure of each customer and make some statistics. I think the problem is clearer now. Let me summarize what is wrong:

  • Customers arrive suddenly at the same time just like 20 simultaneous sessions.
  • There are only 20 timing measurements. Modeling a system with 20 points is not so sound.

Tuning the Previous Model

Previous model we have utilized in the second part of the post has those problems. Let’s now see the methods for correcting those errors:

  • We rather use stochastic stationary Poisson arrivals instead of n simultaneous arrivals at a time.
  • We will do statistics over much more than 20 timing measurements.
  • We try to build mathematical stochastic processes.

Poisson Arrivals & Queuing Theory

Let’s first talk about the real characteristics of super market customer arrivals. Assuming that there can be only one customer at exactly the same time seems to be a sound assumption. Next assumption is another sound one. The number of arrivals between t and t+s depends only to s. In other words the expected number of customers between 09:00-09:10, 09:10-09:20 or 11:21-11:31 is equal to each other (Even we will relax this assumption also in next parts of the post). Customers arriving between [t,t+s] and [t+s,t+v] is totally independent. Each customer decides totally free to come without interacting with other customers. That’s all. I think all seem sound. Let me summarize them more formally once more using the words of infamous Law & Kelton adapting to our Oracle scenario:

  1. Sessions are opened one at a time. (It is very rare that session opening timestamp differences between two consecutive Oracle sessions equal to 0).
  2. The distribution of N(t+s) – N(t) is independent of t for all t,s>0
  3. N (t+s) – N(t) (the number of arrivals in the time interval (t, t+s]) is independent of {N(u), 0<u<t}.

If those three assumptions hold, this type of arrivals called stationary Poisson arrivals. Fortunately, most queuing systems follow this distribution. I won’t give formulas on Poisson distribution but let me illustrate what some Poissons look like:

Figure 1

We will be talking about those histograms a bit and try to understand, informally, the characteristics of a Poisson distribution. Logically those distributions are discrete and may take any discrete value in [0, infinite). Mode of the distribution is always somewhere around lambda. There may be other visual characteristics but now I will focus on the meaning of these histograms. What they mean? Take lambda=2 for example. Let’s read the figure:

  1. Given that on the average 2 customers arrive per minute to XYZ-market, what is the probability that in 1 minute exactly 5 customers will arrive?
  2. Or given that on the average 6 customers arrive per minute to UVT-market, what is the probability that less than 4 customers (exclusive) will arrive?

Answers to those questions may be given by just looking to histograms. First question asks the value you read from the 5th green bar. That is ~0.36. The second one asks for the sum of 0th, 1st, 2nd and 3rd purple bars (cumulative distribution, F (3) = P(X<4)). That is ~0.15. Isn’t it amazing that by just knowing that your customers’ arrival satisfies those three conditions, you can estimate anything about their arrival?

Let’s now focus on some other issue. If we can compute the number of customers (sessions) we will have in the next minute, is it possible to compute the time between two consecutive customers (sessions)? Exactly! Once more I won’t deal with mathematical proofs, although I like them very much, but I will put the outcome in one sentence:

“The time between two consecutive Poisson arrivals with parameter lambda follows exponential distribution with parameter 1/lambda

Figure 2

Figure 2 illustrates the inter-arrival rates of two systems with different inter-arrival rates (0.5 sess/min and 1 sess/min). For the smaller arrival rate half of the customers arrive before 1.5 minutes whereas this duration falls to less than 1 minute for the higher arrival rate. You may see this from cumulative distribution drawings in Figure 2(red and purple lines).

Oracle in Action Again

After this much theory let’s load Oracle once more. This time we will use Java for loading Oracle. That’s not because Oracle is not eligible to do it himself but rather people like external tools. What we do is pretty simple.

We create stationary Poisson Oracle sessions at five different arrival rates as follows:

Scenario

Arrival rate

Scenario Description

NoLoad

15 sessions/minute

Java creates 100 Poisson arrivals at given rate, each session (Java thread) inserts 5000 rows into table and measures the service time.

LowLoad

30 sessions/minute

MediumLoad

1 session/second

HeavyLoad

2 sessions/second

ExtremeLoad

10 sessions/second

Table 1

Graphs of inter-arrival rates are like more or less Figure 2. Another, more important, fact is the service time of Oracle. In following figures you will see the behavior of our development system under load.

NoLoad

Figure 3 Beta Model for NoLoad response time

As you may see in Figure 3, response time for NoLoad scenario is almost deterministic. Almost 90% of all insertion requests answered in less 160 than milliseconds. That’s highly due to arrival of sessions are so rare that no session need to wait for others to complete their task. If we include outliers into account, best response time can be modeled by using a Beta distribution (37 + 632*Beta (0.128, 1.2)).

LowLoad

Figure 4 LogNormal Model for LowLoad response time

As we increase session rate to 30 sessions per minute. Oracle response time changes characteristics and response time durations fit to a log-normal model which is a common distribution in modeling duration to complete some task. Notice that tails of the distribution are fatter than the previous model. This illustrates that some sessions wait for disk I/O, latch or lock held by other sessions. Stochastic model may be formalized as 39 + Logn(199, 1100).

MediumLoad & HeavyLoad

 
 

Figure 5 Exponential Model for MediumLoad response time

Figure 6 Exponential Model for HeavyLoad response time

 
 

MediumLoad and HeavyLoad scenarios perfectly represent the most simple queue systems. Meaning that inter-arrival and response times are both exponential and there is only one server to serve. We can model Oracle response time for those session loads as 44+EXPO (388) and 43 + EXPO (2290) respectively. Notice that those exponentials are somewhat different from those we use for inter-arrival times. In fact they are not. There is only a shift parameter 43-44. It statistically only changes the expected value of response time, not the characteristics of the model.

ExtremeLoad

Figure 7 Normal Model for ExtremeLoad response time

Final scenario’s purpose is somewhat different from the previous ones. Rather than modeling the real world, we use this scenario to show a general occurrence in queuing systems. Notice that majority of customer response time shift from left to middle of the histogram. That’s waiting time durations increase. Characteristics of the response time model changes from exponential to a center-moded model. Best model seems to be Normal distribution Norm (3380, 1960). One problem is that this model allows negative response times also .But this can be solved by absolute value function trick by disturbing the real Normal distribution model a little bit.

Conclusion

In this part we model Oracle response time for different arrival rates. Starting from this part of Loading Oracle post we start to make a little bit of statistics. This will go like this. Hopefully this much mathematics is not neither too much nor too informal. If it is so I am looking for your feedbacks.

Source codes

Below link includes the source codes I have used in this part. Be careful about running java code. A random value to arrival rate may cause an Oracle disaster for large values

http://husnu.sensoy.googlepages.com/LoadingOraclePart3.rar

Special Thanks

For the outputs of Arena Input Analyzer, I appreciate my esteemed friend Yasin Alan who is currently a PhD student in Cornell University Department of Operations Research and Information Engineering, New York.

Loading Oracle Part 4: R= S+W and Active Session History(ASH)

March 4, 2007

Loading Oracle Part 2: Hundreds of Virtual Users in Oracle 10g…

Filed under: Oracle — kocakahin @ 3:05 am

Introduction

Starting from the second part of Loading Oracle post, our main scope will be the bulletins listed in the Simulation section of Part 1. In the next parts of the post those bulletins will be discussed in detail. To begin with we explain how to imitate hundreds of users in Oracle without using any external thread/process creating software.

Job concept in Oracle

In http://tahiti.oracle.com
Oracle defines the job as follows:

“A job is a user-defined task that is scheduled to run one or more times. It is a combination of what needs to be executed (the action) and when (the schedule)”

Prior to 10g, Oracle has an infamous PL/SQL package DBMS_JOBS having 5-6 sub-routines in. However by the arrival of 10g, Oracle replace his old package with a more talented one namely DBMS_SCHEDULER. Although detailed discussion of DBMS_SCHEDULER package is not the scope of this post, we will briefly explain how Oracle handles its scheduled jobs and why this is important for our purpose. So let us explain the Oracle job architecture on a single instance (it is much more an extension in RAC):

(DBA|ALL|USER)_SCHEDULER_JOBS views

Oracle internally keeps the metadata of all scheduler jobs in its internal tables. Database users may access this information from catalog views named as *_SCHEDULER_JOBS (just like almost any catalog view).

Oracle creates those views by joining its internal tables obj$, user$, obj$, and scheduler$_job. Since the details of these views are out of our scope, for further detail on them you may refer to http://download-uk.oracle.com/docs/cd/B19306_01/server.102/b14237/statviews_2049.htm#i1587306

JOB ORCHASTIRATION & EXECUTION

So if Oracle stores his scheduled jobs in those structures, the next question may be “How he manages and executes them?”(By the way Oracle is a male).Each Oracle instance has a job coordinator background process namely cjqNNN. This process orchestrate all other slave job processes for execution, do caching of jobs for faster execution, perform disaster recovery of jobs for crashed instances, etc.

First of the two most important topics for Loading Oracle post purpose is the execution of those jobs. As a slave process got the metadata of the job to be executed it starts a new session, performs its task and closes the connection by committing his transaction. This means that each job by itself acts like a session opening a database connection. So this means that it will be like loading database with number of sessions at a time if Oracle can start multiples of scheduler jobs at a time.

So the next question will be how to create those sessions at a time and what determines the number of Oracle jobs that can run simultaneously. The answer is JOB_QUEUE_PROCESSES parameter. Oracle can handle as many slave processes as the value of JOB_QUEUE_PROCESSES parameter. This parameter is a system and instance modifiable parameter and can take a maximum value of 1000 which is a pretty large value.

A CHASM CAN BE FELT IN…

Remember that Oracle is software and is not designed to run in real-time. So scheduling of n simultaneous jobs to start at the same point in time may not be as easy as the value of n grows up. So you are the one as developer that has to synchronize your jobs to start almost at the same time. As you will see in below we use another good old Oracle package namely DBMS_ALERT for this purpose. To understand how the things work see Figure 1.

Figure 1

  1. At time t = 0, we schedule 10 jobs to be executed (DBMS_SCHEDULER.create_job). Theoretically they have to start execution slightly after t = 0. But we know that it may not be possible in practice. So we order each slave process to wait for GO signal till t = 60. (DBMS_ALERT.waitone(‘GO’,l_msg,l_status,60)).
  2. Although scheduled processes may not start execution at exactly the same time, almost all of them will reach to so-called synchronization barrier by time t = 30 – ∆. (a small time just before t = 30 sec.)
  3. At time t = 30 master process signal a GO for registered slaves (DBMS_ALERT.signal(‘GO’,'Go’)). All waiting processes hanged at synchronization barrier will continue running having l_status set to value 0 meaning that “I’ve received the signal“.
  4. After this point two things will occur but the exact ordering of those events cannot be marked in precision. So they will be discussed as sub-events of the same event:
    1. Since a few jobs may not reach synchronization barrier before the signaling of GO, they will timeout at time t = 60. But this is not a big problem since they have l_status to be set 1 meaning that timeout has occurred for them and they don’t contribute to timing metrics.
    2. Finally at time t = 84 no more running processes left.

Let’s load it…

We see that it is possible to load Oracle with hundreds of processes at a time. Let’s as the first example of Loading Oracle post, try a mini example by testing the average insertion rate (IPS) for heap and indexed organized tables. By this example we fundamentally show that

Time metrics you measure by “Idiot Looping” doesn’t represent the real world.

Pre-loading tasks

Before running the scripts loading the Oracle, test schema needs some grants and objects. Moreover JOB_QUEUE_PROCESSES parameter needs to be set to an appropriate value. So run the preloading.sql to complete following tasks:

  • Set JOB_QUEUE_PROCESSES = 50(A larger value than 20, because there are some Oracle functionalities (like AQ propagation processes) having a number of scheduled jobs).
  • Grant DBMS_LOCK and DBMS_ALERT to test schema.
  • Create a timing table called AUX_RUN_STAT for measuring the time metrics.
  • Create a sequence called DEMO_SEQ for later use.

IPS over IOT in an isolated environment

First we measure the IOT insertion rate (IPS) in an isolated environment. For doing this test, we will use SIOT.sql (Sequential IOT) script. This script inserts 5000 rows into an IOT table 20 times. Then we will take the average of this insertion times and talk over this in “Moral of the story” section.

IPS over heap table in an isolated environment

This time we perform the same test we have just done on a heap table using same storage parameters we have used for IOT table. For doing this we’ve used SHEAP.sql (Sequential HEAP).

Insertion rate of IOT in a concurrent environment

After completing “Idiot Looping” tests now we can start up with real world tests. First we will try loading Oracle with 20 concurrent sessions each will try inserting 5000 rows into an IOT. For this we’ve used CIOT.sql (Concurrent IOT).

Insertion rate of heap table in an isolated environment

Finally we do the concurrency test for HEAP using CHEAP.sql (Concurrent HEAP).

Moral of the story

After all those executions, we may extract the following summary table using AUX_RUN_STAT table, a piece of statistics and SQL.

  

Scenario 

IPS 

Average (msec.) 

Std (msec.) 

Service Time for 95% of sessions or iterations

1 

SHEAP 

52687.04 

94.90 

1.59 

94.03 – 95.77

2 

SIOT 

45433.89 

110.05 

11.08 

104.02 – 116.08

3 

CHEAP 

68181.82 

1100.00 

108.09 

1029.96 – 1170.04

4 

CIOT 

66473.65 

1128.27 

12.20 

1120.36 – 1136.17

Figure 2

So let’s underline some results:

  1. Notice that IPS rate increases (by an average rate of 23% – 32%) as the level of concurrency increases from 1 to 20.
  2. Another point is one can conclude that heap table yields a 13% better IPS rate than indexed organized table by just looking sequential test results. Indeed, this is the pitfall of “Idiot Looping” method. Because concurrent test results indicate that this difference is only 2.5% at 20 level of concurrency.
  3. As a final remark sequential results shows a statistically significant difference in service times between insertion timings over heap table and IOT, but this result is falsified by the results of concurrent test because the service time intervals of IOT and heap table are overlapping.

To sum up, this simple example illustrates the problem with testing systems using “Idiot Looping” technique. Remember that it may give a rough idea about the performance of the system but concurrent environment tests may totally irrelevant to those results. So in order to reach correct benchmarks about your system, you need to benchmark your system at the correct level of concurrency.

Loading Oracle Part 3: Exponential inter-arrivals & a bit of queuing theory

 
 

 
 

 
 

 
 

  

March 2, 2007

Loading Oracle Part 1: A brief introduction

Filed under: Oracle — kocakahin @ 1:37 am

Introduction

Besides various tests like unit testing, integration testing or regression testing performed in routine development processes, development in Oracle requires another very important testing namely load/stress testing. The behavior of the systems in terms of integrity, performance and availability may dramatically change under load and concurrency.

Possible Risks under Concurrency

Loss of Integrity under Load

In a very well-known scenario by Thomas Kyte in Expert One-on-one, it is illustrated that in case that you don’t use FOR UPDATE statement before specific update operations, it’s pretty much guaranteed that your application will fail in a concurrent environment. For doing this Mr. Kyte, open two simultaneous sessions from SQL*PLUS and performs his test.

This example and many others are good for demonstrative purposes. But it is also true that real life is much more complicated then the laboratory conditions. So there is a common need for testing a program unit under concurrent executions in PL/SQL development environments.

Performance Degeneration under Load

If you ever read Oracle concept guide most probably you are familiar with parameters like INITTRANS, FREELISTS for tables. Or you must hear that heavy indexing over OLTP systems will degenerate the performance of DML operations. But I am pretty sure that you have never seen the effect of those by eyes unless you suffer from them in a real production environment.

So another requirement for concurrent programming is to test the performance of the configuration or performance of modules under load before a swarm of users test them.

Non-scalability under Load

You may know that in case that heavy locking/blocking will increase the average response time of system users in any concurrent application. This is also true for Oracle. In case that you obtain unnecessary TM or TX locks, it is true that sooner or later you will complete your task but system will be unavailable for other users requiring the same locks. This will obviously decrease the scalability of a concurrent environment.

So we also need to measure the level of scalability before deploying a non-scalable application.

 
 

Solutions to Problem

After stating the possible risks may be faced in a concurrent environment, another good thing will be to propose some solution methods. Here are most popular of them:

Large Loops

Idiot Loopers

Most people use this method. They believe that looping from a single session means testing system under load. Yes it may be true that Oracle performs many I/Os and uses lots of CPU cycles, but unfortunately this has nothing to do with concurrency. You are still working over a single session and there is only you acting on your test application.

Advanced Loopers

More expert loopers try this method. They execute their loops from 5-10 parallel sessions. By doing this they create a parallel system use. But this approach has problems also. We may list them as follows:

  • 5-10 level of concurrency is usually not a sensible level for many real-world applications.
  • Pure looping by itself is problematic in the sense that it doesn’t represent the real life. In real life users don’t open the application at the same time. You need to model the so-called ramping in your testing environment.
  • Creating, managing, closing parallel sessions over SQL*PLUS may be difficult. You need to sit in front of your desk to monitor whether all go successfully.

Real World Tests

This is maybe the best solution in terms of results among all others. You may find 100s of testers before production and let them to test your system over their own SQL*PLUS sessions. But most of the time this is also the most expensive method. Also you can never be sure that all testers perform their test cases on time or coordination of those testers may be a real problem for a concurrent test.

Simulations

The final method is the topic of rest of the blog. In the broadest sense, simulation may be defined as the imitation of the real-life. This is done by the use of probability theory and statistics. Using simulation model will allow us to do the followings:

  • Testing our PL/SQL modules and Oracle for 0-1000 users.
  • Loading and de-loading system using ramping.
  • Modeling interarrival times for user requests.
  • Mix testing.
  • Logging concurrent environment errors.
  • Measuring concurrent environment response times.
  • Deciding whether one PL/SQL module or Oracle configuration is really better than the other under load.

Loading Oracle 2: Hundreds of Virtual Users in Oracle 10g…

Blog at WordPress.com.