Category Archives: Datawarehousing
Sometimes I find opportunities to perform theoretical presentations. Final one was on Support Vector Machines to young engineers in Turkcell Technology. SVM is not just one of the machine learning algorithms with strong theoretical settlement but also trying to understand SVM, let us play with various theoretical stuff like linear algebra, optimization theory, topology, etc. to build up a machine learner.
Keep in mind that this presentation is just a very brief introduction to SVM. Implementation details require a separate presentation. So the name of my presentation is theory ≤ SVM < implementation. I hope you enjoy it.
The podcast we have recently performed is available. Click here…
In a data warehouse database it is always likely to find some static data portion (usually a large portion) . This means that this portion of data will not be updated any more but rather it will be queried. In this post you will find the benefit of compressing your historical segments in a data warehouse environment.
Cost of creating a compressed table
It is usually known that creating compressed tables requires more time. Let me show why this is the case. We prepare a source table called big_table_new using Thomas Kyte’s infamous script (http://www.apress.com/book/downloadfile/2351). Then we create two identical tables, except that one is compressed and the other is not. Moreover we use runstats_pkg package to compare two executions of CTAS:
|hsensoy@dds> set timing onhsensoy@dds> set autot tracehsensoy@dds> exec runstats_pkg.rs_start;PL/SQL procedure successfully completed.
hsensoy@dds> create table big_table_nocomp pctfree 1 nologging tablespace LMT_AUTO_NEW parallel 32 as select * from big_table_new;
hsensoy@dds> exec runstats_pkg.rs_middle;
PL/SQL procedure successfully completed.
hsensoy@dds> create table big_table_comp compress pctfree 1 nologging tablespace LMT_AUTO_NEW parallel 32 as select * from big_table_new;
hsensoy@dds> exec runstats_pkg.rs_stop;
Run1 ran in 9974 hsecs
Run2 ran in 14599 hsecs
run 1 ran in 68,32% of the time
Name Run1 Run2 Diff
STAT…Elapsed Time 10,046 14,670 4,624
STAT…user I/O wait time 57,814 8,085 -49,729
STAT…CPU used by this sessio 117,697 333,567 215,870
STAT…recursive cpu usage 117,615 333,497 215,882
STAT…physical writes direct 2,653,136 956,400 -1,696,736
STAT…undo change vector size 6,478,648 4,326,472 -2,152,176
STAT…redo size 33,033,884 18,905,292 -14,128,592
You see only a small portion of runstats_pkg results. As we look at the results, we immediately understand that the reason of compressed table creation to take longer time than uncompressed table creation. It is due to CPU time. Compression algorithm requires a significant computation time (x2 CPU time). But notice that since it creates a smaller data segment, it generates less redo, consumes less undo space, and performs less I/O requests.
How much to compress?
This is directly related to level of REDUNDANCY within a data block. If the rows with in a data block contains many common value columns this means the benefit you get from compression will be higher. Let’s show this. To make a redundancy prediction, we first analyze the BIG_TABLE_NEW from which we load the data. This is for filling the necessary columns of dba_tab_cols view, which we use later on.
|hsensoy@dds> begindbms_stats.gather_table_stats(ownname => ‘HSENSOY’,
tabname => ‘BIG_TABLE_NEW’,
degree => 32);
PL/SQL procedure successfully completed.
Remember that we will make only a rough prediction(later you will see that it works pretty good) based on the fact that sortting data such according to most redundant columns (least distinct ones) will create table whose blocks contains highly redundant data. As we look at the dba_tab_cols table statistics we immediately see that some columns are significantly redundant.
|hsensoy@dds> select dtc.column_name, dtc.num_distinct, dtc.avg_col_lenfrom dba_tab_cols dtc
where dtc.owner = ‘HSENSOY’ and dtc.table_name = ‘BIG_TABLE_NEW’
order by dtc.num_distinct,dtc.avg_col_len desc;
COLUMN_NAME NUM_DISTINCT AVG_COL_LEN
————— —————- ———–
STATUS 1 6
SECONDARY 1 2
GENERATED 2 2
TEMPORARY 2 2
OWNER 17 6
OBJECT_TYPE 31 10
SUBOBJECT_NAME 650 2
CREATED 756 8
TIMESTAMP 761 20
LAST_DDL_TIME 767 8
DATA_OBJECT_ID 5398 2
OBJECT_NAME 27059 24
OBJECT_ID 50581 5
ID 200043516 7
14 rows selected.
Then we create the table using CTAS and we sort the data according to column order we obtain in previous step.
|hsensoy@dds> create table big_table_careful_comp compress pctfree 1 nologging tablespace LMT_AUTO_NEW parallel 32 asselect *
order by status,
And ta taa!!! Same data and same compression algorithm but only 5 GB with compared to randomly ordered data.
|hsensoy@dds> column segment_name format a30
hsensoy@dds> select segment_name, trunc(bytes / 1024 / 1024 / 1024,2) gigabytes, blocks, extents
where owner = ‘HSENSOY’
and segment_name in (‘BIG_TABLE_COMP’, ‘BIG_TABLE_CAREFUL_COMP’);
SEGMENT_NAME GIGABYTES BLOCKS EXTENTS
—————————— ———- ———- ———-
BIG_TABLE_COMP 14,67 961944 4129
BIG_TABLE_CAREFUL_COMP 5,27 345744 2793
So remember that in a data warehouse database, in contrast to an operational one, it may be possible to load data sorted into tables in ETL step. You have to take into consideration of the fact that sorted data is much more compressible.
In the previous section we see that compressed table (or partition, subpartition,index) segments require less space on disk. One may infer that this will improve query performance. It’s true. Let’s see how
|hsensoy@dds> select count(*) from big_table_nocomp;
hsensoy@dds> select count(*) from big_table_comp;
hsensoy@dds> select count(*) from big_table_careful_comp;
Remember that timing may be misleading. To see that compression really works let’s look closer:
|No Compression||Compress Unsorted Data||Compress Sorted Data|
Yes, querying a compressed table really works well in terms of I/O and CPU usage.
To sum up, we may conclude that data warehouse database may utilize Oracle table compression option by making a correct physical database design. That’s because their fundamental characteristics is being redundant/denormalized. So in order to improve your query performance, storage efficiency and in return shorten your backup & recovery timings you HAVE TO carefully think on using compression option.
Finally, I introduce some practical practics on using compression:
- Separate your OLD data from the NEW one in terms of tablespace whenever it’s possible.
- Instead of loading data into compressed tables. You load the data swiftly into a uncompressed segment and then recreate segment periodically as compressed using alter table statements clause.
- Keep your compressed segments in readonly tablespaces whenever it is possible to prevent compressed blocks from update operations. Update cause block to be decompressed decreasing the effect of compression.
- Tables range partitioned according to a date column are usually best candidates for partition/subpartition compression. Write scheduled scripts to perform segment compression.
- Talk with your business requirement, development and administration team to identify which partition of data is read-only and suitable for compression (For example, six month old data is static)
As far as I have recently read, Oracle doesn’t allow you to set some physical features of sub-partitions such as compression clause. I have been mistaken that composite partitioned tables can not be compressed at all. However in data warehouse environments, a general requirement is being able to compress segments of historic partitions of RANGE + HASH partitioned tables. I have tried a few scenarios and finally found a solution. Yes, Oracle allows it.
One approach can be to compress all subsegments by trying recreating partition with COMPRESS clause in parallel. However this approach will not work:
ALTER TABLE X MOVE PARTITION 2007JAN NOLOGGING COMPRESS PARALLEL 16;
You will immediately face with ORA-14257: cannot move partition other than a Range or Hash partition exception. This means Oracle can not recreate all subpartition segments at a time.
Another approach is to try recreating compressed subpartition segments by using COMPRESS option one by one:
ALTER TABLE X MOVE SUBPARTITION 2007JAN_01 NOLOGGING COMPRESS PARALLEL 16;
This time you will come up with ORA-14160: this physical attribute may not be specified for a table subpartition. That’s because Oracle doesn’t allow you to set some physical attributes of subpartitions. NOLOGGING and COMPRESS are two of those attributes. Those attributes inherit from partition of subpartition.
Change the physical attribute at partition level
The first thing you have to do is to change physical attribute (compression attribute) of the partition which owns the subpartition by using MODIFY clause:
ALTER TABLE X MODIFY PARTITION 2007JAN COMPRESS;
This statement will modify partition parameter in such a way that all new data blocks created by direct path inserts will be compressed. Remember that the table has composite partitioning. This means subpartition segment blocks are the compressed blocks.
Recreate subpartition segments
“ALTER TABLE…” statement that we have used above affects only the blocks that are created after the statement has been issued. In order to compress existing subpartition blocks, we need to recreate them:
ALTER TABLE X MOVE SUBPARTITION 2007JAN_01 PARALLEL 16;
ALTER TABLE X MOVE SUBPARTITION 2007JAN_02 PARALLEL 16;
ALTER TABLE X MOVE SUBPARTITION 2007JAN_03 PARALLEL 16;
ALTER TABLE X MOVE SUBPARTITION 2007JAN_04 PARALLEL 16;
All these four subpartition segments will be recreated in compressed form within the tablespace(s) in which they have been currently residing. That is because they inherit the compression attribute from their parents (partition of subpartitions).
Check *_SEGMENTS view to see the effect
As you execute the following command you will see that your segments have already been compressed:
SELECT partition_name, bytes / 1024 / 1024 / 1024 gigabytes
WHERE segment_name = ‘X’;
In order to test this scenario I have used one of the summary tables of our data warehouse. Table is RANGE+HASH partitioned and one of these partitions is ~85 GB before compression and ~22 GB after all sixteen subpartitions have been compressed.
Remember that Oracle unfolds, considering that the term is correct, the compressed blocks if you update the data in compressed ones, which decreases the benefit you obtain by using compression. So Oracle advises using compression option with read only tablespaces to guarantee the gain.