SQL Joins

Image
Joins are used to combine rows from two or more tables based on related columns between them. Joins allow you to retrieve data from multiple tables simultaneously, enabling you to create complex queries that fetch data from different sources. There are different types of joins in SQL, including: INNER JOIN Returns only the rows that have matching values in both tables based on the specified join condition. It discards non-matching rows from both tables. Example:           create table t1(x int); insert into t1 values(1); insert into t1 values(1); insert into t1 values(0); create table t2(y int); insert into t2 values(0); insert into t2 values(1); insert into t2 values(1);           select * from t1 inner join t2 on t1.x = t2.y Output: 2. LEFT JOIN (or) LEFT OUTER JOIN Returns all the rows from the left (or first) table and the matching rows from the right (or second) table. If there is no match, NULL values are

SortKey, DistKey & Compression

Sortkeys

Compound Sortkeys


Effecient when used in the where conditions are in same order as specified in the sortkey.
This is the default sortkey .

ex : sortkey (column1, column2)

where column1='somevalue' and column2='somevalue'  give better performance than where column2='somevalue'


Interleaved Sortkeys


Gives equal weightage to all the columns specified in the sortkey. Gives better results than the compound sortkey in the circumstances when cannot able to identity columns used in the where conditions.

ex : interleaved sortkey (column1, column2)

As time grows the data added to the existing table grows , it will degrade the performance on both the compound and interleved sortkeys , but it has greter effect on interleaved key.Need to Vaccum Reindex sort key tables whenever new data added.

http://docs.aws.amazon.com/redshift/latest/dg/r_vacuum-decide-whether-to-reindex.html


Distribution Key


Defines how rows should be stored across the nodes.Usually moving the data between the nodes during joins consumes more time, so we have to more careful in defining distribution keys.
we can define only one distkey per fact table.

if sort and distkey uses the same column , it would help the query optimizer to choose effective MERGE JOIN step instead of other join types.

In some cases we cannot come to the conclusion in identifying the distkey , in that case we can use diststyle even , it would randomly distribute the rows across the nodes.

for the dimension table if data are minimal then use diststyle all, it would copy the entire table across all the nodes , so it would avoid the cost effective DB_BDCAST,DB_DB_INNER.

Compresssion Encoding


Compression techniques helps to reduce the data size in disk , so during the join operation the size of the data moved between nodes will get reduced , it will gradually increase the performance.
We can enable the compression encoding for all the columns in table.
Its better to avoid enabling the encoding for the sort keys , so while processing it has to decode the sort column and where conditions will be applied it will cause performance degrading.
We can enable compression either manually or automatically.
If its manual , then we need to analyze the columns before enabling encoding.

analyze compression tablename

will do this.But with sufficient data it would suggest more accurate encoding type.

The best option is to leave the finding of perfect encoding type to redshift by enabling endoing during copy like

COPY ... COMPUPDATE ON


Comments

Popular posts from this blog

Machine Learning Foundations - Deep Learning Summary - Quiz

Machine Learning Foundation - Deep Learning - Programming Assignment

Machine Learning Foundations - Recommender System - Quiz