Handling large tables

Greg.NET · Jun 18, 2007

I have a few tables that have 2-3 million records each and thousands more records are inserted every day. Queries are starting to slow down a lot and Im wondering what would be the best way to handle this.

Usually a record won't be used again after a few months from its insertion date. But once in a great while one from a year ago will be updated, so its not an option to simply delete old records.

If you have any advice for me I would greatly appreciate it!

I am relatively new to sql or any databasing, but I've been reading about sql 2005's partitioning feature and it sounds like my database could benefit from that.

ss7thirty · Jun 18, 2007

Index

Have you indexed the fields that you are going to be querying on? This can significantly increase speed

Greg.NET · Jun 18, 2007

I indexed all the fields that I felt needed to be indexed. But is it at all beneficial to index bit fields or other non selective fields?

ss7thirty · Jun 18, 2007

It can be very beneficial to index fields in groups. For example:

Tbl_Houses
Fields:
StreetNumber (varchar(10))
StreetName (varchar(55))
IsDemolished (bit)
Field1
Field2
Field3

You are running a select query:

SELECT * FROM Tbl_Houses WHERE StreetNumber = '1' and StreetName = 'Main Street' and IsDemolished = '1'

If you had an index called Address_Demolished that contained both StreetNumber, StreetName amd IsDemolished this would significantly improve the speed of this select statement.

You should create indexes for each type of query you are going to run. The index creates a pointer to the records and will SIGNIFICANTLY increase performance in larger databases. I was working with a databse that contained every address in New York City. Before I created indexes it would take almost a whole day to update the data on a box of images. After creating them this process would take just a few minutes.

As far as creating a single field index for a bit value, I could not see this having a helpful impact because so many of the fields will have the same value. But if you use the bit value in combination with a couple other fields it could be helpful.

The one thing that I realized is that indexing single fields is not always that helpful is HUGE databases when you are running queries that involve multiple fields. What I have learned is that you should create multiple field indexes.

Best of luck,

Steve

Greg.NET · Jun 18, 2007

thank you very much that makes a lot of sense. But one thing I worry about is disk space. If I go "index crazy" won't that drastically increase the size on disk? Keep in mind I have 60 columns in one of my tables!

right now I only index the fields that are the most selective(Id, etc). I don't have any indexes with multiple fields but Ill experiment and see what happens.

ss7thirty · Jun 19, 2007

That I am not too sure about. I dont think it should be too bad. If you use the indexes wisely I don't think you will feel the effects too much. However, you do want to make sure that you only create necessary indexes and ones that are efficient. Which ever field combinations are in your most common select statements, etc. will help them run faster. If you use them wisely in your most common queries or only the ones that are taking forever, you should be fine.

cjard · Jun 26, 2007

Greg.NET said:
thank you very much that makes a lot of sense. But one thing I worry about is disk space. If I go "index crazy" won't that drastically increase the size on disk?

No, but it will slow down your INS/UPD/DEL statements because of all the index mintenance.

We cant really answer your question without looking at the plans for the execution of the queries. Database performance tuning is an art, not something that can be taught in 3 lines on an internet forum. You need to get deeper into how databases work; that you havent so far created any multi-column indexes kinda implies to me that you dont really know how they work.. It would be worth looking into

right now I only index the fields that are the most selective(Id, etc).

THat's no good if your queries are hitting broad ranges. It costs as much to use the index as it does to just read the entire table. Find all the people in a city who are male = its so low selectivity that its faster to jsut read the table than to read the index on Gender, cache all the row ids that are male (50%) then go look up those row ids.. Bahh... Too much waste of resources.. THe DB will just read the whole table.

I don't have any indexes with multiple fields but Ill experiment and see what happens.

Again, database performance tuning isnt jsut about slamming some indexes in and seeing what happens. THese things need thought, and design. You need to understand how a db stores data, how it fetches it, how it plans its queries and the logic behind using its indexes. You need to think about how you design your tables, data type choice, and you need to think about the way you write queries and know how the analyzer responds to use of certain keywords. You want to find all the records in B that are not in A? Most inexperienced programmers write:

SELECT * FROM B where id NOT IN (SELECT id FROM a)

Because that is how they think. Poor database usually makes a complete hash of such a query and selects a massive list of IDs into memory, which it then loops through (or hashes) for every row in B, looking for no match. In a worst case scene, a million records in each table, this op might need a thousand billion comparisons.. Overnight job.. First thing I put in place at work was a "Dont use IN for a list longer than what you would type by hand" - its a broad rule of thumb, its not always applicable, but it was a good start in curing the horrendous habits of a set of particular poor SQL coders here.

If youre getting serious about performance tuning your database, you need a lot of reading material.. Remember, weeks of coding can save you hours of planning

Greg.NET · Jul 6, 2007

I appreciate the help everyone

A little off topic but what would you recommend besides

SELECT * FROM B where id NOT IN (SELECT id FROM a) ? Generally speaking is there a more efficient way of doing this or are you trying to say that depending on the application theres usually a better way?

Also, what's wrong with experimenting? Its not like I'm going to randomly throw together a bunch of indexes. I have to do this on a weekend because that's the only down time for the server so I've been researching this for weeks in order to make it worth my time. But I won't know until I fool around with it for a while and see what works best.

ss7thirty · Jul 6, 2007

There is absolutely nothing wrong with experimentation...I dont know who told you that. If I did I'm sorry, I could not have been more wrong. That is the only way I ever learned anything really.

cjard · Jul 6, 2007

Greg.NET said:
I appreciate the help everyone

A little off topic but what would you recommend besides

SELECT * FROM B where id NOT IN (SELECT id FROM a) ?

VB.NET:

SELECT
  b.*
FROM
  b
  LEFT OUTER JOIN
  a
  ON b.id = a.id
WHERE
  a.id IS NULL

Generally speaking is there a more efficient way of doing this or are you trying to say that depending on the application theres usually a better way?

As a broad rule of thumb for optimization, avoid use of IN for long lists

Also, what's wrong with experimenting?

Erm, nothing.. And you do learn, albeit at a slower pace than reading prepared material from someone who has written it with the aim of teaching in mind..

But I won't know until I fool around with it for a while and see what works best.

Indeed, I just would always advocate "educated guessing" over "random guessing"

Greg.NET · Jul 7, 2007

hmmm would I also be able to do something like:

Update a set a.status = 0
from a INNER join b on a.foreignid = b.id
where b.field = something

Because what I've been doing is:

Update a set a.status = 0
where a.foreignid IN (select b.id from b where b.field = something)

Greg.NET · Jul 7, 2007

also, I only know that this works because I tried it. Do you have a link that would explain why this actually works? Because I would interpret

SELECT
b.*
FROM
b
LEFT OUTER JOIN
a
ON b.id = a.id
WHERE
a.id IS NULL

to mean that the in the a table, the id field must be null.

ss7thirty · Jul 8, 2007

Is there anyway you can better explain what each of the joins do because I am still a bit unclear about all of the variants. My understanding is that there are 4 different types:

RIGHT OUTER
LEFT OUTER
RIGHT INNER
LEFT INNER

I think that this is where the confusion is coming in.

cjard · Jul 9, 2007

Greg.NET said:
hmmm would I also be able to do something like:

Update a set a.status = 0
from a INNER join b on a.foreignid = b.id
where b.field = something

Because what I've been doing is:

Update a set a.status = 0
where a.foreignid IN (select b.id from b where b.field = something)

The first query can(and would, if I were writing it) be used to perform the work of the second. Another understood form can be:

VB.NET:

UPDATE table a SET a.field = value
WHERE EXISTS (SELECT 1 FROM anotherTable b WHERE a.id = b.id)

Of all the options, I would still go with the first. In oracle, because we dont have UPDATE .. FROM we either write an updatable view, which is hard because the optimizer is particularly retarded sometimes, or we would use a MERGE statement.. If youre curious how the EXISTS form differs from the IN form, have a google for it..

cjard · Jul 9, 2007

Greg.NET said:
also, I only know that this works because I tried it. Do you have a link that would explain why this actually works? Because I would interpret

SELECT
b.*
FROM
b
LEFT OUTER JOIN
a
ON b.id = a.id
WHERE
a.id IS NULL

to mean that the in the a table, the id field must be null.

Nothing can ever equal null, not even another null. Try this:

SELECT CASE WHEN null = null THEN 'null equals null' ELSE 'null doesnt equal null' END

Because null represents unknown, we cannot say whether one unknown equals another unknown, because they are unknown. In schrodingers cat experiment, we cannot say whether the cat is dead or alive. If there were two boxes with two cats, we couldnt say if they were in the same state. This concept is core to this discussion about nulls and their unknownness.

A left join takes the table on the left and matches up every row on the right, putting nulls in for those rows where no match is found:

color.id, color.val
--
1, red
2, green
3, blue

fruit.name, fruit.color_id
--
apple, 1
apple, 2
lime, 2

color LEFT JOIN fruit:
1, red, apple, 1
2, green, apple, 2
2, green, lime, 2
3, blue, null, null

No fruit are blue. The blue line exists with null fruit.

So it stands to reason that we can say "which colours have no associated fruit?" by saying
WHERE fruit.name IS NULL

Note that I said "null can never be equal another null"

This means that any fruit with a null colour ID cannot be a part of the join. If we had a row where we really didnt know what colour a lemon was:

lemon, null

This row will NOT show in color LEFT JOIN fruit, because the join condition is an equals (color LEFT JOIN fruit ON colorid = id) and we know that NULL = <anything> is false

If fruit.colorid is NULL it cannot ever be equal to any color.id, so it is dropped from the results.

Also, if we had a colour that we didnt have an id for:

null, magenta

No fruits would ever be classed as magenta, not even our lemon, for the same reason. The id of this color is null, it cannot equal anthing else, ever, it doesnt participate in the join and the result of the join condition is always false where one or both sides are null.

-

So basically, where your thinking was confused was in assuming that any row in A where the ID was null, would match up with any row in B where the id was also null; not so. The results of

a LEFT JOIN b

a1, b1
a2, null <-- hole
a3, b3
a4, b4
a5, null <-- hole

produce a "solid column of a rows, with no holes", a weak column of b rows, having holes where there was no match"

You will never see a null in the a column.

ergo, for every row in a join b where b.id is null, it means there is no row in b for that row in a. Or, "the <a.id> is NOT IN <list of all b.id> "

--

It is also worth noting here that if you use a LEFT OUTER JOIN and then make a WHERE clause that uses any operator other than IS or IS NOT, then you revert the query to an inner join:

VB.NET:

SELECT
  *
FROM
  color
  LEFT OUTER JOIN
  fruit
  ON
    fruit.colorid = color.id

WHERE
  fruit.price > 100

Remember our lines:
color.id, color.val, fruit.id, fruit.colorid, price
1, red, apple, 1, 200
2, green, apple, 2, 200
2, green, lime, 2, 300
3, blue, null, null, null

No fruit are blue, the price will be listed as null for blue fruit. Because NULL > 100 is "unknown" it is made false, and the row is dropped.

Ive seen many queries where someone asked "this is supposed to return me all X that dont have a matching Y but it doesnt return anything" and its been because there has been a where clause trying to equate Y

In this case, either move the condition into the join, not the where, or provide for the cases where there is a null:

VB.NET:

SELECT
  *
FROM
  color
  LEFT OUTER JOIN
  fruit
  ON
    fruit.colorid = color.id AND
    fruit.price > 100

VB.NET:

SELECT
  *
FROM
  color
  LEFT OUTER JOIN
  fruit
  ON
    fruit.colorid = color.id

WHERE
  (fruit.price > 100 OR fruit.price IS NULL)

Remember to use brackets if youre ANDing other conditions into the where.

Handling large tables

Greg.NET

Active member

ss7thirty

Well-known member

Greg.NET

Active member

ss7thirty

Well-known member

Greg.NET

Active member

ss7thirty

Well-known member

cjard

Well-known member

Greg.NET

Active member

ss7thirty

Well-known member

cjard

Well-known member

Greg.NET

Active member

Greg.NET

Active member

ss7thirty

Well-known member

cjard

Well-known member

cjard

Well-known member

Similar threads

Share this page

Latest posts