Thursday, January 28, 2010

Making Numeric Row Identifiers in PostgreSQL

Around a month ago I started to learn how to use PostgreSQL to create, maintain, and query databases. It wasn't hard to find things in my personal life that were appropriate for databasing: Job applications, Hydroelectricity usage/charges, usernames and passwords. Putting data from these aspects of my life into databases has given me some great practice using a query language.

The thing about databasing that really trips me up is the concept of indexing. Depending on how you query (search) your database, any one record (row) in your database could show up in the 11th, 27th, 1st or 99th row of your query results. Hence, trying to access the record in which you stored your facebook username and password by its absolute row number just won't work. That's a concept for Python and not PostgreSQL (please correct me if I'm wrong!). What you can do however is assign each of your database records a unique numerical identifier and so any time you need to modify a single row, you can refer to it by its identifier no matter what the content in the other fields of that record.

What I wish I knew when I made my database is that it is REALLY easy to create a type of column in your table that automatically assigns unique numerical identifiers to your records when you insert new values into the table. When you're making the table in your database, declare your id column as type SERIAL. Then when inserting new values into your table, all you have to do is enter the word DEFAULT in place of a number that you would come up with yourself. The number id in the SERIAL column type seems to start at 1 by default, obviously incrementing by 1. Nice and simple. Following is an example:

CREATE TABLE test (
blah SERIAL
yadda varchar(50)
);
INSERT INTO test VALUES(DEFAULT, 'hello');
INSERT INTO test VALUES(DEFAULT, 'goodbye');

SELECT * FROM test;
blah | yadda
------+---------
1 | hello
2 | goodbye
(2 rows)



In my situation, I didn't seem to be able to make a SERIAL column for my existing table, or convert a column I already had. So, I needed a way of generating unique numerical identifiers after 100 records had already been inputted. I found two solutions (if there are others, please tell) to work for me. First the easy solution:

ALTER TABLE tablename SET WITH OIDS;

OIDS stands for Object Identifiers. When you use the above statement, PostgreSQL assigns a unique numerical identifier to each row in your table. This solution didn't look as clean to me as the scenario where the SERIAL type column was made when the table was made. The raw value that the OIDS in my table start at 16522, which is a little weird. I think this might be due to data that I have in other databases on my computer, but I can't be sure right now. The bonus is that the OIDS seem to have been assigned roughly according to the order in which I inputted these records into the table in my database. Thus it's a simple solution and allows me to refer to any one record by a simple unique identifier. I've read that the raw value of these OIDS will wrap around to something very small after you reach about 4 billion records. I don't feel in danger of that using PostgreSQL for personal matters though!

The next solution was more difficult, but gave me nicer results. First you need to figure out how you want to order the records in your database. I wanted to order them by date, so let's call my record dating column rdate. Unfortunately, many records were inputted on the same date, which means if I only relied on the date for ordering, I would get many records having the same numerical index. So, you have to take advantage of as many additional fields in your table that will make your record unique, and ensure that the following process results in as many different numerical identifiers as there are records. Let's say that in addition to specifying rdate as your ordering column, you also specify columns yadda, and blah. Also, your id variable is simply named id. You would then define the following function in PostgreSQL (my reference for this function comes from a pgsql mailing list posting):

CREATE OR REPLACE FUNCTION reorder_table() RETURNS INTEGER AS '

DECLARE
newcode INTEGER ;
table_record RECORD ;

BEGIN
newcode := 1 ;

FOR tablerecord IN SELECT * FROM tablename ORDER BY rdate LOOP

UPDATE tablename SET id = newcode
WHERE rdate = table_record.rdate AND yadda = table_record.yadda AND blah = table_record.blah;

newcode := newcode + 1 ;

END LOOP ;

RETURN 1;
END;
' LANGUAGE plpgsql;


Once you've entered this in, all you have to do is call your function through a simple statement: SELECT reorder_table();. If your inclusion of yadda and blah made each record in your table unique enough, then you will now have numerical identifiers ranging from 1 to the total number of records in your table! Of course if you want your numerical identifiers to be assigned according to the order of a different column in your table, replace rdate in the SELECT statement in the above function with the column name that you want as your ordering column.

What I like about using the OIDS in my situation is that the identifers are automatically generated, whereas I would need to run the above function every time I update the table in my database in order that every record has an identifier. Suffice it to say that in the future, I will create a SERIAL type column next time I need identifiers like this!

No comments:

Post a Comment