Introduction:
It is a database design technique that reduces data redundancy and eliminates Insertion, Deletion, and Update anomalies.
It divides larger table in to smaller tables and links them using relationships.
Database Normalization is used for following Purpose:
To Eliminate the redundant or useless data
To Reduce the complexity of the data
To Ensure the relationship between tables as well as data in the tables
To Ensure data dependencies and data is logically stored.
Need Of Database Normalization :
If your database objects are not in structured or normalized way it is difficult to update the database without facing the database loss.
Insertion,Updation and deletion are very frequent if data is not normalized.
Normalization is part of successful database design.
Without Database normalization the database system can be slow,Inaccurate and inefficient.
Real Life Example of data which is not normalized:
Roll No | Student Name | Branch Name | HOD | Office Phone |
101 | Valli | CSE | Mr. George | 9876543210 |
102 | Nagesh | CSE | Mr. George | 9876543210 |
103 | Aadya | CSE | Mr. George | 9876543210 |
104 | Avyay | CSE | Mr. George | 9876543210 |
In the table above, we have data of 4 Computer Sci. students. As we can see, data for the fields branch, hod(Head of Department) and office_tel is repeated for the students who are in the same branch in the college, this is Data Redundancy.
Insertion Anomaly
Suppose for a new admission, until and unless a student opts for a branch, data of the student cannot be inserted, or else we will have to set the branch information as NULL.Also, if we have to insert data of 100 students of same branch, then the branch information will be repeated for all those 100 students.
These scenarios are nothing but Insertion anomalies.
Updation Anomaly
What if Mr. George leaves the college? or is no longer the HOD of computer science department? In that case all the student records will have to be updated, and if by mistake we miss any record, it will lead to data inconsistency. This is Updation anomaly.
Deletion Anomaly In our Student table, two different informations are kept together, Student information and Branch information. Hence, at the end of the academic year, if student records are deleted, we will also lose the branch information. This is Deletion anomaly.
Normalization Rule Normalization rules are divided into the following normal forms:
First Normal Form
Second Normal Form
Third Normal Form
BCNF
Fourth Normal Form
First Normal Form (1NF) For a table to be in the First Normal Form, it should follow the following 4 rules:
It should only have single(atomic) valued attributes/columns.
Values stored in a column should be of the same domain
All the columns in a table should have unique names.
And the order in which data is stored, does not matter.
Time for an Example:
Roll No | Student Name | Subject |
101 | Valli | OS,CN |
103 | Nagesh | Java |
102 | Aadya | C,C++ |
The above table satisfies 3 out of 4 rules of 1NF. So now if we want bring the above table into 1NF we have update the table as below:
Roll No | Student Name | Subject |
101 | Valli | OS |
101 | Valli | CN |
102 | Nagesh | Java |
103 | Aadya | C |
103 | Aadya | C++ |
Using the First Normal Form, data redundancy increases, as there will be many columns with same data in multiple rows but each row as a whole will be unique.
Second Normal Form (2NF) For a table to be in the Second Normal Form,
It should be in the First Normal form.
And, it should not have Partial Dependency.
Dependency
Let's take an example of a Student table with columns student_id, name, reg_no(registration number), branch and address(student's home address).
In this table, student_id is the primary key and will be unique for every row, hence we can use student_id to fetch any row of data from this table
Even for a case, where student names are same, if we know the student_id we can easily fetch the correct record.
Student(Table):
student_id | name | reg_no | branch | address |
10 | Harika | 02-EQ | CSE | South Windsor |
11 | Siva | 03-LK | IT | Manchester |
12 | Harshi | 04-gh | IT | Vernon |
Hence we can say a Primary Key for a table is the column or a group of columns(composite key) which can uniquely identify each record in the table.
So all I need is student_id and every other column depends on it, or can be fetched using it.
This is Dependency and we also call it Functional Dependency.
Partial Dependency
Let's create another table for Subject, which will have subject_id and subject_name fields and subject_id will be the primary key.
subject_id | subject_name |
1 | Java |
2 | C++ |
3 | Php |
Now we have a Student table with student information and another table Subject for storing subject information.
Let's create another table Score, to store the marks obtained by students in the respective subjects. We will also be saving name of the teacher along with marks.
score_id | student_id | subject_id | marks | name of the teacher |
1 | 10 | 1 | 70 | Java Teacher |
2 | 10 | 2 | 75 | C++ Teacher |
3 | 11 | 1 | 80 | Java Teacher |
In the score table we are saving the student_id to know which student's marks are these and subject_id to know for which subject the marks are for.
Together, student_id + subject_id forms a Candidate Key for this table, which can be the Primary key, but the teacher's name only depends on subject, hence the subject_id, and has nothing to do with student_id.
This is Partial Dependency, where an attribute in a table depends on only a part of the primary key and not on the whole key.
To remove partial dependency
There can be many different solutions for this, but out objective is to remove teacher's name from Score table.
The simplest solution is to remove columns teacher from Score table and add it to the Subject table. Hence, the Subject table will become:
Subject(Table):
subject_id | subject_name | Teacher |
1 | Java | Java Teacher |
2 | C++ | C++ Teacher |
3 | Php | Php Teacher |
And our Score table is now in the second normal form, with no partial dependency.
Score(Table):
score_id | student_id | subject_id | marks |
1 | 10 | 1 | 70 |
2 | 10 | 2 | 75 |
3 | 11 | 1 | 80 |
Third Normal Form (3NF) A table is said to be in the Third Normal Form when,
It is in the Second Normal form.
And, it doesn't have Transitive Dependency.
So Lets use the same example where we have 3 tables, Student, Subject and Score.
In the Score table, we need to store some more information, which is the exam name and total marks, so let's add 2 more columns to the Score table.
score_id | student_id | subject_id | marks | exam_name | total_marks |
Transitive Dependency:
When a non-prime attribute depends on other non-prime attributes rather than depending upon the prime attributes or primary key is called Transitive Dependency.
Our new column exam_name depends on both student and subject. For example, a mechanical engineering student will have Workshop exam but a computer science student won't. And for some subjects you have Prctical exams and for some you don't. So we can say that exam_name is dependent on both student_id and subject_id.
And the column total_marks depends on exam_name as with exam type the total score changes. For example, practicals are of less marks while theory exams are of more marks.
But, exam_name is just another column in the score table. It is not a primary key or even a part of the primary key, and total_marks depends on it.
Score(Table):
score_id | student_id | subject_id | marks | exam_id |
To Remove Transitive Dependency:
Take out the columns exam_name and total_marks from Score table and put them in an Exam table and use the exam_id wherever required.
Exam(Table):
exam_id | exam_name | total_marks |
1 | Workshop | 200 |
2 | Mains | 70 |
3 | Practicals | 20 |
Advantage of removing Transitive Dependency
The advantage of removing transitive dependency is,
Amount of data duplication is reduced.
Data integrity achieved.
Boyce and Codd Normal Form (BCNF) Boyce and Codd Normal Form is a higher version of the Third Normal form. This form deals with certain type of anomaly that is not handled by 3NF. A 3NF table which does not have multiple overlapping candidate keys is said to be in BCNF.
For a table to be in BCNF, following conditions must be satisfied:
It must be in 3rd Normal Form
and, for each functional dependency ( X → Y ), X should be a super Key.
Time for an Example:
Below we have a college enrolment table with columns student_id, subject and professor.
student_id | subject | professor |
101 | Java | P.Java |
101 | C++ | P.Cpp |
102 | Java | P.Java2 |
103 | C# | P.Chash |
104 | Java | P.Java |
In the table above:
One student can enrol for multiple subjects. For example, student with student_id 101, has opted for subjects - Java & C++
For each subject, a professor is assigned to the student.
And, there can be multiple professors teaching one subject like we have for Java.
student_id, subject together form the primary key, because using student_id and subject, we can find all the columns of the table.
One professor teaches only one subject, but one subject may have two different professors.
Hence, there is a dependency between subject and professor here, where subject depends on the
professor name.
This table satisfies the 1st Normal form because all the values are atomic, column names are unique and all the values stored in a particular column are of same domain.
This table also satisfies the 2nd Normal Form as their is no Partial Dependency.
And, there is no Transitive Dependency, hence the table also satisfies the 3rd Normal Form.
In the table above, student_id, subject form primary key, which means subject column is a prime attribute.
But, there is one more dependency, professor → subject.
And while subject is a prime attribute, professor is a non-prime attribute, which is not allowed by BCNF. So we have to split the above table into two tables :
Student(Table):
student_id | p_id |
101 | 1 |
101 | 2 |
102 | 4 |
103 | 3 |
104 | 1 |
Professor(Table):
p_id | professor | subject |
1 | P.Java | Java |
2 | P.Cpp | C++ |
3 | P.Chash | C# |
4 | P.Java2 | Java |
Fourth Normal Form (4NF) A table is said to be in the Fourth Normal Form when,
It is in the Boyce-Codd Normal Form.
And, it doesn't have Multi-Valued Dependency.
Multi-Valued Dependency:
A table is said to have multi-valued dependency, if the following conditions are true,
For a dependency A → B, if for a single value of A, multiple value of B exists, then the table may have multi-valued dependency.
Also, a table should have at-least 3 columns for it to have a multi-valued dependency.
And, for a relation R(A,B,C), if there is a multi-valued dependency between, A and B, then B and C should be independent of each other.
If all these conditions are true for any relation(table), it is said to have multi-valued dependency. Time for an Example:
s_id | course | hobby |
1 | Science | Cricket |
1 | Maths | Hockey |
2 | C# | Cricket |
2 | Php | Hockey |
As you can see in the table above, student with s_id 1 has opted for two courses, Science and Maths, and has two hobbies, Cricket and Hockey.
The columns course and hobby are independent of each other.
So there is multi-value dependency, which leads to un-necessary repetition of data and other anomalies as well.
To remove multi-valued dependency, We need to split the above table into 2 tables.
CourseOpted(Table):
s_id | course |
1 | Science |
1 | Maths |
2 | C# |
2 | Php |
Hobbies(Table):
s_id | hobby |
1 | Cricket |
1 | Hockey |
2 | Cricket |
2 | Hockey |
Now this relation satisfies the fourth normal form.
A table can also have functional dependency along with multi-valued dependency. In that case, the functionally dependent columns are moved in a separate table and the multi-valued dependent columns are moved to separate tables.
Fifth Normal Form(5NF):
A table is said to be in the FifthNormal Form when,
It should be in Fourth Normal Form.
And, it doesn't have Join Dependency.
5NF is satisfied when all the tables are broken into as many tables as possible in order to avoid redundancy.
Its also called as PJNF (Project Join Normal Form).
Time for an Example:
subject | lecturer | semester |
Computer | Valli | Semester 1 |
Computer | Nagesh | Semester 1 |
Math | Nagesh | Semester 1 |
Math | Aadya | Semester 2 |
Chemistry | Avyay | Semester 1 |
In the above table, Nagesh takes both Computer and Math class for Semester 1 but he doesn't take Math class for Semester 2.
In this case, combination of all these fields required to identify a valid data.
Suppose we add a new Semester as Semester 3 but do not know about the subject and who will be taking that subject so we leave Lecturer and Subject as NULL.
But all three columns together acts as a primary key, so we can't leave other two columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2 & P3:
P1
Semester | Subject |
Semester 1 | Computer |
Semester 1 | Math |
Semester 1 | Chemistry |
Semester 2 | Math |
P2:
Subject | Lecturer |
Computer | Valli |
Computer | Nagesh |
Math | Nagesh |
Math | Aadya |
Chemistry | Avyay |
P3:
Semester | Lecturer |
Semester 1 | Valli |
Semester 1 | Nagesh |
Semester 1 | Nagesh |
Semester 2 | Aadya |
Semester 1 | Avyay |
Comments