Published On: Sunday, 27 November 2011
Posted by Muhammad Atif Saeed
Data Management
Today’s Goals: (Data Management)
First of a two-Lesson sequence
Today we will become familiar with the issues and problems related to data-intensive computing
We will find out about flat-files, the simplest databases
Next time, in our 4th Lesson on productivity software, we will discuss relational databases and
implement a simple relational database
Keeping track of a few dozen data items is straightforward
However, dealing with situations that involve a significant number of data items requires more attention
to the data handling process
Dealing with millions - even billions - of inter-related data items requires even more careful thought
36.1 zainBooks.com :
Consider the situation of a large, online bookstore
They have an inventory of millions of books, with new titles constantly arriving, and old ones being
phased out on a regular basis
The price for a book is not a static feature; it varies every once in a while
Thousands of books are shipped each day, changing the inventory constantly
Some are returned, again changing the inventory situation constantly
The cost of each shipped order depends on:
Prices of individual books
Size of the order
Location of the customer
Mode of shipment
For each order, the customer’s particulars – name, address, phone number, credit card number – are
required
Generally, that data is not deleted after the completion of the transaction; instead, it is kept for future
reference
All the transaction activity and the inventory changes result in:
Thousands of data items changing every day
Thousands of additional data items being added every day
Keeping track & taking care (i.e. management) of all that constantly changing and expanding data is not
a trivial task and requires disciplined attention and actions for ensuring the smooth & profitable
operation of the bookstore
36.2 Issues in Data Management:
Data entry
Data updates
Data integrity
Data security
Data accessibility
Data Entry:
New titles are added every day
New customers are being added every day
Some of the above may require manual entry of new data into the computer systems
That new data needs to be added accurately
That can be achieved, for one, by user-interfaces that prevent the input of invalid data
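As a rough illustration of an interface that refuses invalid input, the sketch below checks a new title before it is stored. The function name, the three fields, and the specific rules are assumptions made for this example; the lesson does not prescribe them.

```python
import math


def validate_new_title(title, price_text, stock_text):
    """Return a list of problems; an empty list means the entry is acceptable.

    All rules below are illustrative assumptions, not zainBooks' actual rules.
    """
    problems = []

    # The title must contain something other than whitespace.
    if not title.strip():
        problems.append("title must not be empty")

    # The price must parse as a finite, positive number.
    try:
        price = float(price_text)
    except ValueError:
        problems.append("price must be a number")
    else:
        if not math.isfinite(price) or price <= 0:
            problems.append("price must be positive")

    # The stock count must parse as a whole number that is not negative.
    try:
        stock = int(stock_text)
    except ValueError:
        problems.append("stock must be a whole number")
    else:
        if stock < 0:
            problems.append("stock must not be negative")

    return problems
```

A data-entry form would call such a function on every submission and store the record only when the returned list is empty.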
Data Updates :
Old titles are deleted on a regular basis
Inventory changes every instant
Book prices change
Shipping costs change
Customers’ personal data change
Various discount schemes are always commencing and concluding
All those actions require updates to existing data
Those changes need to be entered accurately
That can also be achieved by user-interfaces that prevent the input of invalid data
Data Security :
All the data that zainBooks has in its computer systems is quite critical to its operation
The security of the customers’ personal data is of utmost importance. Hackers are always looking for
that type of data, especially for credit card numbers
Enough leaks of that type, and customers will stop doing business with zainBooks
This problem can be managed by using appropriate security mechanisms that provide access to
authorized persons/computers only
Security can also be improved through:
Encryption
Private or virtual-private networks
Firewalls
Intrusion detectors
Virus detectors
Data Integrity:
Integrity refers to maintaining the correctness and consistency of the data
Correctness: Free from errors
Consistency: No conflict among related data items
Integrity can be compromised in many ways:
Typing errors
Transmission errors
Hardware malfunctions
Program bugs
Viruses
Fire, flood, etc.
Ensuring Data Integrity:
Type Integrity is implemented by specifying the type of a data item:
Example: Suppose a credit card number must consist of exactly 16 digits. An update attempting to assign a value with more
or fewer digits or one including a non-numeral should be rejected
Limit Integrity is enforced by limiting the values of data items to specified ranges to prevent illegal
values
Example: Age of person should not be negative
Referential Integrity requires that an item referenced by the data for some other item must itself exist in
the database
Example: If an airline reservation is requested for a particular flight, then the corresponding flight
number must actually exist
Physical Integrity is ensured through hardware redundancy, backups, etc
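The first three kinds of integrity can each be expressed as a simple check that runs before an update is accepted. The sketch below is only an illustration: the function names are made up for this example, the required digit count is passed in rather than fixed, and the upper age limit of 150 is an assumption.

```python
def has_type_integrity(card_number, digits):
    """Type integrity: exactly `digits` characters, each one of 0-9."""
    return (
        len(card_number) == digits
        and card_number.isascii()
        and card_number.isdigit()
    )


def has_limit_integrity(age, low=0, high=150):
    """Limit integrity: the value must lie in the allowed range.

    The upper bound of 150 is an assumed, illustrative limit.
    """
    return low <= age <= high


def has_referential_integrity(flight_number, known_flights):
    """Referential integrity: the referenced flight must actually exist."""
    return flight_number in known_flights
```

An update would be rejected as soon as any applicable check returns False. Physical integrity is not shown because it depends on hardware redundancy and backups, not on program logic.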
Data Accessibility:
If the transaction and inventory data is placed in a disorganized fashion on a hard disk, it becomes very difficult to later search for a stored data item
What is required is that:
Data be stored in an organized manner
Additional info about the data be stored so that the data access times are minimized
What if two customers check on the availability of a certain title simultaneously?
On seeing its availability, they both order the title – for which, unfortunately, only a single copy is
available
Same is the case when two airline customers try booking the only available seat
A solution to this concurrency control problem: Lock access to data while someone is using it
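The locking idea can be sketched with two threads standing in for the two customers. This is a minimal, single-process illustration using Python's standard threading module; the class and function names are assumptions, and a real bookstore would rely on the locking built into its DBMS.

```python
import threading


class Inventory:
    """Holds the stock count of one title (hypothetical helper class)."""

    def __init__(self, copies_in_stock):
        self._copies = copies_in_stock
        self._lock = threading.Lock()

    def order_one_copy(self):
        """Check availability and take a copy as one locked step."""
        with self._lock:  # only one customer at a time gets past this line
            if self._copies > 0:
                self._copies -= 1
                return True
            return False


def simulate_two_customers():
    """Two customers try to order the only available copy at the same time."""
    inventory = Inventory(copies_in_stock=1)
    results = []

    def customer():
        results.append(inventory.order_one_copy())

    threads = [threading.Thread(target=customer) for _ in range(2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because the check and the decrement happen while the lock is held, exactly one of the two orders succeeds, whichever customer arrives first.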
We can write our own SW that can take care of all the issues that we just discussed
OR
We can save ourselves lots of time, cost, and effort by buying ourselves a Database Management
System (DBMS) that takes care of most, if not all, of the issues
36.3 DBMS :
DBMSes are popularly, but incorrectly, also known as ‘Databases’
A DBMS is the SW system that operates a database, and is not the database itself
Some people even consider the database to be a component of the DBMS, and not an entity outside the
DBMS
A DBMS takes care of the storage, retrieval, and management of large data sets on a database
It provides SW tools needed to organize & manipulate that data in a flexible manner
[Figure: User/Program ↔ DBMS ↔ Database]
It includes facilities for:
Adding, deleting, and modifying data
Making queries about the stored data
Producing reports summarizing the required contents
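These facilities can be tried out with SQLite, a small DBMS that ships with Python. The lesson does not name any particular DBMS, so SQLite, the table layout, and the stock counts below are all assumptions chosen for illustration.

```python
import sqlite3

# A throwaway in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (title TEXT, author TEXT, price INTEGER, in_stock INTEGER)"
)

# Adding data (the stock counts are made up for this example).
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    [
        ("The Terrible Twins", "kim Champion", 199, 5),
        ("Accounting Secrets", "Zamin Geoffry", 29, 12),
        ("Calculus & Analytical Geometry", "Smith Sahib", 325, 0),
    ],
)

# Modifying data: a price change.
conn.execute("UPDATE books SET price = ? WHERE title = ?", (35, "Accounting Secrets"))

# Deleting data: drop titles that are out of stock.
conn.execute("DELETE FROM books WHERE in_stock = 0")

# Making a query: which titles cost less than 100?
cheap_titles = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM books WHERE price < ? ORDER BY title", (100,)
    )
]

# Producing a summary report: number of titles and total copies in stock.
title_count, total_copies = conn.execute(
    "SELECT COUNT(*), SUM(in_stock) FROM books"
).fetchone()
```

Note that the program never touches a file directly; it only tells the DBMS what it wants done.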
Database:
A collection of data organized in such a fashion that the computer can quickly search for a desired data item
All data items in it are generally related to each other and share a single domain
They allow for easy manipulation of the data
They are designed for easy modification & reorganization of the information they contain
They generally consist of a collection of interrelated computer files
Example: University Student Database:
Student’s name
Student’s photograph
Father’s name
Phone number
Street address
eMail address
Courses being taken
Courses already taken & grades
Pre-VU educational record
Example: zainBooks’ Customer DB:
Name, address, phone & fax, eMail
Credit card type, number, expiration date
Shipping preference
Books on order
All books that were ever shipped to the customer
Book preference
Example: zainBooks’ Inventory DB:
Book title, author, publisher, binding, date of publication, price
Book summary, table of contents
Customers’, editors’, newspaper reviews
Number in stock
Number on order
Special offer details
36.4 OS Independence:
DBMS stores data in a database, which is a collection of interrelated files
Storage of files on the computer is managed by the computer OS’s file system
Intimate knowledge of the OS & its file system is required to provide rapid access to the data
The DBMS takes care of those details
It hides the actual storage details of data files from the user
It provides an OS-independent view of the data to the user, making data manipulation and management
much more convenient
What can be stored in a database?
In the old days, databases were limited to numbers, Booleans, and text
These days, anything goes
As long as it is digital data, it can be stored:
Numbers, Booleans, text
Sounds
Images
Video
In the very, very old days …:
Even large amounts of data were stored in text files, known as flat-file databases
All related info was stored in a single long, tab- or comma-delimited text file
Each group of info – called a record - in that file was separated by a special character; vertical bar ‘|’
was a popular option
Each record consisted of a group of fields, each field containing some distinct data item
[Figure: a flat-file database, with a record, a field, and a record delimiter labeled]
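The record and field structure described above can be made concrete by splitting a flat-file string apart. The sketch assumes comma-delimited fields and ‘|’ as the record delimiter, and uses a shortened sample of the lesson's book data; the function name is made up for this example.

```python
# A short flat-file sample: fields separated by commas, records by '|'.
FLAT_FILE = (
    "Title, Author, Publisher, Price, InStock|"
    "Good Bye Mr. kim, king khan, zainBooks, 1000, Y|"
    "The Terrible Twins, kim Champion, zainBooks, 199, Y|"
)


def parse_flat_file(text, record_delimiter="|", field_delimiter=","):
    """Split flat-file text into records, and each record into its fields."""
    records = []
    for raw_record in text.split(record_delimiter):
        if raw_record.strip():  # skip the empty piece after the last delimiter
            fields = [field.strip() for field in raw_record.split(field_delimiter)]
            records.append(fields)
    return records
```

This simple splitting breaks down if a field itself contains a comma or a vertical bar, which is one more reason to be careful with flat files.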
36.5 The Trouble with Flat-File Databases:
The text file format makes it hard to search for specific information or to create reports that include only certain fields from each record
Reason: One has to search sequentially through the entire file to gather desired info, such as ‘all books
by a certain author’
However, for small sets of data – say, consisting of several tens of kB – they can provide reasonable
performance
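The sequential search can be sketched as follows. To find ‘all books by a certain author’, the program has to examine every record, whether it matches or not. The function name and the record counter are assumptions added for illustration; the data is the lesson's four-book example.

```python
FLAT_FILE = (
    "Title, Author, Publisher, Price, InStock|"
    "Good Bye Mr. kim, king khan, zainBooks, 1000, Y|"
    "The Terrible Twins, kim Champion, zainBooks, 199, Y|"
    "Calculus & Analytical Geometry, Smith Sahib, Good Publishers, 325, N|"
    "Accounting Secrets, Zamin Geoffry, Sung-e-Kilometer Publishers, 29, Y|"
)


def books_by_author(flat_text, author):
    """Scan every record in order; return matching titles and records examined."""
    matches = []
    records_examined = 0
    # The first piece is the header naming the fields, so it is skipped.
    for raw_record in flat_text.split("|")[1:]:
        if not raw_record.strip():
            continue  # empty piece after the final delimiter
        records_examined += 1
        fields = [field.strip() for field in raw_record.split(",")]
        if fields[1] == author:  # the second field holds the author
            matches.append(fields[0])
    return matches, records_examined
```

With four records the cost is negligible, but the number of records examined always equals the size of the file, so the work grows in step with the data.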
Consider this tabular approach …
(same records, same fields, but in a different format)
Title                           Author         Publisher                    Price  InStock
Good Bye Mr. kim                king khan      zainBooks                    1000   Y
The Terrible Twins              kim Champion   zainBooks                    199    Y
Calculus & Analytical Geometry  Smith Sahib    Good Publishers              325    N
Accounting Secrets              Zamin Geoffry  Sung-e-Kilometer Publishers  29     Y
Tabular Storage: Features & Possibilities:
Similar items of data form a column
Fields placed in a particular row – same as a flat-file record – are strongly interrelated
One can sort the table w.r.t. any column
That makes searching – e.g., for all the books written by a certain author – straightforward
The same records in flat-file form (comma-delimited fields, records separated by ‘|’):
Title, Author, Publisher, Price, InStock|
Good Bye Mr. kim, king khan, zainBooks, 1000, Y|
The Terrible Twins, kim Champion, zainBooks, 199, Y|
Calculus & Analytical Geometry, Smith Sahib, Good Publishers, 325, N|
Accounting Secrets, Zamin Geoffry, Sung-e-Kilometer Publishers, 29, Y|
Tabular Storage: Features & Possibilities:
Similarly, searching for the 10 cheapest/most expensive books can be easily accomplished through a sort
Effort required for adding a new field to all the records of a flat-file is much greater than adding a new
column to the table
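The tabular operations mentioned above can be sketched by holding the table as a list of rows, each row mapping a column name to a value. The representation, the function names, and the ‘Language’ column used in the test are assumptions made for this example.

```python
# The lesson's table: one dictionary per row, one key per column.
BOOKS = [
    {"Title": "Good Bye Mr. kim", "Author": "king khan",
     "Publisher": "zainBooks", "Price": 1000, "InStock": "Y"},
    {"Title": "The Terrible Twins", "Author": "kim Champion",
     "Publisher": "zainBooks", "Price": 199, "InStock": "Y"},
    {"Title": "Calculus & Analytical Geometry", "Author": "Smith Sahib",
     "Publisher": "Good Publishers", "Price": 325, "InStock": "N"},
    {"Title": "Accounting Secrets", "Author": "Zamin Geoffry",
     "Publisher": "Sung-e-Kilometer Publishers", "Price": 29, "InStock": "Y"},
]


def sort_by(table, column):
    """Return the rows sorted with respect to any one column."""
    return sorted(table, key=lambda row: row[column])


def cheapest(table, n):
    """The n cheapest books: sort by price, then keep the first n rows."""
    return sort_by(table, "Price")[:n]


def add_column(table, name, default):
    """Return a copy of the table with one extra column in every row."""
    return [{**row, name: default} for row in table]
```

Adding a column is a single uniform step here, whereas in the flat file every record's text would have to be edited to make room for the new field.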
CONCLUSION: Tabular storage is better than flat-file storage
We will continue on this theme next time
Today’s Summary: (Data Management)
First of a two-Lesson sequence
Today we became familiar with the issues and problems related to data-intensive computing
We also found out about flat-file and tabular storage
Next Lecture: (Database SW)
Next time, in our 4th Lesson on productivity SW, we will continue our discussion on data management
We will find out about relational databases
We will also implement a simple relational database