The modern business world is, more than anything else, data-driven. We simply can’t overestimate how significant the proper use of data is in making any type of organisation successful. However, data scientists, as well as anyone else using data, often have to spend an enormous amount of time finding, organising, or cleaning data. Furthermore, trying to figure out whether the given set of data is credible and can be used for business purposes is also usually very tedious and time-consuming.
This leaves very little time for actual data analysis which is what brings value to the organisation. The most useful tool in solving this problem is a data dictionary, which helps people understand data better and have information of its quality, often at just a glance.
Table of Contents
What is a Data Dictionary?
A data dictionary is a file or a set of files that contain information about objects in the database. It’s basically a repository of data names, definitions, and attributes used to describe the data. We can say its data about data. This is done by describing data columns based on common traits within another table.
The data dictionary may include information on data types, the meaning of data elements, allowable values, data ownership, security restriction, and details of the structure. These repositories containing metadata are crucial for maintaining the structure of an underlying database and communicating the information it contains. In addition, a data dictionary provides more useful information on relationships between different database tables, prevents data redundancy, and makes data well organised and easily searchable.
It allows users to understand even the most complex databases without the need to investigate every column. With a data dictionary, the business stakeholders’ requirements can be more easily communicated, so that technical personnel can design a database or data structure that answers those demands. Data dictionaries ensure that everyone in the organisation is on the same page when it comes to metrics and key definitions used in the company.
Types of Data Dictionaries
Data dictionaries are not all the same as they may serve different purposes. Also, elements within a data dictionary can interact with users in different ways. The basic classification divides data dictionaries into two major categories – active and passive data dictionaries.
Active Data Dictionary
Active data dictionaries are built within database management systems (DBMS). They automatically reflect any change or modification within the host database. This means that if a user makes any change in the database, the change will automatically take place in the data dictionary, too. Most professional organisations have some sort of database management system, as it helps them avoid any potential discrepancy between the database structure and data dictionary.
Passive Data Dictionary
This type of data dictionary is a bit more complicated than the active ones and can take many forms. Simply said, this type of data dictionary doesn’t automatically update with every change in the host database. They’re built as new databases to store data dictionary information but are separate from the databases they are used to describe. They require more work to stay in sync with the host database and should be handled very carefully to avoid any potential discrepancies.
This type of data dictionary is most commonly created as a spreadsheet in programs such as excel. While these passive data dictionaries are not built to automate database-to-data dictionary encoding, the well-built and structured Excel spreadsheet can do this automatically. Other ways to create passive data dictionaries are by using data catalogs, data integration/extract transform load (ETL) metadata repositories, or data modelling tools.
Why Use a Data Dictionary?
Companies all around the world are making huge investments in data initiatives. The data-driven business model is more and more taking the centre stage when it comes to core operational principles. Still, this doesn’t mean that the organisations and individuals within them can always access or share the data easily. The collected amount of data, no matter how huge it can be, is rather useless if the organisation can’t efficiently interpret and analyse it.
The importance of data dictionaries and the reason for their use lies in the convenience and ease they provide when it comes too managing and analysing the data. They are crucial for teams that use and share a large amount of data on the daily basis. This is particularly true for big and complex organisations that often have multiple teams working on the same project.
With the help of a data dictionary, these teams don’t have to depend on data scientists and database managers to learn about data and extract the needed information from the database. Each team and its individual members can use the data elements of the same quality, meaning, and relevance. Using a data dictionary, it’s easy to establish conventions and maintain consistency within the whole dataset used on the project.
Without it, the risk of losing or misunderstanding a certain piece of information significantly increases. The ability of any team to quickly and easily access the database and interpret it speeds up the decision-making process throughout the organisation.
The Key Elements of a Data Dictionary
The data dictionary provides users with information on each attribute for the data element. The attributes, or fields, are places within a database containing the information. Commonly, data dictionaries are organised in spreadsheet format and each attribute is listed as a spreadsheet row, while columns contain elements of information on that attribute. The elements may vary from one data dictionary to another, but some key ones appear in almost any data dictionary. Below are the most common elements that are included in the data dictionary.
- Attribute name – or attribute ID is a unique identifier, labelling each attribute. It should be descriptive and based on what the attribute is commonly called by most users.
- Alias – these are synonyms or other names of the element. Basically, aliases are different names used by different users for the same element.
- Optional/Required – this element indicates if the specific information is required for an attribute before saving the record.
- Attribute type – determines which type of data is allowed in the related field. These types of data may include numeric, date, alphabet, alphanumeric, enumerated list, varchar, booleans, or others. The attribute types mainly depend on how the data will be used.
- Length – this element defines the length of the attribute. Some lengths are standard, such as US state abbreviations or zip codes. Other may vary, and, in those cases, length may vary and then must be jointly determined by the data analysts and users.
- Input/Output format – these elements are determined using the special coding characters, indicating how the data should be presented.
Besides these core elements, the data dictionary may include various others, such as default value, whether the element is based or derived, validation criteria, physical information on where the data is stored, or source of information.
Example of a Data Dictionary
The best way to understand how a data dictionary works is through a practical example. We’ll take an example of a database a commercial bank may use to store the data about its clients. This database may contain various attributes, such as the name of the client, address, phone number, age, gender, education, financial savings, loans, and many others. They’re each stored in a separate column.
In the data dictionary, the columns will be turned “sideways”, making them into rows in the data dictionary spreadsheet. The precise tables that the data dictionary is made of may depend on the DBMS (Database Management System). The data dictionary linchpin is the document that contains all the data dictionary tables (commonly named sys catalog). The names of the tables should be descriptive and precise enough to enable an easy guess of what each table may contain.
In our example, each attribute now occupies a row in a data dictionary spreadsheet. So, we have a row for each of the above-mentioned attributes. We’ll concentrate on one of them, for example, the hypothetical “client’s age”. It will have its own row with various columns providing elements that further describe this attribute. The first element will likely be the attribute name which is, logically, “client’s age”. The next element may provide the definition of the attribute – “age of clients” for example. The element describing the attribute format will simply be “number”.
As these entries will have the value of or higher, the data type element will be “integer’. Following this pattern, the data dictionary may include various other elements for “client’s age” and other attributes. It may include the element defining whether the attribute is required or not, possible values, maximum length or field size, the location where that specific data is stored, and numerous others.
Advantages and Disadvantages of Data Dictionaries
Using a data dictionary provides various and numerous benefits to the organisation. It delivers data integrity by promoting adoption and use of consistent data elements which make databases easier to use and more understandable to everyone working on a specific project. Organisations that use data dictionaries can use the data in a more reliable, dependable, and trustworthy way. However, the adoption of data dictionaries also has a couple of drawbacks that organisations must be aware of.
Below are the major pros and cons of the use of data dictionaries.
Advantages:
- Improves data quality.
- Provides an organised, comprehensive, and easily searchable list of data.
- Improves control and documentation.
- Allows for documentation and reporting and maintains data integrity across multiple programs.
- Helps communicate the meaning of different procedures, elements, and terms.
- Provides consistency in data use.
- Simplifies the system data requirements structure.
- Reduces data redundancy.
- Provides easy reuse of data.
- Makes data analysis easier and simpler.
- Allows easy and consistent enforcement of standards.
- Provides information relationship between different database tables.
- Improves decision-making thanks to better data.
- Provides easier programming and simplifies the software design process.
- Eases the location of errors in the system descriptions.
- Detects anomalies more quickly.
- Reduce the potential risk of downstream errors.
- Assures smoother database upgrades.
- Helps build transparency within data teams.
Disadvantages:
- Creating and organising data dictionaries can be very time consuming and tedious.
- The functional details are not provided.
- Can be difficult to use and hard to understand for less technical-savvy users.
How to Create a Data Dictionary
Before creating data dictionaries, there are a couple of questions that need to be answered. Whoever is creating a data dictionary needs to define what each of the variables (attributes, elements, fields) stands for and how it’s being collected and measured or calculated.
Furthermore, there has to be an agreement on what tests need to be performed to ensure that the data is reliable and trustworthy. In addition, the organisation has to clearly designate and determine who can collect, interact with, manage, and change data and who will monitor those processes. It’s best to build a data dictionary simultaneously with modelling the data as it makes all of the previously mentioned tasks much easier.
The data dictionary can be created manually or automatically. Most of the database management systems and information systems created by computer-aided software engineering have built-in active data dictionaries. The Analyser tool for Microsoft Access, used for database analysis and documentation, can be used as a tool for creating a data dictionary from Access-based or Access-connected data. Another way to generate documentation is with SQL, Oracle, Server, or MySQL. When building passive data dictionaries, they need to be created outside the database management system as they’re not managed by DMBS.
In case you can’t automatically generate machine-readable DD, you’ll have to build it manually, submitting it from a single source as a spreadsheet. This is best done in Excel and there are plenty of online templates available for this type of data dictionary creation.
Either way, creating a data dictionary involves a couple of necessary steps. You’ll need to make each field or data column in the database table and list it as a data dictionary row. In addition, you’ll have to define each field in the data dictionary and determine what it is. Finally, you’ll need to either compile the source data into the data dictionary using the DBMS software or use Excel or other software to build out the logic in the spreadsheet.
Data Dictionary Template
Business Analysis Excellence has a business analysis and agile template package that includes a data dictionary template to capture the data elements that is required for a new solution or it can be used to record the meaning of data elements in an existing system.
Free Data Modelling Training
You can get free data modelling training at Bridging the Gap.
Some readers may also be interested in a data modelling for business analyst training.