20100125

Data Sources

                When storing data, there are typically three major choices: databases, XML, and custom binary files.  I’ll discuss several pros and cons of each method, along with a fourth option that has been emerging recently, involving living network data.


                Before committing to any one of them as the end-all, be-all solution for your project, you should look into the strengths and weaknesses of each option.  Weigh them carefully: once you select a route, it is often very difficult to change it later.


                Databases are very good at storing lots of information and pulling it back out very quickly.  It is common and simple for most database systems to hold millions of records and to search, sort and update them very quickly.  Most databases use a language called SQL for retrieving, inserting and editing data, which (when taught well) is not difficult to learn.  Databases can also be expanded across additional machines to improve overall performance in larger systems.
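
                As a minimal sketch of the SQL operations mentioned above (inserting, editing and retrieving), here is a small example using Python's built-in sqlite3 module.  The table name, columns and sample rows are made up purely for illustration.

```python
import sqlite3

def demo_sql():
    # An in-memory database keeps the example self-contained.
    conn = sqlite3.connect(":memory:")
    # Hypothetical table, used only for this illustration.
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
    )
    # Inserting data.
    conn.executemany(
        "INSERT INTO users (name, age) VALUES (?, ?)",
        [("Ann", 31), ("Bob", 25), ("Cy", 42)],
    )
    # Editing data.
    conn.execute("UPDATE users SET age = age + 1 WHERE name = ?", ("Bob",))
    # Retrieving data, filtered and sorted by the database itself.
    rows = conn.execute(
        "SELECT name, age FROM users WHERE age > ? ORDER BY age", (25,)
    ).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    print(demo_sql())
```

                The searching and sorting happen inside the database engine, which is where the speed on large record counts comes from.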


                Databases tend to lock the format of your data down.  It becomes difficult and time-consuming to continually change and support databases as they grow, which typically produces a decent amount of administrative work.  Data storage can also be wasteful when fields are fixed-width: if 5% of your users use 3 KB for a description field but the other 95% only use 0.5 KB, every row still requires 3 KB, whether used or not.


                XML is very flexible about data formats.  If you are constantly updating and changing data, XML can provide a solid system to work with.  If you format your changes well, saved files can be forward and backward compatible with different versions of your application, without needing additional code to support it.  XML is human-readable, which means a plain text editor such as Notepad can easily read and edit the data.  Not only is it human-readable, it is also descriptive: just by looking at an XML file you can determine what the data is and how it is used, making it easy to edit files from programs you didn’t write.  XML is also easily transferred over the Internet, saved locally on any machine, and supported quite well on every reasonable OS and programming language.
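
                To illustrate the forward and backward compatibility described above, here is a small sketch using Python's standard xml.etree.ElementTree module.  The element names (settings, name, volume, theme) and the default values are hypothetical; the point is only that a reader which ignores unknown elements and supplies defaults for missing ones can load both older and newer files.

```python
import xml.etree.ElementTree as ET

# A file written by an older version of a hypothetical application.
OLD_FILE = "<settings><name>demo</name></settings>"
# A file written by a newer version, with elements this reader has never seen.
NEW_FILE = (
    "<settings><name>demo</name><volume>7</volume>"
    "<theme>dark</theme></settings>"
)

def load_settings(xml_text):
    root = ET.fromstring(xml_text)
    return {
        # Missing elements fall back to a default (backward compatible).
        "name": root.findtext("name", default="untitled"),
        "volume": int(root.findtext("volume", default="5")),
        # Unknown elements such as <theme> are simply ignored
        # (forward compatible).
    }

if __name__ == "__main__":
    print(load_settings(OLD_FILE))
    print(load_settings(NEW_FILE))
```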


                XML is not intended for large amounts of data.  When reading it, the whole document typically has to be parsed and held in memory.  Because of this constraint, it is often neither easy nor fast to search XML.  XML is stored as plain text, which also means it takes up more space.  Since field lengths can be anything, your program has to be written to expect any length.  Also, data types like numbers and Boolean values are stored as strings, so they have to be explicitly converted.  XML is also a poor choice for storing and loading binary data, like images, sound files or movies.
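
                The conversion step mentioned above can be sketched as follows, again in Python with xml.etree.ElementTree.  The element names and the accepted Boolean spellings are assumptions made for this example, not part of any standard schema.

```python
import xml.etree.ElementTree as ET

def parse_bool(text):
    # Everything in XML arrives as a string.  Note that bool("false")
    # is True in Python, so the text has to be compared explicitly.
    value = text.strip().lower()
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    raise ValueError(f"not a Boolean value: {text!r}")

def load_item(xml_text):
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("count")),      # string -> integer
        "price": float(root.findtext("price")),    # string -> float
        "active": parse_bool(root.findtext("active")),
    }

if __name__ == "__main__":
    sample = (
        "<item><count>12</count><price>9.5</price>"
        "<active>false</active></item>"
    )
    print(load_item(sample))
```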


                Custom binary files have been used for a long time.  They (typically) store the data in the smallest amount of space it needs.  They are also harder to use outside of your own applications, which is often helpful for protecting local data.  You control the format entirely and can decide how best to shape it, making some portions searchable like a database and other portions flexible like XML.  When storing large amounts of data that will only be used by your application, this is a common practice.
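
                As a rough sketch of the compact, database-like side of a custom binary format, the example below uses Python's struct module to store fixed-size records.  The record layout (an id, a score and a flags byte) is invented for illustration.  Because every record has the same size, any record can be located by its index without reading the rest of the data.

```python
import struct

# Hypothetical record layout: little-endian, no padding.
#   I = unsigned 32-bit id, h = signed 16-bit score, B = 8-bit flags
RECORD = struct.Struct("<IhB")  # 7 bytes per record

def pack_records(records):
    # Records are written back to back with no delimiters.
    return b"".join(RECORD.pack(*record) for record in records)

def read_record(blob, index):
    # Fixed-size records allow direct access by offset.
    return RECORD.unpack_from(blob, index * RECORD.size)

if __name__ == "__main__":
    blob = pack_records([(1, -5, 0), (2, 300, 1)])
    print(len(blob), read_record(blob, 1))
```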


                Custom binary files lock you into a format.  Any changes you make tend to require a decent amount of code to convert from one version to another.  The files are difficult to read, and if bugs cause misplaced data or some other accident in writing, the entire file can be ruined.  They do not allow your data to be easily read by other systems, which could be a negative depending on your users’ needs and interests.  They also require more design time, because you are not only deciding the format of the data but also writing all the tools and code for handling it.
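
                One common way to keep that version-to-version code manageable is to start the file with a small header containing a format version.  The sketch below, again using Python's struct module, assumes a made-up "DEMO" format with two versions; the field names and the default weight are hypothetical.  Note that every new version still needs its own branch in the reader, which is exactly the cost described above.

```python
import struct

MAGIC = b"DEMO"                  # hypothetical file signature
HEADER = struct.Struct("<4sH")   # signature + 16-bit format version
BODY_V1 = struct.Struct("<I")    # version 1: id only
BODY_V2 = struct.Struct("<If")   # version 2: id + 32-bit float weight

def write_v1(record_id):
    return HEADER.pack(MAGIC, 1) + BODY_V1.pack(record_id)

def write_v2(record_id, weight):
    return HEADER.pack(MAGIC, 2) + BODY_V2.pack(record_id, weight)

def read_any(blob):
    magic, version = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a DEMO file")
    if version == 1:
        (record_id,) = BODY_V1.unpack_from(blob, HEADER.size)
        # Older files have no weight, so an assumed default is supplied.
        return {"id": record_id, "weight": 1.0}
    if version == 2:
        record_id, weight = BODY_V2.unpack_from(blob, HEADER.size)
        return {"id": record_id, "weight": weight}
    raise ValueError(f"unsupported version: {version}")

if __name__ == "__main__":
    print(read_any(write_v1(7)))
    print(read_any(write_v2(7, 2.5)))
```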


                A fourth option, which I referred to as “Living Networks”, is demonstrated quite well by a system called “Terracotta”.  The reason I selected a particular brand name is that it is a new system (and an old one, as similar concepts have been around for a while) for which I have not yet seen a valid competitor.  It manages memory across multiple networked servers.  It has proven faster than the best databases when the data is designed for it.  It is also flexible like XML, and by its nature provides redundancy and load-balanced data access.  It is a system that I look forward to seeing become a new standard for data storage and retrieval in the near future.


                Terracotta only supports Java, leaving C++, C#, VB, PHP and other languages out.  Because of that lack of interoperability, selecting it forces your data to be accessed through Java/J2EE alone.  Your data has now limited your application choices, which is not very sustainable in mixed environments.  It is not intended for home computing and should not be considered a solution for local data.


                Each of these four data solutions targets a different audience.  Even within these systems, we are not very technically mature.  It should be noted that every one of these areas (XML being an exception) is coming out with new and better ways of storing information, and will continue to do so for a while.  While one solution might be optimal at the moment, improvements in the others may make one of them your best choice a year later.
