You have been asked to produce the architecture for a new application with the following requirements:
Every morning at 8 a.m., the new application will ingest a CSV file from a specific location, validate its content, and transform it into an XML file before pushing it to another location, where it will be archived for future use by another application.
The generated XML file will contain descriptions (in both English and French) identified by codes provided in the source CSV file (the source file doesn’t contain the descriptions themselves, only the codes that identify them).
That is all you have been given as requirements. Your manager comes to ask what other information you need to produce the architecture quickly, as this is a high-priority project.
There are no right or wrong answers, but I invite you to think about some important questions that will help you build your architecture.
Indeed, in addition to understanding the functional requirements, you need to focus on the architecture characteristics (non-functional requirements; see my blog) that will drive your architectural decisions.
Some of these characteristics are explicitly mentioned in the requirements, while others are implicit or missing altogether. Our job as architects is to identify them.
The following are examples of questions you need answered before you can propose a solution.
- Are there options for pulling the data other than the CSV extract? Does this extract file already exist? Can we influence the team (or the company) to consider other options?
You need to dig deeper to find out whether a requirement is a “real” requirement or rather a solution to an underlying problem. In this case, it is important to understand whether the CSV extract is a real constraint imposed by the context (e.g., the file is produced by a legacy application that offers no alternative) or a solution proposed by the business line as an easy workaround for pulling the data from the source system.
It is our role to challenge the requirements and propose alternatives better suited to pulling the data, such as queues or APIs in this case.
- Another question relates to security: how sensitive is the data in the file? Where is this file stored? Who can access it?
Let’s suppose we are told that the data in the extract is sensitive and should be accessible to only a very limited number of employees. This means the CSV files must be kept in a secure location until the application consumes them (e.g., in a folder on an SFTP account rather than on a network drive accessible to everyone).
- The next question is related to the size of the ingested CSV file.
In addition to the average size, the maximum size and the expected growth are crucial pieces of information. They help you decide, for example, whether all the data can be loaded into memory and processed at once, or whether you need to consider other options such as streaming the file or running several instances of the application in parallel.
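To make the memory question concrete, here is a minimal sketch of the streaming option: each CSV row is written to the XML output as soon as it is read, so memory use does not depend on the file size. The function name and the element names `records` and `record` are assumptions made for illustration, and the sketch assumes the CSV header names are valid XML element names and that every row has the same number of columns as the header.

```python
import csv
import io
from xml.sax.saxutils import XMLGenerator


def stream_csv_to_xml(csv_file, xml_file):
    """Convert CSV rows to XML one at a time and return the number of rows written."""
    reader = csv.DictReader(csv_file)
    writer = XMLGenerator(xml_file, encoding="utf-8")
    writer.startDocument()
    writer.startElement("records", {})  # hypothetical root element name
    count = 0
    for row in reader:  # only the current row is held in memory
        writer.startElement("record", {})  # hypothetical row element name
        for column, value in row.items():
            writer.startElement(column, {})
            writer.characters(value or "")  # special characters are escaped for us
            writer.endElement(column)
        writer.endElement("record")
        count += 1
    writer.endElement("records")
    writer.endDocument()
    return count
```

In a real run you would pass file objects opened on the source and target locations instead of in-memory buffers. If the files turn out to be small and are expected to stay small, loading everything into memory is simpler and perfectly acceptable.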
- Is there a need to archive the data in the ingested CSV file?
The answer to this question may have a direct impact on the decision to use a database. Indeed, if the data is simply transformed, nothing is saved during the transformation, and there is no requirement to archive it, why would we need a database in the first place? Yes, there may be a need to keep track of some information, such as the processing time or the names of the CSV files. In that case, this information can simply be logged in a plain text file.
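As an illustration of the "plain text file instead of a database" option, the sketch below appends one line per processed file. The function names, the record layout, and the file names are all assumptions chosen for the example, not part of the requirements.

```python
from datetime import datetime


def format_run_record(file_name, row_count, started, finished):
    """Build a one-line summary of a processing run."""
    duration = (finished - started).total_seconds()
    return (
        f"{started.isoformat()} file={file_name} "
        f"rows={row_count} duration_s={duration:.1f}"
    )


def append_run_record(log_path, record):
    """Append the summary line to a plain text log file."""
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(record + "\n")
```

For example, calling `append_run_record("runs.log", format_run_record("extract.csv", 1500, start, end))` after each run leaves a history that can be read with any text editor. If reporting or querying needs appear later, that is the moment to revisit the database decision.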
- What about localization?
From the requirements above, it is clear that localization matters, since you need to add English and French descriptions to the generated files. The next question is where to store the multi-language descriptions. If no database is needed for the other requirements, keeping these descriptions in resource bundle files makes more sense.
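To show what the resource bundle option could look like, here is a minimal sketch that assumes one file per language (for example `descriptions_en.properties` and `descriptions_fr.properties`, both hypothetical names), each containing `code=description` lines. The function names are also assumptions.

```python
def load_bundle(lines):
    """Parse 'code=description' lines, skipping blank lines and # comments."""
    bundle = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        code, _, description = line.partition("=")
        bundle[code.strip()] = description.strip()
    return bundle


def describe(code, bundles):
    """Return {language: description} for a code.

    Raises KeyError if any language is missing the code, so that an
    incomplete translation is caught during validation.
    """
    return {language: bundle[code] for language, bundle in bundles.items()}
```

Each bundle can be loaded once at startup, for instance with `load_bundle(open("descriptions_fr.properties", encoding="utf-8"))`, and then queried for every code found in the CSV file. Whether a missing description should reject the whole file or only the affected row is one more question to ask the business.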