The lesson DNA sequencers teach us about string types
Strings have existed in computer science since its beginning. Defined by Wikipedia as a sequence of symbols, they govern any data that is written down.
Did you know that, because a famous piece of software handles almost everything as a string, scientists had to rename their discoveries?
Anything is data. Or seems to be.
I started out as a tech lead with this strict guideline:
Don't suffix a name with `-data`. Anything is data. Drop this suffix.
As usual, when there is an issue with naming something, it often means something is wrong. The DIKW pyramid classifies data as facts, measurements, or representations of observations.
So, I adjusted my guideline:
If you suffix something with `-data`, you're expressing your intent that you do not care about interpreting said thing. It's opaque!
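To make the guideline concrete, here is a minimal sketch of what "opaque" can look like in code. The `UploadData` class and its `writeTo` method are hypothetical names chosen for illustration; the point is only that a thing named `-data` exposes nothing that invites interpretation.

```javascript
// Hypothetical wrapper: the name says "data", so the content stays opaque.
class UploadData {
  #bytes; // private field: code outside the class cannot read or interpret it

  constructor(bytes) {
    this.#bytes = Uint8Array.from(bytes);
  }

  // The size is a fact about the data, not an interpretation of its content.
  get byteLength() {
    return this.#bytes.byteLength;
  }

  // The only operation offered: hand a copy of the bytes to a sink.
  writeTo(sink) {
    sink(this.#bytes.slice());
  }
}

const upload = new UploadData([0x89, 0x50, 0x4e, 0x47]);
let written = null;
upload.writeTo((chunk) => {
  written = chunk;
});

console.log(upload.byteLength); // 4
console.log(upload.bytes); // undefined
```

If you ever feel the need to add a getter that parses the content, that is a hint that the thing is no longer data and deserves a better name.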
Strings can be a datum, a piece of information, or even knowledge
Level | Definition | Example |
---|---|---|
Wisdom | Brings value by answering a "why" or "what to do" question | Instructions given to the users that grant them the ability to open the file |
Knowledge | Something acquired by cross-referencing pieces of information | The list of names of the users who can access the file |
Information | A description/interpretation of facts | The size of the file described to a human |
Data | A fact | File contents that need to be written to disk |
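The file example from the table can be sketched in code. This is only an illustration under assumptions: `describeSize`, `usersWhoCanAccess`, and the shape of the `users` and `filePermissions` objects are made up for this post.

```javascript
// Data: raw file contents. We move these bytes around without interpreting them.
const fileContents = new Uint8Array(1536);

// Information: an interpretation of a fact, written for a human reader.
function describeSize(byteCount) {
  if (byteCount < 1024) return `${byteCount} B`;
  return `${(byteCount / 1024).toFixed(1)} KiB`;
}

// Knowledge: obtained by cross-referencing several pieces of information.
function usersWhoCanAccess(users, filePermissions) {
  return users
    .filter((user) => filePermissions.readers.includes(user.id))
    .map((user) => user.name);
}

const users = [
  { id: 1, name: 'Ada' },
  { id: 2, name: 'Grace' },
];
const filePermissions = { readers: [2] };

console.log(describeSize(fileContents.byteLength)); // "1.5 KiB"
console.log(usersWhoCanAccess(users, filePermissions)); // [ 'Grace' ]
```

All three results could end up stored in a string, yet they sit on three different levels of the pyramid.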
With this classification, you are forced to think about the value of your product. Your company responds to societal needs. These needs create problems in your users' lives. You design applications that help these users through use-cases.
In DDD terms, try to keep wisdom at the center of your domain, and push the lower levels (data, information) toward its edges.
Level | Space | Example | Recommended level for handled things |
---|---|---|---|
Societal need | Problem space | Humans need to take care of their health | Wisdom |
Problem | Problem space | There are not enough GPs | Knowledge |
Use-case | Solution space | As a patient, I want to find a GP near me | Information |
Service | Solution space | locate-medical-centers(position,range) | Data |
Avoid having to handle Information and Data within the Problem space: if you want to improve your problem space, ask your business intelligence analyst to give you wisdom/knowledge from these.
As usual, you should make sure that your specific application layers do not depend on generic concepts. You can, however, make a service depend on a specific knowledge or wisdom!
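As a rough sketch of the layering in the table above, the service below only returns facts, and the use-case turns them into information for the patient. The in-memory `medicalCenters` list, the flat `x`/`y` coordinates in kilometres, and the `findGpNearMe` function are assumptions made for the example; only the `locate-medical-centers(position,range)` signature comes from the table.

```javascript
// Stand-in for a real data source (assumption): positions are in km on a flat grid.
const medicalCenters = [
  { name: 'Center A', x: 1, y: 1 },
  { name: 'Center B', x: 6, y: 8 },
];

// Service (data level): returns facts and does not interpret them.
function locateMedicalCenters(position, rangeKm) {
  return medicalCenters.filter(
    (center) => Math.hypot(center.x - position.x, center.y - position.y) <= rangeKm
  );
}

// Use-case (information level): "As a patient, I want to find a GP near me".
function findGpNearMe(patientPosition) {
  const rangeKm = 5;
  const centers = locateMedicalCenters(patientPosition, rangeKm);
  return centers.length === 0
    ? `No GP found within ${rangeKm} km.`
    : `GPs within ${rangeKm} km: ${centers.map((center) => center.name).join(', ')}`;
}

console.log(findGpNearMe({ x: 0, y: 0 })); // "GPs within 5 km: Center A"
```

The use-case depends on the service, never the other way around: the service knows nothing about patients or about what "near me" means.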
From data to strings
Okay, enough about data already! Pull me some strings!
Strings can sit at any level of the DIKW framework. As the tech lead of your team, your role is to make sure that product value structures your software.
Since a string is defined as a sequence of symbols, it overlaps with data structures such as `ArrayBuffer`, `Uint8Array`, `std::string`, not forgetting `const char*` or the F#/C# `Span<>`. Developers suffer from an embarrassment of choice.
What is the intent? Do you want your thing to be a fact, or something whose interpretation brings value?
Let's start with a story from the world of genetics.
Scientists had to rename some genes... because of Excel
Giving names to things is a difficult process. In the case of genetics, well, if you have a gene called Membrane Associated Ring-CH-Type 1, what short name would you give to it? Imagine typing it into Excel. You probably thought of the same name the HGNC did: MARCH1.
There are plenty of "incorrectly assuming something is a date" memes around; I will let you google them. But this has affected DNA studies, and people have had to come up with workarounds.
So, what is the issue? What is the link between Excel cells, strings and data? A question of level and intent.
On which DIKW level do you think a cell is? Well... All of them!
Make sure the architecture enforces product intent
As a tech lead, your role is to ensure the safety of your architecture. To do that, you will probably use tools such as compilers/transpilers and type-checking mechanisms. Leverage them. Acknowledge that, as the scientists found out, mistakes will be made. Someone will forget to add quotes around "March1" and the system will silently explode.
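Here is a toy sketch of that failure mode and of one way to guard against it. `guessCellValue` is a deliberately naive imitation of type guessing, not Excel's actual algorithm, and the `GeneSymbol` class with its simplified validation pattern is a hypothetical value object invented for this example.

```javascript
// Toy imitation of a spreadsheet guessing a cell's type (NOT Excel's real rules).
const MONTH_PREFIXES = new Map([
  ['jan', 1], ['feb', 2], ['mar', 3], ['march', 3], ['apr', 4],
  ['sep', 9], ['sept', 9], ['oct', 10], ['nov', 11], ['dec', 12],
]);

function guessCellValue(raw) {
  const match = /^([a-z]+)(\d{1,2})$/i.exec(raw);
  const month = match ? MONTH_PREFIXES.get(match[1].toLowerCase()) : undefined;
  if (month !== undefined) {
    return { kind: 'date', month, day: Number(match[2]) };
  }
  return { kind: 'text', value: raw };
}

// Value object: states the intent that this string is a gene symbol, never a date.
// The pattern is a simplification made for the example.
class GeneSymbol {
  constructor(symbol) {
    if (!/^[A-Z][A-Z0-9-]*$/.test(symbol)) {
      throw new TypeError(`Not a gene symbol: ${symbol}`);
    }
    this.symbol = symbol;
    Object.freeze(this);
  }
}

// Known intent wins over guessing.
function writeCell(value) {
  return value instanceof GeneSymbol
    ? { kind: 'text', value: value.symbol }
    : guessCellValue(value);
}

console.log(guessCellValue('MARCH1').kind); // "date"
console.log(writeCell(new GeneSymbol('MARCH1')).kind); // "text"
```

With a type checker on top, passing a bare string where a `GeneSymbol` is expected becomes a build error instead of a silent explosion.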
Usual data-types to represent sequential data
- He said data types!
- Yes, I did! Using data in `data types` is fine because, in the context of this description, their content itself does not matter. If I start thinking about the content, it becomes a value object or an entity.
Use-case | Type of sequence | Example types | Notes |
---|---|---|---|
Handling (getting, displaying) user text input | characters | `string`, `std::string` | Users can type anything their keyboard allows, including emoji or kanji. |
Handling (receiving, sending) binary data | bytes/integers | `ArrayBuffer`, Rust `Box<[u8]>`, C# `byte[]`, C `char*` | The data may not be printable. It can be full of zeros or bytes without a Unicode representation. |
Handling (receiving, sending) binary data over text | characters | Binary-to-text encoding such as base N inside a string | The most famous one is `base64`, which gives a textual representation of a buffer. Another famous one is base16, aka hexadecimal. |
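The last two rows can be illustrated with a short sketch. The helper names `toHex`, `fromHex`, `toBase64`, and `fromBase64` are made up for this example; `TextEncoder`, `TextDecoder`, `btoa`, and `atob` are standard globals in browsers and in recent Node.js versions.

```javascript
// Three bytes that are not valid UTF-8 text.
const bytes = new Uint8Array([0xff, 0x00, 0x80]);

// Forcing binary data into a string loses information:
// the invalid bytes are replaced by U+FFFD, so the round trip does not give the bytes back.
const asText = new TextDecoder().decode(bytes);
const backToBytes = new TextEncoder().encode(asText);
console.log(backToBytes.length); // 7

// Base16 (hexadecimal): two characters per byte.
function toHex(buffer) {
  return Array.from(buffer, (byte) => byte.toString(16).padStart(2, '0')).join('');
}
function fromHex(hex) {
  return Uint8Array.from(hex.match(/../g) ?? [], (pair) => parseInt(pair, 16));
}

// Base64: four characters per group of three bytes.
function toBase64(buffer) {
  return btoa(String.fromCharCode(...buffer));
}
function fromBase64(text) {
  return Uint8Array.from(atob(text), (char) => char.charCodeAt(0));
}

console.log(toHex(bytes)); // "ff0080"
console.log(toBase64(bytes)); // "/wCA"
```

Spreading a buffer into `String.fromCharCode` is fine for small buffers like this one; for large ones you would process the bytes in chunks.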
Binary sequences
A long time ago at my university, I used C `char *` to handle strings. This worked well because, at the time, we mostly used ASCII or ISO-8859-1, an encoding that was popular in France. I coded algorithms such as:
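    #include <string.h>
    #include <unistd.h>

    // Returns a pointer just past the first comma, or NULL when there is none.
    const char *findWordAfterComma(const char *input)
    {
        for (const char *p = input; *p; p++)
        {
            if (*p == ',')
            {
                return p + 1;
            }
        }
        return NULL;
    }

    int main(void)
    {
        // print to screen (file descriptor 1 is stdout)
        const char *word = findWordAfterComma("hello,world");
        if (word != NULL)
        {
            write(1, word, strlen(word));
        }
        return 0;
    }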
If you do not see the issue here, well... it walks the input one `char` at a time, assuming that every single character takes one byte. Do not do this. Modern strings are encoded in memory using UTF-8, UTF-16, or other formats, in which a single character can span several bytes. Try this:
    console.log('🤓'.length); // 2: UTF-16 code units
    console.log('你好'.length); // 2: UTF-16 code units
    console.log(new TextEncoder().encode('你好').length); // 6: UTF-8 bytes
You will notice that some character sequences have different lengths depending on whether you represent them as strings or in binary-encoded form.
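The same comma search can be sketched on a string, which shows why positions in a string and positions in its encoded bytes must not be mixed. This `findWordAfterComma` is a hypothetical JavaScript counterpart of the C function, written for illustration.

```javascript
// Works on the string itself, so the encoding is not our concern.
function findWordAfterComma(input) {
  const index = input.indexOf(',');
  return index === -1 ? null : input.slice(index + 1);
}

const input = '你好,🤓';

console.log(findWordAfterComma(input)); // "🤓"

// The comma sits at a different position in each representation.
console.log(input.indexOf(',')); // 2 (UTF-16 code units)
console.log(new TextEncoder().encode(input).indexOf(0x2c)); // 6 (UTF-8 bytes)

// Iterating a string walks code points, not code units.
console.log([...'🤓'].length); // 1
```

An offset computed on the bytes is meaningless on the string, and the other way around.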
The lesson of the story
Strings are for things you want to display, not for binary data.
If you call something data, do not handle its content!