The lesson DNA sequencers teach about string types

Strings have existed in computer science since its beginning. Wikipedia defines a string as a sequence of symbols, and strings govern any data that is written down.

Did you know that, because a famous piece of software handles almost everything as a string, scientists had to rename their discoveries?

Anything is data. Or seems to be.

When I started as a tech lead, I had this strict guideline:

Don't suffix a name with -data. Anything is data. Drop this suffix.

As usual, when there is an issue with naming something, it often means something is wrong. The DIKW pyramid classifies data as facts, measurements, or representations of observations.

So, I adjusted my guideline:

If you suffix something with -data, you're expressing your intent that you do not care about interpreting said thing. It's opaque!

A string can be a datum, a piece of information, or even knowledge.

| Level | Definition | Example |
| --- | --- | --- |
| Wisdom | Brings value by answering a "why" or "what to do" question | Instructions given to users that enable them to open the file |
| Knowledge | Something acquired by cross-referencing information | The list of names of the users who can access the file |
| Information | A description or interpretation of facts | The size of the file, described to a human |
| Data | A fact | File contents that need to be written to disk |
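To make the four levels concrete, here is a minimal sketch of the file example in JavaScript. Every name in it (`fileContents`, `describeSize`, `accessList`, `usersWhoCanAccess`, `howToOpen`) is hypothetical and only there for illustration; the point is that each level gets a different shape in code.

```javascript
// Data: a fact. Opaque bytes that are stored, never interpreted here.
const fileContents = new Uint8Array(2048);

// Information: an interpretation of a fact, written for a human.
function describeSize(byteCount) {
  if (byteCount < 1024) return `${byteCount} B`;
  return `${(byteCount / 1024).toFixed(1)} KiB`;
}

// Knowledge: obtained by cross-referencing information.
// The access list below is made-up sample data.
const accessList = [
  { user: 'alice', canRead: true },
  { user: 'bob', canRead: false },
];

function usersWhoCanAccess(acl) {
  return acl.filter((entry) => entry.canRead).map((entry) => entry.user);
}

// Wisdom: answers "what do I do now?" for a given user.
function howToOpen(user, allowedUsers) {
  return allowedUsers.includes(user)
    ? `${user}, double-click the file to open it.`
    : `${user}, ask an owner for access first.`;
}

const allowedUsers = usersWhoCanAccess(accessList);
console.log(describeSize(fileContents.byteLength));
console.log(allowedUsers);
console.log(howToOpen('bob', allowedUsers));
```

Notice that only the lowest level is a raw buffer. The higher the level, the more the value lives in the interpretation rather than in the bytes.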

With this classification, you are forced to think about the value of your product. Your company responds to societal needs. These needs create problems in your users' lives. You design applications that help these users through use cases.

In DDD terms, try to keep wisdom at the center of your domain, and push the lower levels (data, information) toward its edges.

| Level | Space | Example | Recommended level for handled things |
| --- | --- | --- | --- |
| Societal need | Problem space | Humans need to take care of their health | Wisdom |
| Problem | Problem space | There are not enough GP doctors | Knowledge |
| Use case | Solution space | As a patient, I want to find a GP near me | Information |
| Service | Solution space | locate-medical-centers(position,range) | Data |

Avoid having to handle Information and Data within the Problem space: if you want to improve your problem space, ask your business intelligence analyst to give you wisdom/knowledge from these.

As usual, you should make sure that your specific application layers do not depend on generic concepts. You can, however, make a service depend on specific knowledge or wisdom!
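One possible reading of this guideline, sketched in JavaScript: the use case receives the lookup it needs as a parameter, so the specific layer never imports the generic service directly. The registry contents, the flat kilometre grid, the 5 km range and the names `CENTERS`, `locateMedicalCenters` and `makeFindGpNearMe` are all assumptions made for the example.

```javascript
// Hypothetical registry; coordinates are kilometres on a flat grid.
const CENTERS = [
  { id: 'c1', name: 'Riverside Practice', x: 1, y: 1 },
  { id: 'c2', name: 'Hilltop Clinic', x: 10, y: 10 },
];

// Service (Data level): generic, returns raw records within range.
function locateMedicalCenters(position, range) {
  return CENTERS.filter(
    (center) => Math.hypot(center.x - position.x, center.y - position.y) <= range
  );
}

// Use case (Information level): "As a patient, I want to find a GP near me".
// It is handed the lookup function instead of reaching for the service itself.
function makeFindGpNearMe(locate) {
  return function findGpNearMe(patientPosition) {
    const rangeKm = 5;
    const nearby = locate(patientPosition, rangeKm);
    return nearby.length === 0
      ? `No GP within ${rangeKm} km.`
      : `GPs within ${rangeKm} km: ${nearby.map((c) => c.name).join(', ')}`;
  };
}

// Wiring happens at the edge of the application.
const findGpNearMe = makeFindGpNearMe(locateMedicalCenters);
console.log(findGpNearMe({ x: 0, y: 0 }));
```

The service only ever sees positions and ranges; the sentence meant for the patient is built in the use case.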

From data to strings

Okay, enough about data already! Pull me some strings!

A string can sit at any level of the DIKW framework. As the tech lead of your team, your role is to make sure that product value structures your software.

Since a string is defined as a sequence of symbols, it overlaps with data structures such as ArrayBuffer, Uint8Array, std::string, const char *, and the C#/F# Span<>. Developers are spoiled for choice, and still have to pick one.

What is the intent? Do you want your thing to be a fact, or something whose interpretation brings value?

Let's start with a story from the world of genetics.

Scientists had to rename some genes... because of Excel

Giving names to things is a difficult process. In the case of genetics, well, if you have a gene called Membrane Associated Ring-CH-Type 1, what short name would you give it? Imagine typing it into Excel. You probably thought of the same name that the HGNC did.

There are plenty of memes around about software incorrectly assuming that something is a date; I will let you google them. But this has affected DNA studies, and people have had to build workarounds.

So, what is the issue? What is the link between Excel cells, strings and data? A question of level and intent.

On which DIKW level do you think a cell is? Well... All of them!

Make sure the architecture enforces product intent

As a tech lead, your role is to ensure the safety of your architecture. To do that, you will probably use tools such as compilers, transpilers, and type-checking mechanisms. Leverage them. Acknowledge that, as the scientists found out, mistakes will be made. Someone will forget to add quotes around "March1" and the system will silently break.
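Here is a small sketch of what "enforcing intent" could look like in plain JavaScript. The `guessCell` function imitates an eager spreadsheet-like tool; its rules are my own simplification and not Excel's actual algorithm. The `GeneSymbol` class and `describeGene` function are likewise hypothetical: they wrap the text so the rest of the code knows it is a gene symbol and never tries to guess.

```javascript
const MONTHS = [
  'JANUARY', 'FEBRUARY', 'MARCH', 'APRIL', 'MAY', 'JUNE',
  'JULY', 'AUGUST', 'SEPTEMBER', 'OCTOBER', 'NOVEMBER', 'DECEMBER',
];

// Eager guesser (illustrative rules only): a month name or its three-letter
// abbreviation followed by one or two digits is taken for a date.
function guessCell(raw) {
  const match = /^([A-Za-z]{3,9})(\d{1,2})$/.exec(raw);
  if (match) {
    const name = match[1].toUpperCase();
    const index = MONTHS.findIndex((m) => m === name || m.slice(0, 3) === name);
    if (index !== -1) {
      return { kind: 'date', month: index + 1, day: Number(match[2]) };
    }
  }
  return { kind: 'text', value: raw };
}

// Intent made explicit: this value is a gene symbol, not free text.
class GeneSymbol {
  #value;

  constructor(value) {
    if (typeof value !== 'string' || !/^[A-Z][A-Z0-9-]*$/.test(value)) {
      throw new TypeError(`Not a gene symbol: ${String(value)}`);
    }
    this.#value = value;
  }

  toString() {
    return this.#value;
  }
}

// Domain code only accepts the wrapped form, so a bare string fails loudly.
function describeGene(symbol) {
  if (!(symbol instanceof GeneSymbol)) {
    throw new TypeError('describeGene expects a GeneSymbol');
  }
  return `gene ${symbol}`;
}

console.log(guessCell('MARCH1'));
console.log(describeGene(new GeneSymbol('MARCH1')));
```

With a type checker the same idea can be enforced before the code runs; the runtime check above is only the fallback for a language without one.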

Usual data types to represent sequential data

| Use case | Type of sequence | Example types | Notes |
| --- | --- | --- | --- |
| Handling (getting, displaying) user text input | Characters | string, std::string | Consider that users can write anything from their keyboard, including emojis or kanji. |
| Handling (receiving, sending) binary data | Bytes/integers | ArrayBuffer, Rust Box<[u8]>, C# byte[], C char * | Consider that the data may not be printable. It can be full of zeros or of things without a Unicode representation. |
| Handling (receiving, sending) binary data over text | Characters | A binary encoding such as base N inside a string | The most famous base transformation is base64. It allows a textual representation of a buffer. Another famous representation is base16, also known as hexadecimal. |
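The third row can be sketched in a few lines of JavaScript. The helper names `toHex`, `toBase64` and `fromBase64` are mine; `btoa` and `atob` are the standard globals available in browsers and in recent Node.js versions, which this sketch assumes.

```javascript
// Base16: two hexadecimal digits per byte.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

// Base64: btoa expects a "binary string" with one character per byte.
function toBase64(bytes) {
  let binary = '';
  for (const b of bytes) {
    binary += String.fromCharCode(b);
  }
  return btoa(binary);
}

// Reverse operation: back from text to bytes.
function fromBase64(text) {
  return Uint8Array.from(atob(text), (ch) => ch.charCodeAt(0));
}

// A payload that is not printable as text: it contains 0x00 and 0xff.
const payload = new Uint8Array([0x00, 0xff, 0x10, 0x20]);

console.log(toHex(payload));    // 00ff1020
console.log(toBase64(payload)); // AP8QIA==
console.log(fromBase64(toBase64(payload)));
```

The encoded forms are strings, so they can travel through JSON or a URL, but they are still data: nobody is supposed to read meaning into them.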

Binary sequences

A long time ago at my university, I used C char * to handle strings. This worked well because, at the time, we mostly used ASCII, along with ISO-8859-1, which was common in France. I wrote algorithms such as:

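#include <string.h>
#include <unistd.h>

/* Returns a pointer to the text following the first comma, or NULL. */
const char *findWordAfterComma(const char *input)
{
    for (const char *p = input; *p; p++)
    {
        if (*p == ',')
        {
            return p + 1;
        }
    }
    return NULL;
}

int main(void)
{
    const char *word = findWordAfterComma("hello,world");
    if (word != NULL)
    {
        /* print to screen: file descriptor 1 is standard output */
        write(STDOUT_FILENO, word, strlen(word));
    }
    return 0;
}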

If you do not see the issue here, well... it assumes that every single character takes one byte. Do not do this. Modern strings are encoded in memory using UTF-8, UTF-16, or other formats, where a character can span several bytes. Try this:

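console.log('🤓'.length); // 2: one emoji, two UTF-16 code units
console.log('你好'.length); // 2: two characters, one UTF-16 code unit each
console.log(new TextEncoder().encode('你好').length); // 6: three UTF-8 bytes per character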

You will notice that the same character sequence has a different length depending on whether you represent it as a string or in a binary-encoded form.
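To compare the different notions of "length" side by side, here is a small sketch. The `measure` helper is a hypothetical name; it relies only on standard features: `String.prototype.length` counts UTF-16 code units, iterating a string yields code points, and `TextEncoder` always produces UTF-8.

```javascript
// Reports three different "lengths" for the same text.
function measure(text) {
  return {
    utf16Units: text.length,                         // what .length counts
    codePoints: Array.from(text).length,             // string iteration is per code point
    utf8Bytes: new TextEncoder().encode(text).length // size once encoded as UTF-8
  };
}

console.log(measure('hello')); // { utf16Units: 5, codePoints: 5, utf8Bytes: 5 }
console.log(measure('你好'));   // { utf16Units: 2, codePoints: 2, utf8Bytes: 6 }
console.log(measure('🤓'));    // { utf16Units: 2, codePoints: 1, utf8Bytes: 4 }
```

Only plain ASCII text gives the same number three times, which is exactly the case the C function above was written for.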

The lesson of the story

Strings are for things you want to display, not for binary data.

If you call something data, do not handle its content!
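Both rules can be seen in one short experiment. In this sketch the sample bytes and the `forwardData` helper are made up for the example; `TextDecoder` and `TextEncoder` are the standard APIs, and the decoder is left in its default, non-fatal mode.

```javascript
// Four bytes of binary data. 0xff and 0xfe are never valid in UTF-8.
const original = new Uint8Array([0xff, 0xfe, 0x00, 0x41]);

// Wrong: pushing binary data through a string.
// Each invalid byte is replaced by U+FFFD, so the content is lost.
const asText = new TextDecoder().decode(original);
const roundTripped = new TextEncoder().encode(asText);

// Right: it is called "data", so it is forwarded without looking inside.
function forwardData(data) {
  return data.slice(); // byte-for-byte copy, no interpretation
}

const forwarded = forwardData(original);

console.log(original.length);     // 4
console.log(roundTripped.length); // 8: the two invalid bytes became 3 bytes each
console.log(forwarded.length);    // 4
```

Nothing threw an error on the string path. The bytes were simply replaced, which is the kind of silent failure the gene names ran into.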

Originally published on by Tristan Parisot
