Help understand and clarify AtomicDB concepts and usage

Question 1:

What is the precise meaning of ‘association’ in AtomicDB? Is it calculated from a model? Describe in detail the algorithm which sets it or determines its existence across potentially disparate data sources. Is it a functional-type relationship (subject -> verb -> target), a numerical value (number of co-occurrences of data elements), or something else? Tell us why we should trust this algorithm.


In the AtomicDB system all associations are bi-directional. Any item anywhere in vector space (in this case accessed through an encapsulated quad-d 128-bit token) can directly reference any other item(s), and since the AtomicDB vector space is made up of 10^18 virtual points, a very large number of discrete data items can be referenced.

An association is a reference from one data item to another, together with a corresponding reference back from the second item to the first. There is no separate ‘connector’ or predicate item, nor is there a table of data-item-indexed co-occurrence counts and keys.

One may think of it as an ‘n’-dimensional network of continuously counted relationships, organized in vector space dimensions, where each point in the network contains direct ‘paths’ (actually vector space indexes) to each and every other related point. The ‘algorithm’ is entirely fact-based and absolutely deterministic (non-statistical).
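The bidirectional, predicate-free structure described above can be sketched in a few lines of Python. This is a conceptual illustration only, not the AtomicDB API; the class and token names are invented for the example.

```python
# Conceptual sketch (not the AtomicDB API): a bidirectional association
# store in which every item holds direct references to its related items.
class AssociationStore:
    def __init__(self):
        self.links = {}  # token -> set of associated tokens

    def associate(self, a, b):
        # Every association is recorded in both directions at once;
        # there is no separate predicate/connector object.
        self.links.setdefault(a, set()).add(b)
        self.links.setdefault(b, set()).add(a)

    def related(self, token):
        return self.links.get(token, set())

store = AssociationStore()
store.associate("customer:42", "order:1001")
store.associate("order:1001", "product:7")

# Either endpoint can be used as an entry point into the network.
assert "order:1001" in store.related("customer:42")
assert "customer:42" in store.related("order:1001")
```

Because the reference is stored at both endpoints, any item can serve as an entry point into the network without consulting a central index.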

Question 2:

Explain how this technology is different from triplestore?


Triple stores hold subject – predicate – object records, typically in a two-table configuration: an entity table, which captures the namespace of the data, and one or more relation tables in which each triple is represented using the ids from the entity table. Namespace management is key to productive use of triple stores, and same-named entities from different contexts must be pre- or post-processed to disambiguate them.

In AtomicDB, the value of an item is just an attribute of the token representing the item. Because all data sets are auto-contextualized on ingestion, co-occurrence of terms is referenced from an abstraction that handles multiple instances mapped to different contexts using tokens in vector space. AtomicDB has no tables, and predicates are implemented as dimensions in vector space, not as edge objects referenced from triple records.
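The contrast between the two representations can be made concrete with a small sketch. Both data layouts below are assumed, simplified stand-ins (neither is vendor code): one models the classic entity-table-plus-relation-table triple store, the other models a token whose value is an attribute and whose associations live on a named dimension.

```python
# Triple-store style: an entity table plus a relation table of ids.
entities = {1: "Alice", 2: "Bob", 3: "knows"}
triples = [(1, 3, 2)]  # (subject_id, predicate_id, object_id)

# Token style (illustrative): the value is just an attribute of the
# token, and the association lives on a dimension named "knows"
# rather than in a separate record with a predicate object.
tokens = {
    "tok-a": {"value": "Alice", "dims": {"knows": {"tok-b"}}},
    "tok-b": {"value": "Bob", "dims": {"knows": {"tok-a"}}},
}

# Reading the same fact back from each representation:
subj, pred, obj = triples[0]
assert (entities[subj], entities[pred], entities[obj]) == ("Alice", "knows", "Bob")
assert "tok-b" in tokens["tok-a"]["dims"]["knows"]
```

Note that in the token sketch the association is stored at both endpoints, so no join against a relation table is needed to traverse it in either direction.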

Question 3:

Talk about AtomicDB in terms of ACID vs BASE, CAP theorem, what are trade-offs in using AtomicDB?


AtomicDB has no tables at all, so this is very difficult to answer without extensive explanation and qualification. In a (probably unsatisfactory) summary: because AtomicDB combines network, vector space and atomic models, where item uniqueness is guaranteed, data values participate in relationships only as attributes of items, and distribution and replication run on completely separate processing vectors, those trade-offs matter far less.

Question 4:

Who are other players in this field and what sets you apart from them?


There are three basic models out there: file-cluster or B-tree-based XML/JSON document stores, table-based triple stores, and in-memory column-oriented table stores. AtomicDB is none of those, but has the architectural advantage of being able to provide performance equivalent to all of them.

AtomicDB is an always-active network of informational elements interconnected in vector space. Each piece of data resides atomically in association with every other related piece of data, at the center of its own universe of relationships, and thus each piece of data is an entry point into the network. At the low level it is a network; at the high level it is a graph.

Neo4j is about the most advanced graph DB, but it suffers (as they all do) from namespace and meta-management limitations, as well as having any high-level contextualization hidden in the triple stores themselves, since none of that is native to the system. Indexers and meta attribution have to be bolted on and are not intrinsic to triples. MongoDB, Hadoop, etc. are great file tree stores for huge, simplistic data sets, since node and disk spanning is built in, but where data and relationship complexity is an issue, all of them need extensive post-processing (read: highly paid consultants and data scientists) to qualify what got put in there for each and every thing one might want to get out. HANA, QlikView, and hundreds of other in-memory systems are just snapshots of other data sets. AtomicDB is always read/write.

Question 5:

How does AtomicDB handle time series? How does it manage associations between data sources with entities that have attributes that change over time?


Entities and Attributes are Atomic Items and there is no internal distinction between them. Events are handled as transactions and are also Atomic Items, with relationships to the Entities and Attributes participating in the Event. Depending on the nature of the data sources and their intended use, one would typically utilize the cardinality of that relationship dimension to always show the latest Event reference, which would, thereby, always have the most up to date Attribute values associated.
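The "cardinality-one dimension that always shows the latest Event" idea can be sketched as follows. This is an assumed illustration of the semantics described, not AtomicDB code; the names `events` and `latest_reading` are invented for the example.

```python
# Sketch (assumed semantics): every event is kept as an item, while a
# cardinality-1 relationship dimension on the entity always points at
# the latest event, whose attribute values are therefore always current.
events = []          # full event history, every event is an item
latest_reading = {}  # entity -> latest event (the cardinality-1 dimension)

def record_event(entity, timestamp, attrs):
    event = {"entity": entity, "ts": timestamp, **attrs}
    events.append(event)            # history is preserved as items
    latest_reading[entity] = event  # the dimension shows only the latest

record_event("sensor:9", 100, {"temp": 20.5})
record_event("sensor:9", 200, {"temp": 21.1})

assert latest_reading["sensor:9"]["temp"] == 21.1
assert len(events) == 2  # earlier readings remain queryable
```

The point of the sketch is that "latest value" is a property of the relationship's cardinality, not of the data: the older events are never overwritten, only superseded on that one dimension.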

Question 6:

Describe any provisions for multiple servers if data sets get too big for single disk?


Because of the vector space mapping of the Token Keys used to represent the data elements, data sets can be mapped to any number of physical destinations, preferably on one or several contiguous high-bandwidth networks. Each Token Key is both a unique identifier in 128-bit space and a logical mapping to a specific node/disk/block/sector/offset (or equivalent) location where the data element resides. Segmentation or sharding in the classic sense is handled quite differently, since all AtomicDB systems can be configured to inter-relate with one another: every instance is compatible with every other instance by design.
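One way such a dual-role key can work is by packing the location fields into the 128-bit value itself. The real AtomicDB field widths and layout are not published; the widths below are purely hypothetical, chosen only so the fields sum to 128 bits.

```python
# Hypothetical layout, for illustration only: packing a logical location
# (node/disk/block/sector/offset) into a single 128-bit token.
NODE_BITS, DISK_BITS, BLOCK_BITS, SECTOR_BITS, OFFSET_BITS = 16, 16, 32, 32, 32
# 16 + 16 + 32 + 32 + 32 = 128 bits total

def pack(node, disk, block, sector, offset):
    token = node
    token = (token << DISK_BITS) | disk
    token = (token << BLOCK_BITS) | block
    token = (token << SECTOR_BITS) | sector
    token = (token << OFFSET_BITS) | offset
    return token  # a unique id that is also a physical address

def unpack(token):
    offset = token & ((1 << OFFSET_BITS) - 1); token >>= OFFSET_BITS
    sector = token & ((1 << SECTOR_BITS) - 1); token >>= SECTOR_BITS
    block = token & ((1 << BLOCK_BITS) - 1); token >>= BLOCK_BITS
    disk = token & ((1 << DISK_BITS) - 1); token >>= DISK_BITS
    return token, disk, block, sector, offset

t = pack(3, 1, 500, 42, 4096)
assert unpack(t) == (3, 1, 500, 42, 4096)
assert t < 1 << 128
```

Under this kind of scheme, routing a request to the right node is a bit-shift rather than a lookup, which is what lets data sets span machines without a separate shard map.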

Question 7:

How do we make this work for large disparate data-sets that may not be cleanly linked? How does AtomicDB associate data that was originally collected/ingested without any requirement at that time that they be linked in any way but may represent the same or associated objects?


Any field from any ingested data set can be post merged with any field from any other data set, and auto-data-merge / de-duplication / unification / correlation will occur. This function is actually a primitive.

Question 8:

Do we lose any functionality at all in going from SQL to AtomicDB (grouping, aggregating, date typing etc.)?



Question 9:

We often need to classify, cluster objects in sense of machine learning (both supervised, unsupervised) as well as select, extract, reduce dimensions to only relevant ones as in PCA for example. Can you describe how AtomicDB makes this job easier?


Meta management is fully integrated into the AtomicDB system. Classification, categorization, grouping and clustering are as simple as adding associations to any set of items. Dimension reduction is unnecessary because all data elements are fully contextualized and can be referenced selectively without any need for extraction. Add all fields of interest to a Model (essentially a view). Select target fields by clicking on them in a window. Select filter criteria by clicking on them in a window. Push the Get button. Review the results. No programmers, data specialists, data scientists or database specialists needed.

Question 10:

Do we need to understand more about the ETL tool itself? With only a “GET” function to retrieve data, it would appear that the ingest side is responsible for the adds/drops/updates?


The tool we have is an EL tool. Transformation is usually needed only when mapping extracted data sets to a different (usually incompatible) structure, such as a data warehouse or a new database. Since AtomicDB was designed to accommodate any existing data structure, we don’t need to transform for those reasons. We might want to transform a data set because it was really badly designed or poorly implemented, such as having columns which should be items, but that would be done with a mapping in a pre-processor. The API also has IMPORT, ADD, MODIFY and ASSOCIATE functions.
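The "columns which should be items" pre-processor mapping mentioned above might look like the following sketch. The table shape and the `columns_to_items` helper are assumptions made for illustration, not part of the AtomicDB tooling.

```python
# A badly designed source table: repeated phone1/phone2 columns that
# should really be independent items associated with the entity.
rows = [{"name": "Alice", "phone1": "555-0100", "phone2": "555-0101"}]

def columns_to_items(row, prefix):
    """Split off prefix-numbered columns as stand-alone items."""
    entity = {k: v for k, v in row.items() if not k.startswith(prefix)}
    items = [v for k, v in sorted(row.items()) if k.startswith(prefix) and v]
    return entity, items

entity, phones = columns_to_items(rows[0], "phone")
assert entity == {"name": "Alice"}
assert phones == ["555-0100", "555-0101"]
```

After this reshaping, each phone number can be ingested as its own item and associated with the entity, instead of being frozen into column positions.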

Question 11:

Couldn’t quite see how you would do a range query using “GET” sub-directives within the API, or a sort, sub-directives within the API? 


Data appeared already normalized and fairly clean, which we know is sometimes an issue. Manually editing in the GUI probably wouldn’t catch all needs. Is there anything special that would help here?

Almost always an issue. I have rarely seen ‘clean’ data, except from certain three-letter agencies after redaction.

In terms of pre-built cleansers, we find it easier to write a quick parser that bins the data into Known Good, Questionable, and Something’s Wrong Here bins. Because data items are unified, de-duplicated and contextualized, writing custom cleaning algorithms for pre- or post-processing is relatively trivial. If you don’t have in-house expertise, we can provide it as needed.
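A minimal example of that "quick parser" approach, with an invented validation rule (an integer age field) standing in for whatever checks a real data set would need:

```python
# Bin incoming rows into Known Good / Questionable / Something's Wrong
# before ingestion. The age-range rule is illustrative only.
def bin_rows(rows):
    bins = {"known_good": [], "questionable": [], "somethings_wrong": []}
    for row in rows:
        age = row.get("age")
        if isinstance(age, int) and 0 <= age <= 120:
            bins["known_good"].append(row)
        elif age is None:
            bins["questionable"].append(row)  # missing, maybe recoverable
        else:
            bins["somethings_wrong"].append(row)  # malformed value
    return bins

result = bin_rows([{"age": 34}, {"age": None}, {"age": "thirty"}])
assert len(result["known_good"]) == 1
assert len(result["questionable"]) == 1
assert len(result["somethings_wrong"]) == 1
```

The Known Good bin can be ingested immediately, while the other two bins go to whatever repair or review process fits the use case.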

Question 12:

Unstructured textual data? 


Yes, it is handled. We have written app-level parsers that identify all potentially subject-indicating terms and produce a semi-structured representation (in AtomicDB tokens) of the document, with bidirectional associations made on a heading, sentence, paragraph and section (chapter) basis. From that, feature sets, subject derivation and auto-similarity mapping can be done. We can also integrate the Semantic Parser of your choice.
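The structural side of that parsing step (splitting text into units and recording bidirectional containment associations) can be sketched as follows. The tokenization here is deliberately naive and the id scheme is invented; a production parser would be far more sophisticated.

```python
import re

def parse(doc):
    """Split a document into paragraphs and sentences, recording
    bidirectional containment associations between the units."""
    assoc = {}  # unit id -> set of associated unit ids (both directions)

    def link(a, b):
        assoc.setdefault(a, set()).add(b)
        assoc.setdefault(b, set()).add(a)

    for p_idx, para in enumerate(doc.split("\n\n")):
        p_id = f"para:{p_idx}"
        for s_idx, sent in enumerate(re.split(r"(?<=[.!?])\s+", para.strip())):
            s_id = f"sent:{p_idx}.{s_idx}"
            link(p_id, s_id)  # paragraph <-> sentence, both directions
    return assoc

assoc = parse("First sentence. Second sentence.\n\nNew paragraph.")
assert assoc["para:0"] == {"sent:0.0", "sent:0.1"}
assert "para:1" in assoc["sent:1.0"]
```

Once every sentence and paragraph is an addressable unit in such a network, similarity and subject-derivation passes can traverse the associations instead of re-scanning raw text.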

Question 13:

I imagine a scenario where two sets of data may not associate directly but only indirectly through a ‘third party’ data set. Describe how AtomicDB might determine there is a link between the first two sets.


If the third-party data set is also ingested and there are corresponding data fields, then by including that data set in the Model where the two original data sets reside, it will auto-correlate, unify the appropriate fields and de-duplicate the data.

Question 14:

How does AtomicDB handle continuous numeric data? Does each value get its own data node, or is data binned? We have numerical data potentially spanning vast numerical scales. Describe the binning/discretization algorithm, if there is one.


The best way to handle data streams depends on the intended use. Most often what matters is thresholds and patterns that evolve in, or are derived from, the data; and since that is usually based on some temporal aggregation, it is important to be able to process at a defined temporal granularity that may vary from use case to use case. To AtomicDB, feature sets are just patterns in relationships, and entities or events with similar features can be easily accessed using a reflexive association function composed of two GETs.
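The temporal-aggregation-plus-threshold pattern described above can be sketched like this. The granularity, threshold and data are arbitrary choices for the example, not AtomicDB behavior.

```python
from collections import defaultdict

# A continuous stream as (seconds, value) readings.
readings = [(3, 10.0), (17, 12.5), (65, 30.2), (70, 29.8)]

def aggregate(readings, granularity_s):
    """Roll raw readings up to a chosen temporal granularity (mean per bucket)."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts // granularity_s].append(value)
    return {b: sum(v) / len(v) for b, v in buckets.items()}

# One-minute granularity; flag intervals whose average crosses a threshold.
minute_avgs = aggregate(readings, granularity_s=60)
alerts = [b for b, avg in minute_avgs.items() if avg > 25.0]

assert minute_avgs[0] == 11.25
assert alerts == [1]
```

Each bucket (and each alert) could then become an item in its own right, associated with the entities that produced the underlying readings, so downstream queries work on the derived pattern rather than the raw stream.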

Have additional questions?

Please email us at the address below, and we would be happy to answer any questions you might have.


Thank you, we will get back to you within 24 hours.