Blog | Connoiter

King Tutte Data Mapper Pipeline

The King Tutte pipeline is named after Bill Tutte, a Canadian/British mathematician and WW2 code breaker who hung out with Alan Turing at Bletchley Park cracking Nazi codes. The Tutte Institute for Mathematics and Computing (TIMC) is also named after him. TIMC happens to be the organization where core data map technology is being developed (HDBSCAN, UMAP, EVoC, Toponymy, DataMapPlot, etc.). At its core, the King Tutte data map pipeline built by Connoiter simply strings a few of those technologies together to produce data maps. ...

DataMapPlot: Iceberg experiments

DataMapPlot generates interactive data map viewer web apps; think Google Maps but for some imaginary latent-space landscape. The data used to render the maps is currently shipped over HTTP as zipped Arrow files (in red in the image below). The proposal herein is to try another Apache technology, Iceberg (in green below), which might immediately perform better than Arrow over the web. If that turns out to be true, then as a corollary this upgrade would set up DataMapPlot for easy integration within the Iceberg ecosystem. ...

Data map linter

The data map linter (highlighted in green) in the Vectron web-app stack.

- A first deliverable of the project
- Useful for testing the spec, maybe looking at legacy data map formats
- Minus the UI, this is what might get folded into DataMapPlot
- Vectron stack: full top to bottom slice
  - Network (maybe not cache? i.e. no OPFS)
  - Data model in JS/TS
  - UI: just table view; stretch: [[https://www.react-graph-gallery.com/dendrogram][some dendrogram]] or treeview to show cluster tree
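To make the linter idea concrete, here is a minimal sketch in Python of the kind of check such a tool might run over a data map table. The column names (`x`, `y`, `cluster_id`) and the rules are assumptions made up for illustration; the actual data map spec is not defined in this excerpt, and the real linter is planned for the JS/TS stack.

```python
from dataclasses import dataclass
from math import isfinite

# Hypothetical required columns and types; the real spec may differ.
REQUIRED_COLUMNS = {"x": float, "y": float, "cluster_id": int}


@dataclass(frozen=True)
class LintIssue:
    row: int
    column: str
    message: str


def lint_rows(rows: list[dict]) -> list[LintIssue]:
    """Return one issue per missing, mistyped, or non-finite value."""
    issues = []
    for i, row in enumerate(rows):
        for column, expected in REQUIRED_COLUMNS.items():
            if column not in row:
                issues.append(LintIssue(i, column, "missing column"))
                continue
            value = row[column]
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                issues.append(LintIssue(i, column, f"expected {expected.__name__}"))
            elif expected is float and not isfinite(value):
                issues.append(LintIssue(i, column, "coordinate is not finite"))
    return issues


rows = [
    {"x": 0.5, "y": 1.25, "cluster_id": 3},
    {"x": float("nan"), "y": 2.0, "cluster_id": 1},
    {"x": 1.0, "cluster_id": "7"},
]
for issue in lint_rows(rows):
    print(issue)
```

A table view in the UI could list these issues row by row; the checks themselves need no UI, which is the part that might be reusable elsewhere.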

Vectron dialogs and pipeline

Vectron is architected such that the UI and the data map engine are completely separate. Apps such as Latent Interface have a similar separation, because their data map engine runs as Python on the web server behind an API. Vectron can do the same, using Latent Interface’s API, but Vectron can also run the engine in Workers locally within the browser (Python engine vs JavaScript engine). The following diagram adds UI dialogs which correspond to the pipeline stages introduced in the previous post, Data Map Pipeline: ...
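The UI/engine separation described here can be sketched as a small interface with two interchangeable implementations. This is an illustration only, written in Python for brevity: `DataMapEngine`, `LocalEngine`, `RemoteEngine`, `run_stage`, the stage names, and the URL path layout are all hypothetical and are not Vectron’s or Latent Interface’s actual API.

```python
from typing import Callable, Protocol


class DataMapEngine(Protocol):
    """What the UI needs from an engine, and nothing more (hypothetical)."""

    def run_stage(self, stage: str, params: dict) -> dict: ...


class LocalEngine:
    """Stands in for an engine running locally (e.g. in browser Workers)."""

    def run_stage(self, stage: str, params: dict) -> dict:
        return {"stage": stage, "where": "local", "params": params}


class RemoteEngine:
    """Stands in for an engine running on a server behind an API."""

    def __init__(self, send: Callable[[str, dict], dict]) -> None:
        self._send = send  # injected transport, so no real network is needed

    def run_stage(self, stage: str, params: dict) -> dict:
        return self._send(f"/stages/{stage}", params)


def on_dialog_submit(engine: DataMapEngine, stage: str, params: dict) -> str:
    """UI-side handler: identical regardless of where the engine runs."""
    result = engine.run_stage(stage, params)
    return f"{result['stage']} finished ({result['where']})"


def fake_send(path: str, params: dict) -> dict:
    # Placeholder for an HTTP call; the path layout is made up.
    return {"stage": path.rsplit("/", 1)[-1], "where": "remote", "params": params}


print(on_dialog_submit(LocalEngine(), "cluster", {"min_cluster_size": 10}))
print(on_dialog_submit(RemoteEngine(fake_send), "cluster", {"min_cluster_size": 10}))
```

The point of the design is that a dialog handler never knows which engine it is talking to, so swapping a server-side engine for an in-browser one does not touch UI code.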

Vectron stack

Vectron is Connoiter’s data map web app. Vectron is a reference implementation of a data map reader, developed while working out a data map schema standardization effort. Vectron also implements a data map pipeline based on WebGPU.

Tutte Institute data map machinery

The folks at the Tutte Institute have been building this machinery for over a decade.

Other relevant repos:
- Apricot: labeling, summarizing, data annotation/curation
- Glasbey: map coloring (think Five Color Theorem)
- PyNNDescent
  - Variable result set size for KNN on Sample Space Podcast 2024-05

Data map pipeline

(Unlike the previous post which focused on the data of a data map, this post focuses on the stages of the pipeline.)

Data map production

(The main subject here is the evolution of the data map through the pipeline. As such, the pipeline model is maximally simplified, down to two steps.) Data map production can be boiled down to two main steps:
1. Surveyor
2. Cartographer

Why Apache Iceberg

Why Apache Iceberg:
- Existing data map tools are roughly aligned (already using Parquet)
- Allows for schema evolution, which can be used to have an intentionally limited v1 and then easily evolve to v2, v3, …
- Geo data types are new in Iceberg 3 (which is finalizing in summer of 2025)
  - Based on GeoArrow
  - GeoArrow parser for DeckGL: geoarrow/deck.gl-layers
  - Useful data structures for representing objects on 2D maps
- Web-app context:
  - Very lightweight data web-app “stack”: icebird and friends
  - icebird + hyparquet = 85kb + 10kb
  - Way lighter than duckDB or lanceDB and their ilk
- Lakehouse context:
  - Iceberg is a widely adopted format for data lakehouses
  - Static files (over HTTP or local file system)
    - DataMapPlot generates static file data map web apps
  - “Lakehouse” is simply a fancy term for static storage (“object store”), plus a little DB machinery
  - As of early 2025, Iceberg Catalog is now available as a mainstream managed service
    - “The R2 Data Catalog in open beta, a managed Apache Iceberg catalog built directly into your Cloudflare R2 bucket.”
    - We can expect the same from all the other cloud providers RSN
- Network transport weight of various data stacks: ...
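On the schema evolution point: Iceberg tracks columns by a stable field ID rather than by name, so data written under an older schema can still be read through a newer one, with added columns coming back as null. The sketch below imitates that idea in plain Python. It is not an Iceberg client, and the v1/v2 column names are hypothetical stand-ins for a data map schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int  # stable identity; survives renames
    name: str


# Hypothetical v1 and v2 data map schemas: v2 renames field 3 and adds field 4.
SCHEMA_V1 = [Field(1, "x"), Field(2, "y"), Field(3, "label")]
SCHEMA_V2 = [Field(1, "x"), Field(2, "y"), Field(3, "topic"), Field(4, "cluster_id")]


def read_with_schema(stored: dict[int, object], schema: list[Field]) -> dict[str, object]:
    """Project a row stored by field ID onto a (possibly newer) schema.

    Fields that the stored row predates come back as None.
    """
    return {field.name: stored.get(field.field_id) for field in schema}


# A row written when only the v1 schema existed, keyed by field ID.
old_row = {1: 0.12, 2: -3.4, 3: "graph theory"}

print(read_with_schema(old_row, SCHEMA_V1))
print(read_with_schema(old_row, SCHEMA_V2))
```

Because old rows never need rewriting, a deliberately small v1 does not become a migration burden when v2 adds or renames columns.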

Version one goals

The goals for version one of the Data Map Schema project are intentionally simple:
- Harmonize DataMapPlot and latent-scope to use the same file format over the web.
- Connoiter is writing reference implementations for reader and writer:
  - Client: Vectron, a web-app that is a data map reader and visualizer
  - Server: Dockerized RAG repo data mapper, a data map writer
- The client and server reference implementations will be demo’d together in a RAG chat context. The chat UI will have a Data Map viz showing which parts of the RAG repository’s data are being included in the LLM’s context window based on the user’s query. This will work with some popular FOSS RAG codebase. ...
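The visualization in that demo amounts to joining the chunk IDs a retriever returns against the points on the data map. A minimal sketch of that join follows, assuming the RAG store and the data map share a chunk identifier; `MapPoint`, `chunk_id`, and `mark_in_context` are hypothetical names, and retrieval itself is out of scope here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MapPoint:
    chunk_id: str  # hypothetical key shared by the RAG store and the data map
    x: float
    y: float
    in_context: bool = False


def mark_in_context(points: list[MapPoint], retrieved_ids: list[str]) -> list[MapPoint]:
    """Flag the points whose chunks the retriever placed in the LLM context."""
    retrieved = set(retrieved_ids)
    return [replace(p, in_context=p.chunk_id in retrieved) for p in points]


points = [
    MapPoint("doc1#0", 0.1, 0.9),
    MapPoint("doc1#1", 0.2, 0.8),
    MapPoint("doc2#0", -1.5, 0.3),
]
# Pretend the retriever returned these chunk IDs for the user's query.
highlighted = mark_in_context(points, ["doc1#1", "doc2#0"])
print([p.chunk_id for p in highlighted if p.in_context])
```

A viewer would then style the flagged points differently, so the user can see which regions of the map fed the answer.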