Printed Machine Learning Classifiers

Muhammad Husnain Mubarik*, Dennis D. Weller†, Nathaniel Bleier§, Matthew Tomei§

Jasmin Aghassi-Hagmann§, Mehdi B. Tahoori§ and Rakesh Kumar§

§University of Illinois Urbana-Champaign, †Karlsruhe Institute of Technology, §University of Applied Sciences Offenburg

Abstract—A large number of application domains have requirements on cost, conformity, and non-toxicity that silicon-based computing systems cannot meet, but that may be met by printed electronics. For several of these domains, a typical computational task to be performed is classification. In this work, we explore the hardware cost of inference engines for popular classification algorithms (Multi-Layer Perceptrons, Support Vector Machines (SVMs), Logistic Regression, Random Forests and Binary Decision Trees) in EGT and CNT-TFT printed technologies and determine that Decision Trees and SVMs provide a good balance between accuracy and cost. We evaluate conventional Decision Tree and SVM architectures in these technologies and conclude that their area and power overhead must be reduced. We explore, through SPICE and gate-level hardware simulations and multiple working prototypes, several classifier architectures that exploit the unique cost and implementation tradeoffs in printed technologies - a) Bespoke printed classifiers that are customized to a model generated for a given application using specific training datasets, b) Lookup-based printed classifiers where key hardware computations are replaced by lookup tables, and c) Analog printed classifiers where some classifier components are replaced by their analog equivalents. Our evaluations show that bespoke implementation of EGT printed Decision Trees has 48.9× lower area (average) and 75.6× lower power (average) than their conventional equivalents; corresponding benefits for bespoke SVMs are 12.8× and 12.7× respectively. Lookup-based Decision Trees outperform their non-lookup bespoke equivalents by 38% and 70%; lookup-based SVMs are better by 8% and 0.6%. Analog printed Decision Trees provide 437× area and 27× power benefits over digital bespoke counterparts; analog SVMs yield 490× area and 12× power improvements. 
Our results and prototypes demonstrate feasibility of fabricating and deploying battery and self-powered printed classifiers in the application domains of interest.

Index Terms—printed electronics, machine learning

I. INTRODUCTION

While the impact of computing appears to be ubiquitous in today’s society and economy, a large number of important domains are still minimally touched. Consider the over 10-trillion dollar fast-moving consumer goods (FMCG) market [47], for example. Disposables such as packaged foods, beverages, toiletries, over-the-counter drugs, and other consumables are sold largely without any embedded computing devices that could help with identification and tracking [69] (is this today’s pill?), quality monitoring [14] (is this milk bad?), brand authentication [56] (is this apple Golden Delicious?), or interactivity [72] (is my beer at the temperature I like?).

The primary reason why FMCG domains (as well as domains such as low-end healthcare (e.g., bandages and wound dressings), agriculture [13], and environment [3]) have not seen much penetration of computing is the cost limitation of today’s silicon-based computing systems. Silicon-based systems continue to cost much more than the cost requirements of these domains. For example, item-level tagging of several FMCG products - consider apples, milk, soda bottles, and bandages - must have sub-cent costs [64] (i.e., equivalent to the cost of a barcode [64]) that silicon-based systems cannot meet due to the high manufacturing, testing, and assembly costs of such systems [46]; even the cheapest microcontrollers and RFIDs cost several cents [64]. Many of the above domains also have stretchability, porosity, non-toxicity, and flexibility requirements that silicon-based systems cannot meet [41].

Low voltage printed electronics [27] has emerged as a promising technology to target such application domains. Printing technologies often rely on maskless [76], portable [44], and additive [27] manufacturing methods which can greatly reduce costs and production timelines [28]. Such technologies also lead to devices that are conformable [36] and non-toxic [49]. Furthermore, recently developed printed technologies (e.g., EGT [26]) are low-voltage, allowing them to be battery-powered or potentially self-powered [5] when used in the context of the above applications.

In this paper, we focus on printed machine learning (ML) classifiers. A large number of printed applications may need to make classification decisions in the field. For example, a printed smart wound dressing [48] may be used to determine if a wound has healed. A printed in-situ sensor [72] may determine if a packaged food item has expired. A printed pulse oximeter [45] may determine when the oxygenation or pulse rate levels are abnormal. Prior work has not explored the design space of printed machine learning classifier architectures for any learning algorithm. This is not surprising since such exploration relies on design tools which require process design kits (PDKs) for printed technologies. Such PDKs have just started becoming available [66], [80] as the technologies have begun to mature, making developing such PDKs worthwhile.

In this research, we perform an exploration of low-cost classifier architectures for printed technologies (Fig. 1) using recently developed EGT and CNT-TFT PDKs [10]. Our exploration yields several interesting observations. First, since printed technologies have orders of magnitude larger feature sizes than state-of-the-art CMOS, the circuits designed and fabricated in printed technologies have significantly worse area and power characteristics than silicon counterparts (Table I). Therefore, simple classification algorithms and models that can be implemented at low gate count are strongly favored (e.g., Decision Trees and Support Vector Machines (SVMs)). Second, since both non-recurring engineering (NRE) costs and per unit-area fabrication costs in printed technology are low, highly customized (bespoke) classifier architectures become feasible even at low to moderate volumes.

* Equal contribution
This paper makes the following contributions:

- We perform the first exploration of different classification algorithms in terms of accuracy and potential cost for two printed technologies (EGT [26] and CNT-TFT [65]). Our results show that simple classification algorithms such as Decision Trees and SVMs provide a good balance in terms of accuracy and potential cost.

- We develop and evaluate bespoke printed classifier architectures. We show that EGT-bespoke Decision Tree implementations have 4× lower delay, 75× lower power, and 48× lower area (on average) than their conventional counterparts. Corresponding benefits for bespoke SVM implementations are 1.4×, 12.7×, and 12.8× respectively. To the best of our knowledge, this is the first quantification of the benefits of bespoke classifiers in printed technology. We also fabricated a working EGT prototype of a bespoke Decision Tree. This is the first prototype of a Decision Tree in a printed technology.

- We develop and evaluate lookup-based printed classifier architectures, where certain logic functions (e.g., comparators in Decision Trees and MAC units in SVMs) are replaced by lookup tables. Our results show that lookup-based EGT Decision Trees improve the area of bespoke counterparts by 1.93× and power by 1.65×, with 50% delay overhead (on average). Lookup-based SVMs see, in the best case, a 40% reduction in delay with 8% and 1% improvement in area and power respectively. To the best of our knowledge, these are the first lookup table-based implementations of machine learning classifiers for printed technology. We also fabricated a working prototype of an EGT ROM that can be used to implement lookup-based printed classifiers.

- We develop and evaluate analog printed classifiers where data representation and computation (comparisons for Decision Trees and MACs for SVMs) are implemented using analog logic. Our results show that EGT analog Decision Tree classifiers outperform their digital bespoke counterparts by 437× and 27× for area and power respectively and are 1.63× slower. Corresponding area and power benefits for analog SVM architectures are 490× and 12× respectively; analog SVMs are 1.36× slower than bespoke counterparts. These are the first printed implementations of analog Decision Trees and SVMs and the first quantification of their benefits. We also fabricate and evaluate a prototype analog Decision Tree; this is the first prototype of an analog Decision Tree in a printed technology.

II. BACKGROUND AND RELATED WORK

Printed electronics is an emerging technology which holds promise to enable flexible [29], large-area [30], and ultra-low-cost computing systems [31] through the use of printing-based fabrication techniques such as screen [5], roll-to-roll [21], [39], and inkjet [21], [25], [27], [73] printing. Inkjet printing, for example, has attracted a lot of attention, as its mask-less fabrication process allows for contact-less printing on a wide range of carrier materials such as flexible substrates. The jetting of droplets is controlled by CAD software, thus enabling digital printing [20].

Some printing methods rely purely on additive manufacturing steps, while others are based on both additive and subtractive processes. Comparable to subtractive silicon-based processes, subtractive printing processes include fabrication steps which involve the development of photoresists and subsequent etching. Due to this, the subtractive processes are relatively expensive compared to additive processes, as they demand expensive equipment and infrastructure [16]. In contrast, fully additive processes avoid these photoresist and etching steps.

TABLE I: PPA analysis of common ML operations in EGT, CNT-TFT and TSMC 40nm. D, A, P: Delay, Area, Power

<table>
<thead>
<tr>
<th>Components</th>
<th>EGT: D (msec), A (cm²), P (mW)</th>
<th>CNT-TFT: D (µsec), A (mm²), P (mW)</th>
<th>TSMC-40nm: D (µsec), A (µm²), P (mW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Comparator</td>
<td>11.2 0.15 0.61</td>
<td>19.5 0.21 8.32</td>
<td>0.23 94 0.14</td>
</tr>
<tr>
<td>MAC</td>
<td>27 1.1 1.2 4.12</td>
<td>16.1 1.4 0.57</td>
<td>0.57 255 0.51</td>
</tr>
<tr>
<td>Relu</td>
<td>2.54 0.03 0.14</td>
<td>1.44 0.35 0.10</td>
<td>0.1 67 0.46</td>
</tr>
</tbody>
</table>

1 Some may prefer to call these classifiers mixed-signal or semi-analog since digital elements are still present.
TABLE II: Accuracy and computation requirements of different classification algorithms; models generated by scikit-learn. DT-1/2/4/8: Decision Tree classifiers with depth 1/2/4/8, RF-2/4/8: Random Forest classifiers with 4/8/16 estimators, where the max depth of each tree is up to 8, MLP-1: Multi-Layer Perceptron with 1 hidden layer and up to 5 hidden nodes, MLP-3: Multi-Layer Perceptron with 3 hidden layers and up to 5 nodes per hidden layer, A: Accuracy on test data, #C: Number of comparisons, #M: Number of MAC operations.

<table>
<thead>
<tr>
<th></th>
<th>DT-1</th>
<th>DT-2</th>
<th>DT-4</th>
<th>DT-8</th>
<th>RF-2</th>
<th>RF-4</th>
<th>RF-8</th>
<th>MLP-1</th>
<th>MLP-3</th>
<th>LR</th>
<th>SVM-C</th>
<th>SVM-R</th>
</tr>
</thead>
<tbody>
<tr>
<td>Accuracy</td>
<td>0.56</td>
<td>0.79</td>
<td>0.38</td>
<td>0.71</td>
<td>0.91</td>
<td>0.85</td>
<td>0.93</td>
<td>0.55</td>
<td>0.84</td>
<td>0.79</td>
<td>0.79</td>
<td>0.88</td>
</tr>
</tbody>
</table>

III. CHOOSING CLASSIFICATION ALGORITHM FOR PRINTED APPLICATIONS

The first question we attempt to answer is: what classification algorithms can be feasibly supported in printed technologies? The choice of the classification algorithm to support in printed hardware for a given application depends both on the characteristics of the application (which determine the accuracy for a given algorithm) and the costs of the implementation of the algorithm in hardware (e.g., power, area, and latency). We studied five different classification algorithms - Decision Trees [59] (DTs), Random Forests [12] (RFs), Multi-Layer Perceptrons [62] (MLPs), Logistic Regression (LR), and Support Vector Machines (SVMs) [11] - and evaluated them in terms of accuracy and potential hardware cost using scikit-learn over seven datasets belonging to simple machine learning applications that consume at least one sensor input and have low precision, duty cycle, and sample rate requirements. Printed [8], [19], flexible [15], wearable [52], [63], [67], [78], or RFID [6], [17], [53], [68], [74] sensors already exist for these applications, making them suitable for our study. Six datasets (Arrhythmia [34], cardiotocography [9], pendigits [4], GasID [32], and RedWine/WhiteWine [18]) were chosen from the UCI Machine Learning Repository [24]; the HAR (Human Activity Recognition) dataset was taken from [7]. Arrhythmia [34] classifies heart rhythms based on ECG sensor outputs. Cardiotocography (cardio) [9] classifies cardiotocograms based on several sensed values and their histograms. Pendigits [4] classifies hand-written numbers based on pen-tip pressure and location. GasID [32] uses chemical sensors to classify gases. Red and White Wine [18] classify wines by their quality using pH and metal trace sensors. HAR [7] classifies a person’s activity (walking, standing, sitting, etc.) based on accelerometer outputs.

We pre-processed each application dataset to remove all the non-sensor (categorical) features. Each dataset is then divided (70/30) into a training dataset and a test dataset. In the training set, all input features are normalized to have zero mean and unit variance. During training, hyperparameters are selected using scikit-learn’s built-in hyperparameter search functionality (RandomizedSearchCV) with 5-fold cross validation and 100 iterations. MLPs, SVMs, and logistic regression are trained until convergence with the default tolerance. Once training is over, we generate classification accuracy (in Table II) on the test dataset.

Table II, in conjunction with Table I, allows us to understand the accuracy-cost tradeoffs between different classification algorithms in printed and silicon technologies. We present results for Decision Trees with depth 1, 2, 4 and 8, Random Forests with 2, 4, and 8 trees, and MLPs with 1 and 3 hidden layers and 5 nodes per hidden layer. We evaluated both classification and regression versions of SVMs (SVM-C and SVM-R respectively). SVM-C uses the one-vs-one multi-class classification strategy, so there is one binary classifier for every pair of class labels. For SVM-R, the class labels (integers) are treated as real values used to train a single SVM regressor. During inference, the output (which is a real value) is mapped to the nearest class label.
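The two SVM strategies above can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used in this work: the function names `ovo_classifier_count` and `nearest_class_label` are hypothetical, and breaking ties toward the smaller label is an assumption that the text does not specify.

```python
def ovo_classifier_count(num_classes):
    """Binary classifiers needed by one-vs-one: one per pair of class labels."""
    return num_classes * (num_classes - 1) // 2


def nearest_class_label(output, labels):
    """Map a real-valued SVM-R output to the nearest integer class label.

    Ties go to the smaller label (an assumption, not stated in the text).
    """
    # min() returns the first minimum, so sorting makes the tie-break explicit.
    return min(sorted(labels), key=lambda label: abs(output - label))


if __name__ == "__main__":
    # Hypothetical 10-class problem (e.g., digits): 45 pairwise classifiers.
    print(ovo_classifier_count(10))
    # Hypothetical regressor output 6.4 with integer labels 3..9.
    print(nearest_class_label(6.4, range(3, 10)))
```

The pairwise count grows quadratically with the number of classes, whereas the regression variant always needs a single model plus the final mapping step, which is one reason the two variants can differ so much in hardware cost.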

To estimate the potential hardware cost of each classifier in printed and silicon technologies, we observe that the computation in each classifier (during classification) is dominated by two operations: comparisons and two-input multiply-accumulates (MACs). The potential hardware cost of each classifier, therefore, depends on the number of these operations in the trained model (Table II) and their cost of hardware implementation (Table I). We count the number of each operation in the trained model of each classifier generated by scikit-learn.
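The cost model described above is a weighted sum of operation counts. The sketch below is illustrative only: `estimate_cost` and the `TSMC40` dictionary are hypothetical names, and the per-unit silicon values (area in µm², power in mW) are an assumed reading of Table I, chosen because 2000 MAC units at 255 µm² and 0.51 mW each give roughly the 0.51 mm² and 1 W upper bounds quoted for silicon in this section.

```python
def estimate_cost(op_counts, per_op_cost):
    """Sum area and power over all operation types.

    op_counts:   {"cmp": number of comparators, "mac": number of MAC units}
    per_op_cost: {"cmp": (area, power), "mac": (area, power)} in consistent units
    """
    area = sum(n * per_op_cost[op][0] for op, n in op_counts.items())
    power = sum(n * per_op_cost[op][1] for op, n in op_counts.items())
    return area, power


# Assumed per-unit silicon costs: (area in um^2, power in mW).
TSMC40 = {"cmp": (94.0, 0.14), "mac": (255.0, 0.51)}

if __name__ == "__main__":
    for macs in (19, 2000):
        area_um2, power_mw = estimate_cost({"mac": macs}, TSMC40)
        print(f"{macs} MACs: {area_um2 / 1e6:.4f} mm^2, {power_mw / 1e3:.4f} W")
```

This first-order model ignores registers, wiring, and control logic, so it is best read as a lower bound on the cost of a full classifier.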

To calculate the implementation cost of an operation, we implemented each operation natively in RTL and synthesized it for each technology (Table I). The costs in printed technologies are orders of magnitude higher than those of the silicon implementation. This means that a classification algorithm whose hardware implementation exceeds a certain number of MAC units and comparators may be infeasible for printed applications, even as its silicon implementation may have acceptable overheads.

Results show that Decision Trees are clearly the lowest cost classifiers. MLPs, LR and SVM-C (SVM classification) algorithms have high hardware cost (19 to almost 2000 MAC units), making them likely infeasible for printed applications (the corresponding area and power overheads in silicon, 0.004 to 0.51 mm² and 0.009 to 1 W, are most likely acceptable for most applications). RFs also have high hardware cost for high accuracy goals (52 to 168 cm² and 0.21 to 0.68 W in EGT for RF-8) due to the large number of comparators used. However, the cost can be scaled down at the expense of accuracy by reducing the depth of each tree in the forest. SVM-Rs have higher hardware cost than most Decision Trees, but still have much lower cost than other classifiers.

Overall, our results show that simple classification algorithms such as Decision Trees (for all applications) and SVM-Rs (for some applications) provide a good balance in terms of accuracy and estimated costs. In fact, for HAR, Decision Trees have the highest accuracy, tied with more complex classifiers. SVM-Rs perform better than Decision Trees for applications such as wine quality where there are simple linear relationships between input features and class labels. More complex algorithms such as MLPs and LR have higher accuracy on average, but have high overhead in printed technologies. Random Forests may allow tunable accuracy-cost tradeoffs. However, Decision Trees are the kernel of a Random Forest ensemble; any optimization for Decision Trees is a natural optimization for Random Forests.

As a result, we restrict our detailed evaluations in subsequent sections to Decision Trees and SVM-Rs.

A. Conventional Classifier Architectures in Printed Technologies

Decision Tree and SVM-R classifiers have low cost while still providing reasonable accuracy. For this reason, we choose to study these classifiers in greater depth. We implemented conventional decision tree and SVM-R classifiers in EGT and CNT-TFT to understand how well these architectures may meet the requirements of printed applications and where the cost bottlenecks and opportunities for improvement are.

1) Decision Trees: First, we consider decision trees. Complexity of the decision tree classification algorithm can be scaled by changing the number of levels in the tree and the number of comparisons at each level. For deeper trees, there exists a meaningful parallelism versus work-efficiency tradeoff. If comparisons are evaluated serially, only the comparisons leading to the correct classification are required. However, we might want to evaluate comparisons at multiple levels in parallel to reduce latency. The more levels we evaluate in parallel, the more wasted comparisons we will have to do.

We evaluate two implementations of decision trees in hardware, which correspond to the two ends of the parallelism versus wasted-work tradeoff space. The first implementation performs comparisons fully serially, so we call this the serial decision tree. The corresponding architecture diagram is shown in Fig. 2. In the serial implementation, there is a single comparator. There are two ROM memories, one for thresholds that the input features are compared to, and the other for classifications. During inference, the working node in the tree is stored in a shift register. The shift register is initialized with the value 1, and, in each subsequent cycle, the result of the current comparison is stored in the least significant bit of the shift register. Since the shift register works for any tree with a given depth, it is capable of indexing nodes that do not exist for a given (possibly unbalanced) tree. Therefore, we must either transform the value of the shift register before indexing into threshold memory or size the threshold memory assuming a full tree. We chose the latter. Once the initial set bit in the shift register reaches the most significant bit, we know inference is complete and use the stored index to look up the classification in the classification ROM.
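The serial datapath described above can be modeled behaviorally as follows. This is a sketch under stated assumptions, not the RTL used in this work: the function name, the heap-style node numbering (root at index 1, children of node i at 2i and 2i+1), storing a feature index next to each threshold, and the "greater than" comparison direction are all illustrative choices.

```python
def serial_tree_infer(x, threshold_rom, class_rom, depth):
    """Behavioral model of a serial decision tree with a single comparator.

    threshold_rom[i] = (feature_index, threshold) for node i, sized for a
    full tree of the given depth (index 0 is unused).
    class_rom[leaf]  = class label for each of the 2**depth leaves.
    """
    reg = 1                 # shift register initialized with the value 1
    msb = 1 << depth        # position the initial set bit must reach
    cycles = 0
    while not (reg & msb):
        feat, thr = threshold_rom[reg]       # register value indexes the ROM
        bit = 1 if x[feat] > thr else 0      # the one comparator, reused each cycle
        reg = (reg << 1) | bit               # result enters the least significant bit
        cycles += 1
    leaf = reg & (msb - 1)  # bits below the marker bit index the class ROM
    return class_rom[leaf], cycles


if __name__ == "__main__":
    # Hypothetical depth-2 tree over two 8-bit features.
    THRESHOLD_ROM = [None, (0, 100), (1, 50), (1, 200)]
    CLASS_ROM = [0, 1, 2, 3]
    print(serial_tree_infer([120, 30], THRESHOLD_ROM, CLASS_ROM, depth=2))
```

Each inference takes exactly `depth` cycles and `depth` comparisons in this model, which is the work-efficient end of the tradeoff; the price is that the threshold ROM is sized for a full tree even when the trained tree is unbalanced.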

For each decision tree depth, we generated RTL for the corresponding decision tree performing comparisons of 8-bit values. For the number of input features, we used the number of nodes in the tree or the maximum number of input features in any of our applications, whichever is smaller. For our modeling, we use standard cell libraries and ROM models from [10]. ROM models for TSMC-40nm are derived from [79]. Table III shows the latency, area, and power of logic and memory (ROM) in the decision trees for different technologies.

The results show that inference latency (logic delay + ROM delay) is high, especially for deep trees (almost 200 ms for depth-8 trees!). Area and power overheads are also excessive. For example, the power requirement of EGT DT-8 exceeds the peak power produced by several printed and hybrid harvesters [40], [42] (Fig. 3). Similarly, EGT DT-4 and DT-8 cannot be powered by Blue Spark 10 mA h (2 mA peak current) and 30 mA h (2 mA peak current) printed batteries [71]; Molex [2] 90 mA h (20 mA peak current) printed batteries have a three times larger area footprint. Analogously, the high area of the serial trees has a direct impact on yield, bill of materials (BOM), and fabrication throughput.

Note that DT-4 and DT-8 have similar gate count. For trees in Table III, the number of unique features is equal to \( \min\{2^d - 1, 14\} \) where \( d \) is the depth of the tree and 14 is the average of the number of unique features in the datasets used. For both DT-4 and DT-8, the number of unique features is 14, which fixes the size of input multiplexers. The small difference in the gate counts of DT-4 and DT-8 is due to the increase in the width of the shift register, which is linear with the depth of the tree.
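A quick evaluation of the expression above shows why the input multiplexer stops growing beyond depth 4. The helper name `unique_features` is hypothetical; the cap of 14 is the value given in the text.

```python
def unique_features(depth, cap=14):
    """Number of multiplexer inputs: min(2^d - 1, cap)."""
    return min(2 ** depth - 1, cap)


if __name__ == "__main__":
    for d in (1, 2, 4, 8):
        print(f"depth {d}: {unique_features(d)} unique features")
```

For depths 4 and 8 the cap is already reached, so both trees use the same multiplexer size and differ only in the parts that scale with depth.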

Our second implementation performs comparisons fully in parallel, so we call it a maximally parallel decision tree. The architecture of the maximally parallel tree is shown in Fig. 2. For every node in the decision tree, there is a comparator and two registers. One register holds a threshold and the other holds an input feature. The results of all the comparisons are then used to select the classification using a multiplexer.
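A behavioral sketch of the maximally parallel organization follows. As before, this is an illustration under assumptions rather than the actual RTL: the function name, the heap-style node numbering (root at 1, children of node i at 2i and 2i+1), a full tree, and the "greater than" comparison direction are all hypothetical choices.

```python
def parallel_tree_infer(x, nodes, classes, depth):
    """Behavioral model of a maximally parallel decision tree.

    nodes[i]  = (feature_index, threshold) for every node i of a full tree.
    classes   = class label for each of the 2**depth leaves.
    Returns (class label, number of comparators evaluated).
    """
    # One comparator per node; in hardware all of these evaluate at once.
    bits = {idx: int(x[feat] > thr) for idx, (feat, thr) in nodes.items()}
    # Multiplexer: the comparison results select exactly one leaf.
    idx = 1
    for _ in range(depth):
        idx = 2 * idx + bits[idx]
    return classes[idx - (1 << depth)], len(bits)


if __name__ == "__main__":
    # Hypothetical depth-2 tree over two 8-bit features.
    NODES = {1: (0, 100), 2: (1, 50), 3: (1, 200)}
    print(parallel_tree_infer([120, 30], NODES, [0, 1, 2, 3], depth=2))
```

All 2^d - 1 comparisons are performed regardless of the input, while only d of them lie on the path to the selected leaf; this is the wasted work that buys lower latency.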

The delay, area, and power of the maximally parallel trees are shown in Table IV. These trees are, on average, 1.32× faster than the serial counterparts. However, area and power overheads are excessive when compared to serial implementations (20× bigger and consume 8.07× more power for EGT implementation). In fact, only a depth-1 EGT parallel tree can be powered either by Blue Spark or Molex printed batteries (Fig. 3); the deeper trees consume too much power to be powered by any printed battery or a printed energy harvester.

2) Support Vector Machines: We also implemented a conventional inference architecture for SVM-Rs. The implementation is fully parallel, i.e., every MAC operation is assigned to its own MAC unit in hardware. The input features and coefficients (chosen equal in number to the max input features in all the datasets - 263, for arrhythmia) are stored in registers. The hardware multipliers, equal in number to the number of input features, multiply the input features with the corresponding trained coefficients. All the multiplication results are then added and mapped to the nearest class using comparators and a class encoder as shown in Fig. 2.
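The dataflow just described (parallel multipliers, an adder, then comparators and a class encoder) can be sketched as below. The details are assumptions made for illustration: the function name `svm_r_infer`, class labels numbered 0 to K-1, comparator thresholds placed at the midpoints between adjacent labels, an explicit bias term, and an encoder that counts how many comparators fire.

```python
def svm_r_infer(features, coeffs, bias, num_classes):
    """Behavioral model of a fully parallel SVM-R inference datapath."""
    # One multiplier per input feature; in hardware these operate in parallel.
    products = [f * c for f, c in zip(features, coeffs)]
    # Adder: accumulate all products (plus an assumed bias term).
    total = sum(products) + bias
    # Comparator bank: one threshold between each pair of adjacent labels.
    thresholds = [k + 0.5 for k in range(num_classes - 1)]
    fired = [total > t for t in thresholds]
    # Class encoder: the number of comparators that fired is the class index,
    # which also saturates out-of-range sums to the first or last class.
    return sum(fired)


if __name__ == "__main__":
    # Hypothetical 3-feature model with 4 classes.
    print(svm_r_infer([1, 2, 3], [0.5, 0.25, 0.1], bias=0.0, num_classes=4))
```

With 263 input features, this structure instantiates 263 multipliers, which is consistent with the conventional SVM being dominated by its MAC hardware.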

Table V shows the delay, area and power analysis of 4, 8, 12 and 16 bit (width of input features and coefficient) SVMs implemented in EGT, CNT-TFT and TSMC40nm. Our evaluation shows that even the smallest SVM (4-bit) has an area of 85 cm² and a power of 288 mW. In fact, no conventional SVM can be powered by a printed battery or energy harvester (Fig. 3).
Fig. 3: Conventional Parallel (PDT) and Serial (SDT) EGT Decision Trees of depths 1, 2, 4, and 8 placed into sets based on which sources can power them: Molex [2], and Blue Spark [70], [71] printed batteries. Printed and hybrid harvesters [40], [42] cannot power any conventional EGT classifier architecture.

**TABLE V: Conventional SVMs: 4, 8, 12 and 16 bits.**

| SVM | EGT D (msec) | EGT A (cm²) | EGT P (mW) | EGT Gates | CNT-TFT D (msec) | CNT-TFT A (cm²) | CNT-TFT P (mW) | CNT-TFT Gates | TSMC-40nm D (msec) | TSMC-40nm A (cm²) | TSMC-40nm P (mW) | TSMC-40nm Gates |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| SVM-4 | 85.6 | 85.3 | 288 | 32k | 0.042 | 1.02 | 2.48 | 38k | 1.47 | 0.019 | 6.1 | 9.98k |
| SVM-8 | 125 | 439 | 1424 | 170k | 0.07 | 5.39 | 15 | 202k | 2.07 | 0.09 | 25 | 46k |
| SVM-12 | 142 | 860 | 2632 | 252k | 0.06 | | 27 | 309k | 2.4 | 0.19 | 46 | 95k |
| SVM-16 | 151 | 1445 | 4294 | 403k | 0.07 | | 45 | 624k | 2.6 | 0.32 | 76 | 153k |

Overall, our evaluations show that conventional classifier architectures have high delay, area, and power overheads when implemented in printed technologies. Subsequent sections present printing-specific architectures that are capable of reducing these overheads for several applications by multiple orders of magnitude.

IV. BESPOKE PRINTED CLASSIFIERS

Both NRE costs and per unit-area fabrication costs in printed technology are low [31], even sub-cent [54], especially for additive and mask-less technologies such as inkjet printing that may even allow portable and on-demand printing. This enables highly customized bespoke classifiers, i.e., classifier architectures that are customized to a model generated for a given application using specific training datasets, even at low to moderate volumes. Such a degree of customization is mostly infeasible in lithography-based silicon technologies, especially at low to moderate volumes, due to high NRE costs (lithography equipment, material processing equipment, etc.) as well as high fabrication costs (maskset costs, etc.). This degree of customization enables reduced area and gate count designs, which further reduces marginal costs. In this section, we present the first quantification of benefits of bespoke classifiers in printed technologies.

2 As reference, the Fujifilm Dimatix 2850 Materials inkjet printer [57] that we use to print electronics costs 50000 USD and achieves sub-cent marginal cost per printed circuit when accounting for the cost of cartridges, ink, and other materials; in contrast, even older silicon foundries may cost hundreds of millions of dollars [35].

3 Marginal costs may get even lower for a higher degree of commercialization (since cartridges/inks/materials may become cheaper). Fixed costs (e.g., printer) may also decrease at high volumes.

A. Bespoke Decision Trees

To generate bespoke serial decision trees (Fig. 4), we explore trees with width 4, 8, 12 and 16 bits for a given application and use scikit-learn to calculate the accuracy corresponding to those widths. We pick the tree that gives the best accuracy for the application (up to three significant digits) with minimum hardware cost (e.g., for Arrhythmia DT-1, accuracy remains the same when we increase the classifier width from 4 to 16, hence we pick DT-1 with 4-bit comparator width). In addition, we customize the mux to have a number of inputs equal to the number of input features. We also customize the width of the shift register to be equal to the depth of the tree. The size of each entry in the threshold ROM is customized to the width of the threshold values. The number of entries in the class ROM is customized to match the number of classes. Fig. 6 shows the delay, area, and power analysis of EGT bespoke serial trees relative to conventional serial trees. EGT bespoke serial trees have 1.2%, 37%, and 22% improvements in latency, area, and power (on average). Corresponding benefits for CNT-TFT bespoke serial trees (not shown) are 1.02%, 33%, and 26% respectively.
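The width selection rule described above can be sketched as follows. The function name `pick_width` and the accuracy values in the usage example are hypothetical, and treating the smallest bit width as the minimum hardware cost is an assumption made for illustration.

```python
def pick_width(accuracy_by_width):
    """Pick the smallest bit width whose accuracy matches the best accuracy
    when both are rounded to three significant digits."""
    rounded = {w: float(f"{a:.3g}") for w, a in accuracy_by_width.items()}
    best = max(rounded.values())
    return min(w for w, a in rounded.items() if a == best)


if __name__ == "__main__":
    # Hypothetical accuracies that barely change with width, so the narrowest wins.
    print(pick_width({4: 0.5612, 8: 0.5614, 12: 0.5613, 16: 0.5611}))
    # Hypothetical case where 4 bits lose accuracy and 8 bits suffice.
    print(pick_width({4: 0.70, 8: 0.79, 12: 0.7901, 16: 0.7899}))
```

Rounding before comparing keeps the rule from chasing accuracy differences that are too small to matter while paying for wider comparators and ROM entries.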
Fig. 5: **Left/Middle:** Design flow of the proposed hardware prototype of a 2-bit, balanced, depth-2 bespoke digital Decision Tree. The top-level design of the hardwired Decision Tree is converted into a logic-level representation and then into a transistor-level circuit description. The layout is then extracted; a microscope photo of the fabricated Decision Tree is also shown. **Right:** Transient measurements.

Fig. 6: **Bespoke Serial Trees** normalized against the conventional Serial Trees.

To generate a bespoke maximally parallel tree (Fig. 4), the registers that are used for holding thresholds and inputs in the conventional architecture are removed. We train all the trees using our scikit-learn framework to get the trained threshold values which we then hardwire in the RTL, replacing the threshold registers in the conventional maximally parallel implementations. The input feature registers are replaced with connections directly to the input feature port they will use. The synthesis tool can then optimize away unnecessary gates in the design. For example, now that the actual trained threshold values are hardwired, the comparators have only one variable input which greatly simplifies overall design. We use the same methodology as bespoke serial trees to finalize the bitwidth of the tree. Results for EGT bespoke parallel trees are shown in Fig. 7. Our results show that the bespoke maximally parallel decision trees perform much better in all measures compared to their conventional counterparts. Latency, area, and power improve by 3.9×, 48.9×, and 75.6× respectively (on average). CNT-TFT bespoke maximally parallel trees yield similar benefits - latency, area, and power benefits (not shown) are 6.6×, 62.6×, and 27.3× respectively (on average).
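To see why a comparator with one hardwired input is so much simpler, consider the check x > T for a constant T. It reduces to an OR of AND terms over the bits of x alone: one term for each zero bit of T, requiring that bit of x to be 1 together with every higher bit of x where T has a 1. The sketch below builds and checks such terms; it is an illustration of the principle with hypothetical function names, not a description of what the synthesis tool actually does.

```python
def const_gt_terms(threshold, width):
    """Product terms for x > threshold when threshold is a hardwired constant.

    Each term is a list of bit positions of x that must all be 1.
    """
    terms = []
    ones_above = []  # positions already visited where threshold has a 1
    for i in reversed(range(width)):          # scan from MSB to LSB
        if (threshold >> i) & 1:
            ones_above.append(i)
        else:
            terms.append(ones_above + [i])    # x wins at bit i, matches 1s above
    return terms


def eval_terms(x, terms):
    """Evaluate the OR of AND terms on the bits of x."""
    return any(all((x >> i) & 1 for i in term) for term in terms)


if __name__ == "__main__":
    # Hypothetical 4-bit threshold 0b1010: x > 10 iff (x3 & x2) | (x3 & x1 & x0).
    print(const_gt_terms(0b1010, 4))
```

A generic comparator needs logic over both operands, whereas here the threshold bits disappear entirely and only a few AND/OR gates over x remain, with the exact count depending on the trained threshold value.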

EGT parallel bespoke trees have 10.3× lower area, 28.8× lower power, and 9.51× lower latency, on average, compared to their bespoke serial equivalents. In fact, unlike conventional counterparts, parallel bespoke trees are strictly better than serial bespoke trees.

**B. Bespoke SVMs**

We also developed bespoke SVM classifiers (Fig. 4). The number of input features, coefficients, multipliers, and comparators, as well as the width of the registers, multipliers, adder, and comparators were fixed to the corresponding application-specific values which we get after training the SVM regression models on the corresponding datasets using our scikit-learn framework. In addition, since registers are expensive in printed technologies (a DFF is 1.41 mm², 0.018 mm², and 3.99 µm² in EGT, CNT-TFT, and TSMC 40nm respectively; corresponding power values are 121 µW, 77 µW, and 4.7 µW), we replace the registers with hardwired trained coefficient values. Now that the coefficients are hardwired, our multipliers have only one variable input, which further simplifies the logic of the hardware multipliers. Fig. 11 shows that EGT bespoke SVMs have 1.4× lower delay, 12.8× lower area, and 12.7× lower power (on average) compared to conventional SVM implementations. Corresponding benefits for CNT-TFT bespoke SVMs (not shown) are 1.7×, 16×, and 8.96× respectively.

C. A Bespoke Decision Tree Prototype

Finally, to demonstrate the feasibility of printing bespoke classifiers, we designed and fabricated an EGT-based 2-bit encoded bespoke balanced and binary Decision Tree of depth 2 with threshold \(10_2\) (Fig. 5). Such a tree is easily transformed into a simple logic gate representation (Fig. 5). We extracted the corresponding circuit-level layout and used the Fujifilm Dimatix 2850 Materials inkjet printer for EGT-based fabrication on an ITO-sputtered glass substrate, which was structured by laser-ablation to obtain the passive conductive tracks. A microscope photo of the fabricated circuit after printing the EGTs is shown in Fig. 5.

To test operation of the prototype, all output class label pins
\(C_1, C_2, C_3, C_4\) were measured against all possible input signals
of the relevant bit positions in \(x_1\) and \(x_2\), which are: \(x_1^2, x_2^2, x_1^1\).
As can be seen from Fig. 5, only one class label \(C_i\) is activated
at any time, in accordance with the functional description
in Fig. 5. Thus the fabricated circuit is fully functional. Also,
as the outputs of root and split nodes in this design have a
high input impedance and low output impedance, the presented
depth-2 tree can be used as a building block for building
arbitrary larger trees.

To the best of our knowledge, this is the first quantification of
the benefits of bespoke classifiers in printed technology. Also,
this is the first prototype of a digital Decision Tree in a printed
technology.

V. LOOKUP-BASED PRINTED CLASSIFIERS

It is well known [22] that a computational function typically
implemented using digital or analog logic can often also be
implemented as a lookup table (LUT). The practicality of this
approach depends on a judicious selection of the computation
to replace with a LUT as well as the overheads of the technology
the LUT is implemented in. In several printed technologies,
including EGT, ROMs have low area and power overhead (e.g., a
1-bit EGT ROM has an area of 0.05 mm², while a one-input
inverter has an area of 0.22 mm² [10]; the corresponding power
values are 3.13 μW and 9.6 μW respectively) since ROMs
can be built simply as a crossbar architecture where the
cross-points are shorted by printing a conductive material (such as
PEDOT:PSS) to represent a bit-value [10]. Unlike silicon, where
ROM cells can have high delay (e.g., 900× slower than the
inverter cell [79]), the delay of an EGT crossbar-based ROM cell
is also low (within 1.5× of the inverter cell [10]). This opens
up the possibility of implementing classifiers using lookup
tables (LUTs) where computation logic (e.g., comparators in
Decision Trees and multipliers in SVMs) is replaced by LUTs
to reduce area and power overhead.

Two issues arise when replacing logic with ROMs. The first
issue is: how much computation should be replaced with each
LUT? If the amount of computation we replace with a ROM
is too large, the number of ROM entries in the replacement
is also large, and we may not see benefits. If the amount
of computation we replace is too small, the surrounding logic (e.g.,
muxes and decoders) needed to access ROM can become too
expensive. An exploration is needed to find the right amount
of computation to replace with ROM. The second related
issue is: how much reuse occurs for the surrounding logic
(e.g., decoders)? For the ROM sizes that we are interested in,
the decoder, for example, is expensive enough that a ROM-
based comparison is always more expensive than its logic-
based counterpart. The effective overhead of the decoder per
lookup can be decreased, however, if the same decoder is
used multiple times (i.e., if it can be shared across multiple
computations). Fortunately, this is often the case with classifiers
since classifiers compute on the same input feature multiple
times.
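The amortization argument above can be made concrete with a small cost model. The sketch below is only illustrative: it assumes every crosspoint costs the 1-bit EGT ROM area quoted earlier, and the decoder area `DECODER_AREA_MM2` is a hypothetical placeholder rather than a measured value.

```python
ROM_BIT_AREA_MM2 = 0.05  # 1-bit EGT ROM area quoted in the text [10]


def area_per_lookup(address_bits, output_bits, decoder_area_mm2, shared_lookups):
    """Effective area of one lookup when one decoder serves several lookups.

    decoder_area_mm2 is a hypothetical input, not a figure from the paper.
    """
    if shared_lookups < 1:
        raise ValueError("shared_lookups must be at least 1")
    rom_area = (2 ** address_bits) * output_bits * ROM_BIT_AREA_MM2
    return rom_area + decoder_area_mm2 / shared_lookups


DECODER_AREA_MM2 = 8.0  # hypothetical decoder cost, for illustration only
unshared = area_per_lookup(4, 1, DECODER_AREA_MM2, shared_lookups=1)
shared = area_per_lookup(4, 1, DECODER_AREA_MM2, shared_lookups=8)
print(f"{unshared:.2f} mm^2 per lookup without sharing")
print(f"{shared:.2f} mm^2 per lookup with 8-way sharing")
```

Under these assumed numbers the decoder dominates an unshared lookup, and its contribution shrinks in proportion to the number of computations that reuse it.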

A. Lookup Replacements in Decision Trees and SVMs

In serial Decision Trees, there is only one comparator to
replace and none of the inputs are used simultaneously, so
ROMs are not a good fit for logic replacement (the area
and power of an individual lookup-based comparator were 6.7×
and 7.3× those of the non-lookup-based comparator,
respectively, due to decoder overhead). In parallel Decision
Trees, however, there are often many comparisons using the
same input feature, leading to significant decoder reuse. Fig. 8
shows the architecture of the lookup-based implementation of
maximally parallel trees. In Fig. 9, we show the latency, area,
and power benefits we obtained from replacing all comparators
in the EGT parallel tree implementations with lookup-based
equivalents (results are normalized against bespoke maximally
parallel trees). In many cases, especially with shallow trees,
there is not enough input feature reuse for lookup tables to
be useful. But, in the best case, we see 13%, 38%, and 70%
improvements in delay, area, and power. For CNT-TFT, the
ROM area is typically greater than logic area (0.05 mm² 1-bit
ROM vs 0.002 mm² one-input inverter [10]), while ROM power
is lower than logic power (2.77 μW 1-bit ROM vs 8.08 μW
one-input inverter [10]). As a result, a lookup-based parallel
tree implementation provides a 76.2% power benefit, on average,
at the cost of increasing the area by 69×.
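The decoder reuse in parallel trees can be pictured as a single ROM addressed by one input feature, with one output column per threshold compared against that feature. The sketch below is a behavioral illustration only; the function names and the example thresholds are assumptions, and it does not model the crossbar circuit itself.

```python
def build_comparator_rom(thresholds, width):
    """Row x holds one bit per threshold: 1 if x <= threshold, else 0."""
    return [[int(x <= t) for t in thresholds] for x in range(2 ** width)]


def lookup(rom, x):
    """One decode of x returns the results of all comparisons on that feature."""
    return rom[x]


# Hypothetical example: three tree nodes compare the same 4-bit feature
# against the thresholds 3, 9, and 12.
rom = build_comparator_rom([3, 9, 12], width=4)
print(lookup(rom, 5))   # comparisons 5 <= 3, 5 <= 9, 5 <= 12
print(lookup(rom, 13))  # comparisons 13 <= 3, 13 <= 9, 13 <= 12
```

Each additional node that tests the same feature adds one ROM column but no additional decoder, which is the source of the sharing benefit described above.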

For SVMs, we replaced the multipliers with ROM-based implementations (Fig. 8). However, the area and power of an individual lookup-based multiplier were 1.30× and 1.17× those of the non-lookup-based multiplier, respectively, due to decoder overhead. Fig. 12 shows the latency, area, and power of lookup-based SVMs normalized with respect to bespoke SVMs in EGT. Since every input feature is used only once, there is no decoder sharing like there was for parallel Decision Trees. As such, we do not see any benefits.

⁶ ROM cells also have high power in silicon: ∼1200× the power of an inverter cell [79].

Fortunately, lookup-based classifier architecture introduces...

B. Lookup Prototyping

Finally, to demonstrate feasibility of implementing lookup-based classifier architectures, we fabricated a 4x1 printed one-time programmable ROM element. The four rows of the ROM in Fig. 14 are accessed by a decoder logic block consisting of pass transistors $T_1 - T_4$, while data is stored in a resistive crossbar architecture with printed resistors $R_1 - R_4$ at the crossbar interconnects. The read signal of a read operation is obtained from the output voltage $V_{out}$ across the sensing resistor $R_{sense}$. The printed ROM basically implements a voltage divider structure, with the fixed $R_{sense}$ in the pull-down network and the variable printed resistors $R_i$ in the pull-up network. By tuning the geometry of the printed resistors, different resistance states can be encoded which represent multiple bits of information. The chosen resistances of the printed prototype were: $R_1 = 2R_{sense}$, $R_2 = \infty$ (not printed), $R_3 = R_{sense}/2$ and $R_4 \sim 0\Omega$ (maximum resistor area).

Thus, 2 bits of information can be encoded per ROM element, and 8 bits for the whole 4x1 ROM. Transient measurements in Fig. 14 show data being read out successfully based on the decoded address.

The delay of the prototyped ROM element was about 10 ns, with an average power consumption of 39 μW. The area requirement was 38 mm$^2$. The printed prototype ROM can be easily scaled to larger memory sizes by adding rows or columns and can, therefore, serve as the building block for lookup-based printed classifiers.
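As a worked check of the 2-bit encoding, the sketch below evaluates the voltage divider for the four chosen resistances. It assumes an ideal pass transistor (zero on-resistance), a supply normalized to 1 V, and resistances expressed in units of $R_{sense}$; these simplifications are ours for illustration and are not measured values.

```python
import math


def rom_read_voltage(r_cell, r_sense=1.0, vdd=1.0):
    """Divider output across r_sense with r_cell in the pull-up path."""
    if math.isinf(r_cell):
        return 0.0  # unprinted resistor: output is pulled down to ground
    return vdd * r_sense / (r_cell + r_sense)


# Resistances from the prototype, in units of R_sense.
levels = {
    "R1": rom_read_voltage(2.0),       # 2 * R_sense
    "R2": rom_read_voltage(math.inf),  # not printed
    "R3": rom_read_voltage(0.5),       # R_sense / 2
    "R4": rom_read_voltage(0.0),       # approximately 0 ohm
}
for name, v in sorted(levels.items(), key=lambda kv: kv[1]):
    print(f"{name}: {v:.3f} x VDD")
```

Under these idealizations the four resistances map to four evenly spaced output levels (0, 1/3, 2/3, and 1 times the supply), which is consistent with storing 2 bits per ROM element.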

VI. ANALOG PRINTED CLASSIFIERS

One effective method for significantly reducing transistor count and, therefore, overall area and power overhead in printed classifiers could be to judiciously substitute complex logic (e.g., multi-bit comparators in Decision Trees) with small analog circuits that use only a few transistors [23]. For some applications, an analog classifier architecture may also allow sensor outputs to be connected directly to the classifier, avoiding costly analog-to-digital converters (and the reverse conversion).

In silicon-based classifiers, such substitutions are particularly challenging since a) they introduce additional verification and test challenges [58], [33], and b) noise and mismatch constraints force the analog devices to be large, so any area and power benefits may be lost [33]. In printed technologies, low fabrication costs allow iterative refinement to fix or reduce noise and mismatch issues. Therefore, such analog substitutions may be more feasible.

A. Analog Replacements in Decision Trees and SVMs

To build an analog Decision Tree architecture, we observe that, at each node of the Decision Tree, the binary decision can be formulated as an if-else statement with a condition of the form $x_k \leq \tau_j$, where $\tau_j$ is a pre-defined threshold determined by the learning phase. For an analog implementation, this binary comparison can be realized by a back-to-back inverter, which has a printed resistor in the pull-up network of one of the inverters and a transistor in the pull-up network of the opposite inverter (see root node in Fig. 15). The features $x_k$ are encoded as voltage signals and normalized to the interval [0V, 1V]. The input is then applied to the gate of the transistor in the pull-up network and thereby converted into a resistance value.

Depending on the input voltage level, the equivalent transistor resistance takes a value between the On- and Off-resistances ($[R_{on}, R_{off}]$), which are set by the transistor characteristics. Next, the threshold $\tau_j$ is encoded as a resistor using the following mapping function:

$$R_j = \frac{\tau_j - \tau_{j_{\min}}}{\tau_{j_{\max}} - \tau_{j_{\min}}} \cdot (R_{max} - R_{min}) + R_{min},$$

where $R_{min}, R_{max}$ are the technology-dependent and feasible (printable) resistor values, and $\tau_{j_{\min}}, \tau_{j_{\max}}$ are determined by the trained decision tree model.
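The mapping function can be sketched in a few lines of Python. The resistor range used in the example (10 kΩ to 100 kΩ) and the threshold range are hypothetical values chosen for illustration; the printable range depends on the technology.

```python
def threshold_to_resistance(tau, tau_min, tau_max, r_min, r_max):
    """Linearly map a trained threshold onto the printable resistor range."""
    if tau_max == tau_min:
        raise ValueError("tau_max and tau_min must differ")
    return (tau - tau_min) / (tau_max - tau_min) * (r_max - r_min) + r_min


# Hypothetical printable range and threshold range, for illustration only.
R_MIN, R_MAX = 10e3, 100e3
r_low = threshold_to_resistance(0.0, 0.0, 1.0, R_MIN, R_MAX)
r_mid = threshold_to_resistance(0.5, 0.0, 1.0, R_MIN, R_MAX)
r_high = threshold_to_resistance(1.0, 0.0, 1.0, R_MIN, R_MAX)
print(r_low, r_mid, r_high)
```

The smallest trained threshold maps to $R_{min}$, the largest to $R_{max}$, and intermediate thresholds are interpolated linearly between them.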

Based on the difference between the resistances of the transistor and the resistor, one output node ($S_1$ or $S_2$) is pulled up more strongly to VDD than the other, and the bi-stable back-to-back inverter converges to a state where the output nodes are complementary ('1'/'0' or '0'/'1'). These output signals are then passed to the child nodes, where only one is enabled at a time, based on the analog binary comparison. This structure guarantees that at any level of the tree, only one child is selected, and hence, at the leaf level, one and only one leaf is selected. An interesting side-effect of this is that switching activity is limited to the depth of the tree. In effect, there is implicit logic which gates off unused portions of the circuit from consuming dynamic power. For any Decision Tree design, the process of adding split nodes to the last layers is repeated until the desired Decision Tree architecture is reached, and the class labels are read out from the leaves of the last split nodes in each branch. Due to the insertion of selector transistors in the split nodes, the resulting voltage levels deteriorate from the root node down to the split nodes in the last layer. This signal attenuation across a cascade of split nodes can be compensated for by inserting additional inverters (buffers) before the inputs of the selector transistors to restore the signal levels.

Fig. 16 shows the latency, area, and power of the analog implementation of bespoke maximally parallel trees in EGT. Our calculations show that analog trees have $437 \times$ less area and $27 \times$ less power (on average), with a slight increase in latency ($1.6 \times$), compared to digital bespoke maximally parallel trees.

We similarly developed an analog SVM implementation where we replaced the MAC operation with a one-time programmed resistive crossbar architecture, depicted in Fig. 15. The crossbar architecture is programmed by printing resistors with different geometries at the crossbar interconnects. The inputs to the crossbar are voltage signals, and the output current is sensed separately for each column.

The MAC operation is performed by applying Kirchhoff's laws to the resistor network. The output voltage \( V_{out}^{(i)} \) of a virtually grounded column \( i \) is computed by:

\[
V_{out}^{(i)} = \sum_{c=1}^{P} \frac{V_c}{R_{i,c}} \left( \sum_{c'=1}^{P} \frac{1}{R_{i,c'}} \right)^{-1} = \sum_{c=1}^{P} V_c \, w_i^{(c)} \tag{1}
\]

with

\[
w_i^{(c)} = \frac{1}{R_{i,c}} \left( \sum_{c'=1}^{P} \frac{1}{R_{i,c'}} \right)^{-1} \tag{2}
\]

where \( P \) is the number of rows per column, \( V_c \) is the voltage applied to row \( c \), and \( R_{i,c} \) is the resistor printed at the crosspoint of row \( c \) and column \( i \).

Thus, the MAC operation can be directly derived from (1), where the weights \( w_i^{(c)} \) are set by printing resistance values \( R_{i,c} \) that solve (2). With the voltages set as the MAC inputs, \( V_c = x_c \), the mathematical form of a multi-input MAC operation is obtained: \( y = \sum_{c=1}^{P} w^{(c)} x_c \).
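The relation between printed resistances and MAC weights can be checked numerically. The sketch below is a behavioral model of a single crossbar column under ideal conditions (no wire resistance, no loading); the function names and the example resistances are assumptions for illustration.

```python
def crossbar_weights(resistances):
    """Weights of one column: each conductance divided by the column total."""
    conductances = [1.0 / r for r in resistances]
    total = sum(conductances)
    return [g / total for g in conductances]


def crossbar_output(voltages, resistances):
    """Weighted sum of the row voltages for one column."""
    weights = crossbar_weights(resistances)
    return sum(v * w for v, w in zip(voltages, weights))


# Hypothetical column with three rows; resistances in arbitrary units.
weights = crossbar_weights([1.0, 2.0, 2.0])
v_out = crossbar_output([1.0, 0.0, 1.0], [1.0, 2.0, 2.0])
print(weights, v_out)
```

In this idealized model the weights of a column are non-negative and sum to one, so only the ratios of the printed resistances matter. How signed or unnormalized coefficients are realized is outside the scope of this sketch.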

Fig. 17 shows the latency, area, and power of the analog implementation of bespoke SVMs. Our calculations show that, in analog implementations, area and power improve by 490× and 12× respectively (on average), with a slight increase in latency (1.3×). To the best of our knowledge, these are the first printed implementations of analog Decision Trees and SVMs and the first quantification of their benefits.

**B. Analog Decision Tree Prototype**

Finally, to demonstrate the feasibility of analog printed classifiers, we fabricated an analog 2-level Decision Tree based on EGT technology. The fabricated 2-level Decision Tree consists of one root node and two split nodes (Fig. 15), resulting in 11 EGTs and 3 printed resistors. In this layout, a PEDOT resistor is printed into the gap between the source of \( T_5 \) and \( c_4 \). The overall fabrication process was similar to the one described in Section IV-C. In addition to EGTs and resistors, crossovers were also inkjet printed to make the connections, using an isolation layer (Dimethylsulfoxide (DMSO) and Polycarbonate (PC)) in combination with a conductive layer (PEDOT:PSS). A microscope photo of the fabricated 2-level Decision Tree is provided in Fig. 15.

The transient measurements for the root node are depicted in Fig. 15. As expected, when the input \( x_1 \) is at logical ‘1’, \( S_1/S_2 \) are in state ‘1’/‘0’. When \( x_1 \) is ‘0’, the state changes to ‘0’/‘1’. The transient response of the right split node to all 4 input combinations is also shown in Fig. 15. In the case the split node is unselected (\( x_1 \) is high), the output voltages \( C_3/C_4 \) are pulled down to 0V. On the other hand, if the split node is selected (\( x_1 \) is low), \( C_3/C_4 \) are pulled up or down, according to the input signal \( x_2 \). The worst case output signals of the split node are clearly distinguishable (405mV), and hence the printed analog Decision Tree was functioning correctly. This is the first prototype of an analog Decision Tree in a printed technology.

**VII. DISCUSSION**

A printed ML classifier is only a component of a complete classification system (Fig. 18). Some classification applications (e.g., HAR, Pendigits, Red-Wine and White-Wine) require no feature extraction, as the accelerator performs inference directly on the sensed signals. For other applications, feature extraction can be performed either with a custom designed fixed-function unit, or with software running on a printed microprocessor [10]. Sensors are either integrated directly into computational units, bypassing ADCs [60], [61] or use printed ADCs [1]. Since the role of any printed system is known at print-time, interfaces can be custom. Additionally, due to the low cost of wires and high cost of flip-flops [10], parallel ports may be preferred to serial ports. Since these components are integrated to the same substrate at print time and are unpackaged, there are no concerns over pin counts. In general, the ideal applications for printed classifiers are those with non-categorical data (signals have to be measured in the field) with minimal feature extraction requirements, and relaxed latency requirements.

Fig. 19 shows that, unlike conventional classifier architectures (Fig. 3), most printing-specific classifier architectures can be powered by a printed battery or an energy harvester. However, we still see that several classifiers are currently infeasible due to power limitations. We also see that the power supply requirements of bespoke classifiers are dataset dependent (all bespoke analog SVMs can be powered by Blue Spark batteries, except for Arrhythmia). These results motivate the need for further research in printed batteries and harvesters. Additionally, they suggest that certain applications will need to sacrifice classification accuracy in order to meet power budget requirements.

Conventional printed classifiers are expensive even when a full system is considered. For example, we estimate EGT-printed 2-bit / 4-bit ADCs to cost 3.76 mm² / 25.4 mm² in area and 60 μW / 360 μW power [10]. Conventional EGT-printed classifiers (Tables III, IV, V) are often much bigger (~20 to 1445 cm²) and consume orders of magnitude more power (1.6 to 4200 mW). Similarly, a printed microprocessor based feature extraction (FE) may cost ∼2 to 3 cm² [10]; printed sensors can be as small as ∼0.5 mm² and consume <2 mW [38]. Again, conventional printed classifiers are much more expensive (Tables III, IV, V). A classifier’s system-level overhead will be even higher when ADCs and FE engines are optimized for area or power, or not used entirely (e.g., in case of direct interfacing [60]). The techniques proposed in this paper, therefore, would provide significant system-level benefits. This would be true even when power supply is considered; commercial printed batteries occupy 20 to 50 cm², while printed backscatter mechanisms (e.g., for RFIDs) are at least an order of magnitude smaller [75].

Design of printed classifiers is highly automatable. After training a decision tree on a dataset, scikit-learn's DecisionTreeClassifier class exposes the internal structure of the trained tree, allowing us to traverse the tree and recursively transform it into RTL for a bespoke decision tree classifier (we perform a similar transformation for SVM). Ozer et al. propose a similar design flow. For our bespoke parallel trees, the generated RTL module body consists of a single assignment statement for each node in the tree, as well as a 'casex' statement to choose the correct output class, a relatively straightforward transformation. For the serial trees, the tree is transformed into hex files representing the class and threshold values, which are stored in ROM. For lookup-based classifiers, we replace comparisons in decision trees and MACs in SVMs by generating RTL lookup tables in place of those operations. An automated process for finding what to replace with lookup tables and how to replace it has the potential to lead to better results. We know that one indicator of a beneficial replacement is MISD parallelism (since it allows decoder reuse). Finding MISD parallelism in a design tool can be done by finding high-fanout nets. Then, for each output connected to a given net, the tool finds the maximum amount of logic that depends only on that net and replaces that logic with a ROM. The correct methodology for automatically determining how to break up a given computation into multiple lookup tables and logic that combines their results is not obvious and requires further study. Generation of analog classifiers is understandably more difficult to automate (e.g., we do not know of any printed analog libraries); we consider it a subject of future work. Our digital SVM and DT generator is published under an open source license at https://github.com/PrintedComputing.
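The tree-to-RTL transformation for parallel trees can be sketched as follows. This is a minimal illustration, not the published generator: it takes plain lists laid out like the `children_left`, `children_right`, `feature`, and `threshold` arrays of a fitted scikit-learn tree's `tree_` attribute, and it assumes that thresholds have already been quantized to integers and that a hypothetical `leaf_class` list gives the class index at each leaf.

```python
TREE_LEAF = -1  # scikit-learn marks a leaf with child index -1


def tree_to_verilog(children_left, children_right, feature, threshold,
                    leaf_class, width=4, name="bespoke_tree"):
    """Emit one assignment per internal node plus a casex over node outputs."""
    nodes = range(len(children_left))
    internal = [i for i in nodes if children_left[i] != TREE_LEAF]
    leaves = [i for i in nodes if children_left[i] == TREE_LEAF]
    n_inputs = max(feature[i] for i in internal) + 1
    n_classes = max(leaf_class[i] for i in leaves) + 1
    out_bits = max(1, (n_classes - 1).bit_length())

    ports = [f"input [{width - 1}:0] x{k}" for k in range(n_inputs)]
    ports.append(f"output reg [{out_bits - 1}:0] cls")
    lines = [f"module {name}({', '.join(ports)});"]

    # Hardwired comparison per node: the node output is 1 when the left branch is taken.
    for i in internal:
        lines.append(
            f"  wire n{i}; assign n{i} = (x{feature[i]} <= {width}'d{threshold[i]});"
        )

    # One casex arm per leaf; nodes off the root-to-leaf path are don't-cares.
    arms = []

    def walk(node, path):
        if children_left[node] == TREE_LEAF:
            pattern = "".join(path.get(i, "x") for i in internal)
            arms.append(
                f"      {len(internal)}'b{pattern}: cls = {out_bits}'d{leaf_class[node]};"
            )
            return
        walk(children_left[node], {**path, node: "1"})
        walk(children_right[node], {**path, node: "0"})

    walk(0, {})
    select = ", ".join(f"n{i}" for i in internal)
    lines += ["  always @(*) begin", f"    casex ({{{select}}})", *arms,
              f"      default: cls = {out_bits}'d0;", "    endcase", "  end",
              "endmodule"]
    return "\n".join(lines)


# Hypothetical depth-2 tree in scikit-learn's preorder node layout.
rtl = tree_to_verilog(
    children_left=[1, 2, -1, -1, 5, -1, -1],
    children_right=[4, 3, -1, -1, 6, -1, -1],
    feature=[0, 1, -2, -2, 1, -2, -2],
    threshold=[5, 2, -2, -2, 9, -2, -2],
    leaf_class=[-1, -1, 0, 1, -1, 2, 3],
)
print(rtl)
```

With a fitted classifier, the list arguments would be read from `clf.tree_` and the leaf classes from the per-node class counts; the quantization of thresholds to the chosen bitwidth is left out of this sketch.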

Compared to CMOS baselines (Table IV), EGT-based analog / digital trees are 10× / 1000× larger, and consume 10³× / 10⁶× more energy per inference. Similarly large area and energy differences exist for SVMs. Thus it is unlikely that there exist system design points at which an EGT-based system outperforms a silicon CMOS system in terms of power, performance, or area (PPA); the argument for printing must be made in non-PPA terms (cost, conformality, time-to-market, non-toxicity, etc.).

Finally, envisioned applications for printed classifiers have resilience requirements (against bending, dirt, humidity, wear and tear, etc.). EGTs can be bent reliably to a radius of 10 mm over 100 times with <10% change in electrical characteristics [43]; resilience against dirt and humidity can be provided easily by a printed passivation layer [50]. This level of resilience is adequate for the short-shelf-life applications we target. For higher mechanical and temperature resilience, Kapton films [51] can be used.

VIII. CONCLUSION

A large number of application domains have not seen much penetration of computing due to the cost, conformity, and toxicity limitations of silicon-based computer systems. Recent low-voltage printed computer systems have the potential to address these limitations. A common computational task in these application domains is anticipated to be machine learning classification. In this work, we explored the hardware cost of inference engines for popular classification algorithms in EGT and CNT-TFT printed technologies. We found that Decision Trees and SVMs provide a good balance between accuracy and cost. Subsequently, we evaluated conventional Decision Tree and SVM architectures in these technologies and concluded that their area and power overhead must be reduced for them to be feasible. Then, we explored, through SPICE and gate-level simulations and multiple working prototypes, several printing-specific classifier architectures that exploit the unique cost and implementation tradeoffs in printed technologies. Our evaluations showed that bespoke EGT printed Decision Trees have 48.9× lower area (average) and 75.6× lower power (average) than their conventional equivalents; corresponding benefits for bespoke SVMs are 12.8× and 12.7× respectively. Lookup-based Decision Trees outperformed their non-lookup bespoke equivalents by 38% and 70%; lookup-based SVMs were better by 8% and 0.6%. Analog printed Decision Trees provided 437× area and 27× power benefits over digital bespoke counterparts; analog SVMs yielded 490× area and 12× power improvements. Our results and prototypes demonstrate the feasibility of fabricating and deploying battery-powered and self-powered printed classifiers in the application domains of interest.

IX. ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their feedback and NSF for its partial support of this work.


